[HN Gopher] Trade-Offs in Automatic Differentiation: TensorFlow,...
___________________________________________________________________
Trade-Offs in Automatic Differentiation: TensorFlow, PyTorch, Jax,
and Julia
Author : ChrisRackauckas
Score : 175 points
Date : 2021-12-25 11:50 UTC (11 hours ago)
(HTM) web link (www.stochasticlifestyle.com)
(TXT) w3m dump (www.stochasticlifestyle.com)
| carterschonwald wrote:
| Part of the challenge is that most formulations of
| (reverse-mode) autodiff wind up requiring extra runtime data
| structures for the backward computation step.
|
| There's been some great work in this space in the past 5 years.
|
| I've got some stuff I worked out this fall that I'm overdue to
| write up and share prototypes of: there is a way to do
| reverse-mode autodiff as nothing more than an invisible
| compiler pass, without any of the extra complexity of
| otherwise equivalent formulations.
| The_rationalist wrote:
| https://github.com/breandan/kotlingrad#coroutines
| snicker7 wrote:
| Even though this post's thesis is "trade-offs", it doesn't
| really talk about any technical advantages that Python's AD
| ecosystem (TensorFlow, PyTorch, JAX) has over Julia's
| (Zygote.jl, Diffractor.jl).
| adgjlsfhk1 wrote:
| It touches on the main one, which is simplicity: it's much
| easier to write an AD system for a more static language.
| civilized wrote:
| Maybe it has no technical advantages, unless you count being
| very popular and in an accessible language as a technical
| advantage (which it definitely could be depending on your
| definition of "technical").
|
| Julia is designed for advanced numerical computing and Python
| isn't. The metaprogramming affordances needed for AD are much
| better developed in Julia than they ever will be in Python. And
| let's not forget the immense utility of multiple dispatch in
| Julia, another feature Python will probably never have. So it's
| not surprising that Julia is simply way more capable.
| jaggirs wrote:
| One disadvantage of the language itself is the need for
| compilation, which isn't that fast in my limited experience.
| But I would love to hear how much this affects iteration
| speed.
| calaphos wrote:
| The same issue exists with Jax. XLA compilation can take up
| quite a bit of time, especially on larger NN models. And
| there's no persistent compile cache, so even if you don't
| change the jitted function, you need to wait for compilation
| again whenever you restart the process.
| alevskaya wrote:
| Jax does actually already support a persistent
| compilation cache for TPU, and support for caching GPU
| compiles is being worked on currently.
| civilized wrote:
| Yeah, I'd imagine that for things both Python and Julia can
| do AD-wise, Python may be preferable since it's interpreted
| and thus instant-feedback, but all the numerical heavy
| lifting in packages like Jax and PyTorch is done in fast
| C++. So you should be getting a more appealing environment
| for experimentation without losing out on speed.
| mountainriver wrote:
| The Julia crowd touting multiple dispatch all the time is
| strange; from what I can tell, it's actually one of the main
| reasons the language hasn't had much uptake.
|
| Python is just more approachable and natural to people. Julia
| should learn from that
| civilized wrote:
| Python is just more OO style, so people who have been
| taught OOP in school are comfortable with it. That will
| include the vast majority of generic SWEs writing generic
| CRUD apps.
|
| But personally I find OOP ugly and unnatural, and Julia's
| model elegant and natural. And far more powerful - Julia
| programmers are using multiple dispatch to build out
| scientific computing to a sophistication not seen in any
| other language.
|
| It might not be your cup of tea if you need to see
| object.method() in your code, but if you're more mentally
| flexible and want to build the next generation of technical
| computing tools, Julia is the place to be right now.
| mountainriver wrote:
| Yeah, I'm definitely mentally flexible and have coded in
| many paradigms. I don't love OO and generally don't write
| that way, but multiple dispatch as a primary design
| pattern is odd.
|
| I've tried it for close to a year and the ergonomics
| still felt off; it reminds me of how the Scala crowd
| talked about functional programming, and we've seen how
| that turned out.
|
| I hear this from a lot of people who try Julia, and yet
| the Julia crowd's answer is always that they are dumb.
| Sounds a lot like the Scala crowd...
| civilized wrote:
| I think 90% of the ergonomics issue is that people want
| dot notation and tab-autocomplete in their IDE so they
| can type obj.<tab> and get the methods that operate on
| obj. Which I agree, some version of that should exist,
| and there's no real reason it can't exist in Julia. The
| tooling is just not as mature as other languages.
|
| Julia is far ahead in affordances to write fancy
| technical code and fairly behind in simple things, like
| standard affordances to write more ordinary code, or the
| ability to quickly load in data and make a plot.
|
| I just think it's a misdiagnosis to blame multiple
| dispatch for this issue. It's much more about the Julia
| community prioritizing the needs of their target market.
| cl3misch wrote:
| > fun fact, the Jax folks at Google Brain did have a Python
| source code transform AD at one point but it was scrapped
| essentially because of these difficulties
|
| I assume you mean autograd?
|
| https://github.com/HIPS/autograd
| ChrisRackauckas wrote:
| No, autograd acts similarly to PyTorch in that it builds a tape
| that it reverses while PyTorch just comes with more optimized
| kernels (and kernels that act on GPUs). The AD that I was
| referencing was tangent (https://github.com/google/tangent). It
| was an interesting project but it's hard to see who the
| audience is. Generating Python source code makes things harder
| to analyze, and you cannot JIT compile the generated code
| unless you could JIT compile Python. So you might as well first
| trace to a JIT-compilable sublanguage and do the actions there,
| which is precisely what Jax does. In theory tangent is a bit
| more general, and maybe you could mix it with Numba, but then
| it's hard to justify. If it's more general then it's not for
| the standard ML community for the same reason as the Julia
| tools, but then it had better do better than the Julia tools in the
| specific niche that they are targeting. That generality means
| that it cannot use XLA, and thus from day 1 it wouldn't get the
| extra compiler optimizations that something built on XLA
| (Jax) gets. Jax just makes much more sense for the people who
| were building it; it chose its niche very well.
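The trace-then-compile idea can be sketched in a few lines: operator overloading runs ordinary-looking Python but records each operation into a flat op list that a compiler could then consume. This is a toy illustration of the approach, not Jax's actual machinery (the `Tracer` class and op names are made up):

```python
class Tracer:
    """Records arithmetic into a flat op list instead of computing."""
    tape = []

    def __init__(self, name):
        self.name = name

    def _emit(self, op, other):
        # Name the result and append (out, op, lhs, rhs) to the tape.
        rhs = other.name if isinstance(other, Tracer) else repr(other)
        out = Tracer(f"t{len(Tracer.tape)}")
        Tracer.tape.append((out.name, op, self.name, rhs))
        return out

    def __add__(self, other):
        return self._emit("add", other)

    def __mul__(self, other):
        return self._emit("mul", other)

def f(x):
    return x * x + 3.0  # ordinary-looking Python

Tracer.tape.clear()
out = f(Tracer("x"))
for instr in Tracer.tape:
    print(instr)
# ('t0', 'mul', 'x', 'x')
# ('t1', 'add', 't0', '3.0')
```

Note that only the operations actually executed get recorded, which is exactly why Python-level control flow disappears from the traced sublanguage.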
| brilee wrote:
| FYI - Tangent evolved into TF2's AutoGraph.
| fault1 wrote:
| it's quite interesting how, at least in ML, the transformer
| architecture has 'won out' for the time being; it appears to
| be everywhere these days:
| https://threadreaderapp.com/thread/1468370605229547522.html
|
| the advantage of transformers (computationally) seems to be how
| little sophistication the attention mechanism needs from AD
| systems (and how well it appears to scale with data). it's also a
| very static architecture in terms of a data flow/control flow
| perspective.
|
| as far as I understand, this is far different from systems
| needing to be modeled in continuous time, especially things like
| SDEs. I am curious if things like delay embeddings will ever be
| modeled in terms of mechanisms similar to attention however.
| liuliu wrote:
| As other comments note, CNNs and LSTMs are still in wide use
| today. If you dig deep enough, position encoding doesn't
| really capture time-series information that well.
| blovescoffee wrote:
| Could you elaborate? I've built some Causal CNN's but never
| used a transformer for time series data. What are the
| challenges?
| jowday wrote:
| Outside of research, transformers are rarely used for computer
| vision problems and CNNs remain the go-to architecture. And
| you actually need to do some hacks to get transformers to work
| with computer vision at a meaningful scale (splitting images
| into patches and convolving the patches to produce features to
| feed into the transformer).
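The patch step itself is simple to sketch: chop the image into non-overlapping p×p tiles and flatten each into a vector before any attention is applied. This is a toy pure-Python version (real ViT-style models then project each patch linearly, which is omitted here):

```python
def to_patches(image, p):
    """Split an H x W image (list of rows) into flattened p x p patches."""
    h, w = len(image), len(image[0])
    patches = []
    for i in range(0, h, p):
        for j in range(0, w, p):
            patch = [image[i + di][j + dj]
                     for di in range(p) for dj in range(p)]
            patches.append(patch)
    return patches

image = [[r * 4 + c for c in range(4)] for r in range(4)]  # 4x4 "image"
patches = to_patches(image, 2)
print(len(patches), len(patches[0]))  # 4 patches, each flattened to length 4
```

Each flattened patch then plays the role of one "token" in the transformer's input sequence.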
| fault1 wrote:
| > some hacks to get transformers to work with computer vision
| at a meaningful scale (splitting images into patches and
| convolving the patches to produce features to feed into the
| transformer).
|
| sounds a lot like 'classical computer vision'. e.g, when I
| learned the subject (mid 2000s), topological features were
| all the rage: https://en.wikipedia.org/wiki/Digital_topology
| xiphias2 wrote:
| It will be interesting when Tesla and Waymo move to a
| transformer architecture, but as you wrote, my guess is that
| it's not yet in production for vision tasks.
| jowday wrote:
| I'm not sure they will, at least not with the research in
| the state it is presently. Researchers are interested in
| vision transformers because they're competitive with CNNs
| if you give them enough training data - they don't
| drastically outperform them.
|
| Right now switching over to them would require a ton of
| code changes, relearning intuitions, debugging, profiling,
| etc. for not a ton of benefit.
| xiphias2 wrote:
| Sure, I think the same, but the tweets came from Andrej
| Karpathy, and he's watching this space like an eagle.
| liuliu wrote:
| Tesla did, as mentioned in their AI Day. It is not a full
| transformer (i.e. ViT). They use a transformer decoder to
| synthesize data from different cameras and decode 3D
| coordinates directly (as in DETR).
| xiphias2 wrote:
| Thanks, sounds great, I'll read the DETR paper
| joconde wrote:
| I've looked into transformers for semantic segmentation, but
| the patching aspect seems to make it hard too. Do you have
| some sources that describe these hacks in detail?
| lowdose wrote:
| You could do a code search on GitHub. I'm pretty lazy when it
| comes to coding; I always seem to find a repo that has
| implemented an MVP of what I already had in mind. There are
| some gold nuggets on GitHub, like Google's DDSP
| implementation, which they published academically under
| anonymity.
| The_rationalist wrote:
| Kotlingrad makes any autodiff library look pale in comparison.
| mark_l_watson wrote:
| Interesting read. I was very disappointed when the Swift
| TensorFlow project withered away. A good general purpose
| programming language combined with deep learning seemed like a
| great idea. Apple actually provides a good dev experience with
| Swift and CoreML (for fun I wrote a Swift/SwiftUI/CoreML app that
| uses two deep learning models that is in the App Store).
|
| Wolfram Language takes an approach similar to Apple's in
| providing a good number of pre trained models, but I haven't yet
| discovered any automatic differentiation examples.
|
| Of the frameworks described in the article, I find Julia most
| interesting but I need to use Python and TensorFlow in my work.
| ChrisRackauckas wrote:
| I remember reading an early S4TF manifesto talking about
| natural language processing, image processing, etc., and
| thinking: that cannot be your audience, because that audience
| already has
| AD systems which support their domain. Building something that
| is more general for the sake of being more general is never a
| good idea, that's bad engineering. It had a lot of great ideas,
| and indeed the dev experience seemed nice. But I would venture
| to guess that the Google overlords had to question what the
| true value of S4TF was in that light. "Standard ML" cannot be your
| target audience if you want to work on AD extensions.
|
| Following this thread, you can also see how the Julia
| tools evolved. If you see the paper that was the synthesis for
| Zygote.jl, it was all about while loops and scalar operations
| (https://arxiv.org/abs/1810.07951). Why did that not completely
| change ML? Well, ML doesn't use those kinds of operations. I
| would say the project kind of started as a "tool looking for a
| problem". It did get a bit lucky that it found a problem:
| scientific applications need to be able to use automatic
| differentiation without rewriting the whole codebase to an ML
| library, leading to the big Julia AD manifesto of language-wide
| differentiable programming by directly acting on the Julia
| source itself rather than a language subset
| (http://ceur-ws.org/Vol-2587/article_8.pdf). Zygote was a good
| AD, but not a
| great AD, why? Because it could not hit this goal, mostly
| because of its lack of mutation handling. Yes, it does handle
| standard ML just fine, but does not justify its added
| complexity.
|
| What has actually kept Julia AD research going is that some
| scientific machine learning applications, specifically physics-
| informed neural networks (PINNs), require very high order
| derivatives. For example, to solve the PDE u_t = u_xx with
| neural networks, you need to take the third derivative of the
| neural network. With Jax this can only be done with a separate
| language subset (https://openreview.net/pdf?id=SkxEF3FNPH), and
| thus a new AD for Julia to replace Zygote, known as
| Diffractor.jl, was devised to automatically incorporate higher
| order AD optimizations as part of the regular usage
| (https://www.youtube.com/watch?v=mQnSRfseu0c). It is these PINN
| SciML applications that have funded its development and are its
| built-in audience: it solves a problem nothing else does, even
| if it is potentially niche. Similarly with Enzyme, it solved
| the problem of how to do mutation well, which is where you can
| see in the paper that its applications are mostly ODE and PDE
| solvers (Euler, RK4, the Bruss semilinear PDE)
| (https://proceedings.neurips.cc/paper/2020/file/9332c513ef44b...).
| Torchscript
| and Jax do not handle this domain well, so it has an audience,
| which may (or may not?) be niche.
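The higher-order requirement can be felt even in a toy setting: nesting forward-mode dual numbers three deep yields a third derivative. This is a minimal sketch of that idea only (it is nothing like Diffractor's actual compiler-level approach, and `Dual`/`d` are hypothetical names):

```python
class Dual:
    """Truncated dual number a + b*eps; nests to give higher derivatives."""
    def __init__(self, val, eps=0.0):
        self.val, self.eps = val, eps

    def _lift(self, o):
        return o if isinstance(o, Dual) else Dual(o)

    def __add__(self, o):
        o = self._lift(o)
        return Dual(self.val + o.val, self.eps + o.eps)
    __radd__ = __add__

    def __mul__(self, o):
        # Product rule on the eps coefficient.
        o = self._lift(o)
        return Dual(self.val * o.val,
                    self.val * o.eps + self.eps * o.val)
    __rmul__ = __mul__

def d(f):
    """Derivative operator: d(f) is a function computing f'."""
    def df(x):
        y = f(Dual(x, 1.0))
        return y.eps if isinstance(y, Dual) else 0.0
    return df

cube = lambda x: x * x * x
print(d(cube)(2.0))        # 12.0  (3x^2 at x=2)
print(d(d(d(cube)))(2.0))  # 6.0   (third derivative of x^3)
```

The catch, as the post discusses, is that naively nesting like this multiplies the work at each order; the point of a Diffractor-style design is to let the compiler optimize across those nesting levels.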
|
| A big part of writing this blog post was to highlight this to
| the Julia AD crew that I regularly work with. What will keep
| these projects alive is understanding the engineering trade-
| offs that are made and who the audience is. The complexity has
| a cost so it better have a benefit. If that target is lost, if
| any benefit is a theoretical "but you may need more features
| some day", then the projects will lose traction. The project
| needs to be two-fold: identify new architectures and
| applications that would benefit from expanded language support
| from AD and build good support for those projects. Otherwise it
| is just training a transformer in Julia vs training a
| transformer in Python, and that is not justifiable.
| p1esk wrote:
| My impression from your comment is that you don't care that
| much about "standard" ML users. As a "standard" ML user
| (pytorch/jax), and a potential Julia user in the future, this
| is not what I like to hear.
| borodi wrote:
| The idea, I imagine, is to differentiate what the Julia ML
| stack offers from what is already in Python. If it offers the
| same thing, but without the funding from Facebook or Google,
| why bother switching? It has to offer something more.
| The_rationalist wrote:
| Note however that Kotlin's differentiable programming work
| is backed by Facebook
| https://ai.facebook.com/blog/paving-the-way-for-software-20-...
| and that there is a mature library for autodiff
| https://github.com/breandan/kotlingrad
| 2sk21 wrote:
| I am really glad to hear this. As it happens, my main
| post-retirement project has been learning Swift to try out
| CoreML. I'm really enjoying it so far.
| hnarayanan wrote:
| This is really interesting to me. Could you please share your
| learning pathway?
| albertzeyer wrote:
| This misses some discussion on tf.function which does a Python
| AST level transformation to a static TF computation graph,
| including dynamic control flow like loops and conditional
| branches.
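For a feel of what an AST-level transformation does (as opposed to tracing), here is a toy pass that rewrites a function's source so every `a + b` becomes a call into a recording runtime. It only gestures at the mechanism tf.function/AutoGraph uses; the `add`/`RewriteAdd` names and the recorded-list "graph" are invented for illustration:

```python
import ast

recorded = []  # stand-in for an op graph being built

def add(a, b):
    recorded.append((a, b))  # "emit" the op, then run it
    return a + b

class RewriteAdd(ast.NodeTransformer):
    """Rewrite every `a + b` into `add(a, b)` at the AST level."""
    def visit_BinOp(self, node):
        self.generic_visit(node)  # rewrite nested operands first
        if isinstance(node.op, ast.Add):
            return ast.Call(func=ast.Name(id="add", ctx=ast.Load()),
                            args=[node.left, node.right], keywords=[])
        return node

src = """
def f(x, y):
    return x + y + 1
"""

tree = ast.fix_missing_locations(RewriteAdd().visit(ast.parse(src)))
ns = {"add": add}
exec(compile(tree, "<ast>", "exec"), ns)

print(ns["f"](2, 3))  # 6
print(recorded)       # [(2, 3), (5, 1)]
```

Because the rewrite happens on source structure rather than on executed values, it can also see `if`/`while` nodes and turn them into graph-level control flow, which is the part tracing cannot do.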
| chillee wrote:
| Tf.function is largely morally equivalent to Torchscript, which
| he does discuss.
| spacetracks wrote:
| "The second factor, and probably the more damning one, is that
| most ML codes don't actually use that much dynamism." I would
| argue that this is true precisely because it is not available in
| an AD system. When I tell friends and coworkers about what zygote
| can do they light up and start describing different use cases
| they have that could benefit from AD. Diff eq solving is a big
| one.
| KKKKkkkk1 wrote:
| This is because continuous optimization is useless when
| crossing a discontinuity, which is what control flow creates.
| Even in a trivial situation like ReLU, where the control flow
| is mimicking a continuous transition, you have the "dead ReLU"
| problem, where you have to start training on the correct side
| of the discontinuity and make sure to never cross.
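The dead-ReLU point fits in a few lines: once the pre-activation is on the negative side for every input, the gradient is exactly zero and gradient descent never moves the weight again. A toy single-weight example with made-up numbers:

```python
def relu(z):
    return max(z, 0.0)

def grad_w(w, x, target):
    # d/dw of 0.5 * (relu(w*x) - target)^2; exactly zero on the dead side.
    z = w * x
    if z <= 0.0:
        return 0.0
    return (z - target) * x

w = -0.5                 # starts on the wrong side of the kink
for _ in range(1000):
    w -= 0.1 * grad_w(w, x=1.0, target=2.0)

print(w)                 # still -0.5: the unit is "dead" and never recovers
```

Starting instead at any w > 0 with this data, the same loop converges toward w = 2, which is the "correct side of the discontinuity" point from the comment above.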
| ogogmad wrote:
| I don't know whether this belongs here, but...
|
| Formally, there is a generalisation of differentiation which
| can handle functions like ReLU (i.e. locally Lipschitz non-
| differentiable functions) by allowing a derivative to be set-
| valued. It's called the _Clarke gradient_. The Clarke
| gradient of ReLU at 0 is the closed interval [0,1]. Note that
| the Clarke gradient doesn't satisfy the chain rule (except
| in a weakened form) which might seriously mess up some
| assumptions about autodiff. Is this generalised derivative
| useful in autodiff?
|
| I imagine that this is a largely theoretical tool that's
| useful in analysing algorithms but useless for actually
| computing things.
| medo-bear wrote:
| i haven't heard of the clarke gradient before. convex
| analysis has something called the subgradient [0], is it
| different?
|
| [0] https://en.m.wikipedia.org/wiki/Subderivative
| ogogmad wrote:
| The subgradient in convex analysis is a special case of
| the Clarke gradient. The subgradient is precisely the
| Clarke gradient for convex functions. Convex functions
| are always locally Lipschitz except in weird cases.
|
| [edit]
|
| Question: Are there numerical applications in which the
| subgradient is actually computed, or is it a purely
| analytical tool?
| agnosticmantis wrote:
| (Stochastic) subgradient methods are used in practice to
| optimize non-differentiable convex functions. They have a
| slower convergence rate than (stochastic) gradient
| descent though.
|
| See for example:
| https://www.stat.cmu.edu/~ryantibs/convexopt-F15/lectures/07...
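A minimal sketch of the subgradient method on f(x) = |x|: away from 0 the subgradient is sign(x), at 0 any value in [-1, 1] is valid (0 is chosen here), and the 1/k step sizes are what give the slower convergence rate mentioned above.

```python
def subgradient_abs(x):
    # One valid subgradient of |x| at each point.
    if x > 0:
        return 1.0
    if x < 0:
        return -1.0
    return 0.0   # any value in [-1, 1] would do at the kink

x = 5.0
for k in range(1, 10001):
    x -= (1.0 / k) * subgradient_abs(x)   # diminishing step sizes

print(abs(x) < 0.1)   # True: slowly shuffles toward the minimum at 0
```

Unlike gradient descent on a smooth function, the iterates oscillate around the kink rather than settling, which is why diminishing step sizes are required for convergence.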
| fault1 wrote:
| yes, see: https://juliadiff.org/ChainRulesCore.jl/dev/mat
| hs/nondiff_po...
|
| and related neat usages in set based optimization methods
| in MathOptInterface (part of JuMP.jl):
| https://matbesancon.xyz/post/2020-12-24-chains_sets2/
| sockfish wrote:
| I often wonder why so much effort is being put into shoehorning
| everything into a single language. Wouldn't it make much more
| sense to use a fully differentiable DSL for machine learning /
| xla, then call it from whatever host language you use? This
| approach has worked really well for SQL for the past couple of
| decades.
| woadwarrior01 wrote:
| You might like dex[1].
|
| [1]: https://github.com/google-research/dex-lang
| niklasd wrote:
| Has it worked really well? I feel ORMs are a sign it hasn't.
| Though I really enjoy having learned SQL and being able to
| interact with almost all relational databases.
| handzhiev wrote:
| Imo ORM is mostly a sign that (for some odd reason) many
| developers don't want to learn / use SQL.
|
| But what actual problem is ORM solving beyond that?
| viraptor wrote:
| They solve 99% of your queries in much less time while
| allowing you to drop down to SQL when you really want/need
| it.
| dnautics wrote:
| ORMs solve "don't accidentally introduce an SQL injection"
| mbStavola wrote:
| So do prepared statements
| baq wrote:
| Boilerplate. Writing serializers and deserializers by hand
| is not an efficient use of developer time.
|
| Related to ORMs, but not quite on topic - query building.
| Type checked queries, parts of which can be passed around
| business logic, are very powerful and flexible.
| ithkuil wrote:
| There are more and more libraries that let you write SQL and
| bind the results into native records (objects, structs) in
| the host language. I find it an interesting middle ground.
| vkkhare wrote:
| What do people think of the automatic differentiation support
| Facebook was building for Kotlin?
|
| They called it differentiable programming
| https://ai.facebook.com/blog/paving-the-way-for-software-20-...
| secondcoming wrote:
| This [0] video has quite an interesting proposal for adding
| AD to compilers.
|
| [0] https://youtu.be/1QQj1mAV-eY
___________________________________________________________________
(page generated 2021-12-25 23:00 UTC)