[HN Gopher] PyTorch 2.0
___________________________________________________________________
PyTorch 2.0
Author : lairv
Score : 240 points
Date : 2022-12-02 16:17 UTC (6 hours ago)
(HTM) web link (pytorch.org)
(TXT) w3m dump (pytorch.org)
| singularity2001 wrote:
| Not available for Mac M1 yet (?)
| amelius wrote:
  | One thing I'm noticing lately is that these DL libraries and
  | their supporting libraries are getting unwieldily large and
  | difficult to version-manage.
|
| In my mind, DL is doing little more than performing some inner-
| products between tensors, so I'm curious why we should have
| libraries such as libcudnn, libcudart, libcublas, torch, etc.
| containing gigabytes of executable code. I just checked and I
| have 2.4GB (!!) of cuda-related libraries on my system, and this
| doesn't even include torch.
|
| Also, going to a newer minor version of e.g. libcudnn might cause
| your torch installation to break. Why isn't this backward
| compatible?
| modeless wrote:
| The complexity of deep learning algorithms is low but the
| complexity of the _hardware_ is high. The problem solved by
| these gigabytes of libraries is getting peak utilization for
| simple algorithms on complex and varied hardware.
|
| CuDNN is enormous because it embeds precompiled binaries of
| many different compute kernels, times many variations of each
| kernel specialized for different data sizes and/or fused with
| other kernels, and again times several different GPU
| architectures.
|
| If you don't care about getting peak utilization of your
| hardware you can run state of the art neural nets with a truly
| tiny amount of code. The algorithms are so simple you don't
| even need any libraries, it's easy enough to write everything
| from scratch even in low level languages. It's a fun exercise.
| But it will be many orders of magnitude less efficient so
| you'll have to wait a really long time for it to run.
| amelius wrote:
| Ok, is there any way to trim down the amount of code used
| without reducing the performance of my particular
| application, and my particular machine?
|
| I have the feeling that it's an all-or-nothing proposition.
| Either you have a simple CPU-only algorithm, or you have
| several gigabytes of libraries you don't really need.
|
| Also, in some applications I would be willing to give up 10%
| of performance if I could reclaim 90% of space.
| modeless wrote:
        | CuDNN is only for Nvidia GPUs, and those machines generally
        | have decent-sized disks and decent network connections, so
        | nobody cares about a few GBs of libraries. There are
| alternatives to using CuDNN with much smaller binary size.
| Maybe they can match or beat it or maybe not, depending on
| your model and hardware. But you'll have to do your own
| work to switch to them, since most people are happy enough
| with CuDNN for now.
|
| The real problem with deep learning on Nvidia is the Linux
| driver situation. Ugh. Hopefully one day they will come to
| their senses.
| amelius wrote:
| It's not just disk size. Also memory size, and loading
| speed.
|
| Yes, I agree about the driver situation.
| modeless wrote:
| The disk size of the shared library is not indicative of
| RAM usage. Shared libraries are memory mapped with demand
| paging. Only the actually used parts of the library will
| be loaded into RAM.
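The demand-paging behaviour described above can be sketched with Python's stdlib `mmap` (the sparse temp file here stands in for a large shared library; only the pages actually touched are faulted into RAM by the OS):

```python
import mmap
import os
import tempfile

# A 16 MiB file standing in for a multi-gigabyte shared library.
path = os.path.join(tempfile.mkdtemp(), "fakelib.so")
with open(path, "wb") as f:
    f.seek(16 * 1024 * 1024 - 1)  # written sparsely
    f.write(b"\x00")

with open(path, "rb") as f:
    mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    first_page = mm[:4096]  # only the touched page(s) get loaded
    mm.close()
```

This is the same mechanism the dynamic loader uses for `.so` files, which is why on-disk size overstates resident memory use.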
| claytonjy wrote:
| While I think you raise important points about the dominance
| of hardware optimizations, I think you're massively
| overstating the simplicity of the algorithms.
|
| Sure, it's easy to code the forward pass of a fully connected
| neural network, but writing code to train a useful modern
| architecture is a very different endeavor.
| brrrrrm wrote:
| I disagree, the burden is almost exclusively maintaining
| fast implementations of primitive operators for all
| hardware. These ML libraries are collections of pure
| functions with minimal interfaces. There's very little code
| interdependence and it's not particularly difficult to
| implement modern algorithms to train networks.
|
| full stable diffusion in <800 lines: https://github.com/geo
| hot/tinygrad/blob/4fb97b8de0e210cc3778...
|
| autograd in <30 lines: https://github.com/geohot/tinygrad/b
| lob/4fb97b8de0e210cc3778...
|
| Adam in <20 lines: https://github.com/geohot/tinygrad/blob/
| 4fb97b8de0e210cc3778...
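For a sense of scale, a scalar reverse-mode autograd really does fit in about 30 lines. This is a micrograd-style sketch, not tinygrad's actual code:

```python
class Value:
    # Minimal scalar reverse-mode autograd node.
    def __init__(self, data, parents=()):
        self.data, self.grad = data, 0.0
        self._parents, self._backward = parents, lambda: None

    def __add__(self, other):
        out = Value(self.data + other.data, (self, other))
        def backward():
            self.grad += out.grad          # d(a+b)/da = 1
            other.grad += out.grad         # d(a+b)/db = 1
        out._backward = backward
        return out

    def __mul__(self, other):
        out = Value(self.data * other.data, (self, other))
        def backward():
            self.grad += other.data * out.grad   # d(a*b)/da = b
            other.grad += self.data * out.grad   # d(a*b)/db = a
        out._backward = backward
        return out

    def backward(self):
        # Topological sort, then apply the chain rule output-first.
        topo, seen = [], set()
        def build(v):
            if v not in seen:
                seen.add(v)
                for p in v._parents:
                    build(p)
                topo.append(v)
        build(self)
        self.grad = 1.0
        for v in reversed(topo):
            v._backward()

x, y = Value(3.0), Value(4.0)
z = x * y + x  # dz/dx = y + 1 = 5, dz/dy = x = 3
z.backward()
```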
| modeless wrote:
| I disagree. I mean it's not trivial but it is completely
| within reach of a single person. The only part you'd really
| need to lean on libraries for would be data loading (e.g.
| jpeg). The core neural net stuff really is not that
| complex, even in the latest architectures like transformers
| or diffusion models. Look at stuff like George Hotz's
| tinygrad or Andrej Karpathy's makemore.
| poorman wrote:
  | Still no support for/from Apple Silicon?
| cube2222 wrote:
| It's supported since 1.12[0], no?
|
| [0]: https://pytorch.org/blog/introducing-accelerated-pytorch-
| tra...
| chadykamar wrote:
| It's also officially in beta as of 1.13
| https://pytorch.org/blog/PyTorch-1.13-release/#beta-
| support-...
| [deleted]
| hintymad wrote:
| A big lesson I learned from PyTorch vs other frameworks is that
| productivity trumps incremental performance improvement. Both
  | Caffe and MXNet marketed themselves as being fast, yet
  | apparently being faster here and there by some percentage simply
| didn't matter that much. On the other hand, once we make a system
| work and make it popular, the community will close the
| performance gap sooner than competitors expect.
|
| Another lesson is probably old but worth repeating: investment
  | and professional polishing matter for open source projects. Meta
| reportedly had more than 300 (?) people working on PyTorch,
| helping the community resolve issues, producing tons of high-
| quality documentation and libraries, and marketing themselves
| nicely in all kinds of conferences and media.
| wging wrote:
| > Meta reportedly had more than 300 (?) people working on
| PyTorch,
|
| How much did this change after the big Meta layoffs? I think I
| know people who are no longer there, but I haven't talked to
| them about it yet.
| PartiallyTyped wrote:
| NB: PyTorch is now under the Linux Foundation.
|
| https://news.ycombinator.com/item?id=32810976
| not2b wrote:
| Because Meta open sourced it, and because PyTorch caught on,
| hopefully some of those laid off people can continue to work
| on it and also market themselves as PyTorch experts.
| SleekEagle wrote:
| Exactly, especially in the age of ridiculously rapid
| development that we have found ourselves in over the past few
    | years. This is exactly why TensorFlow is dying.
| la_fayette wrote:
      | :) It is still impossible to bring a CNN/RNN network in
      | PyTorch to mobile, which works fine with tflite...
| levesque wrote:
| I don't think that's a problem for the vast majority of
| pytorch users
| vikinghckr wrote:
| What I really find interesting here is that PyTorch, a library
| maintained by Facebook, is winning the marketshare and
| mindshare due to clean API, whereas Tensorflow, maintained by
| Google, is losing due to inferior API. In general, Google as a
| company emphasizes code quality and best practices far more
| than Facebook. But the story was reversed here.
| miohtama wrote:
| An interesting question. Maybe Google lacks the culture to
| work with external developers (think Android) while Facebook
| has some of it.
| version_five wrote:
    | Pure speculation, but isn't Google the king of "beta" releases
    | that demonstrate a concept but mostly end up with a 90%
    | solution that doesn't meet the mark for a finished project?
|
| I'm sure there are other factors as well, but it doesn't
| surprise me that Google made something that started off
| promising and then underdelivered
| marban wrote:
| From what I see, still no 3.11 support -- Same for Tensorflow
| which won't ship before Q1 23.
| joelfried wrote:
| Who is still supporting Windows 3.11?
| sairahul82 wrote:
| It's python 3.11 :)
| robertlagrant wrote:
| Guido works for Microsoft, so Python 95 will be out soon.
| sgt wrote:
| Speaking of Guido, be sure to check out the podcast with
| Lex Fridman. Guido is such a down to Earth guy.
| itgoon wrote:
| I'm going to skip Python ME.
| minimaxir wrote:
    | It's not a huge deal, as the speed improvements in 3.11 likely
    | wouldn't trickle down to the core PyTorch level.
| black3r wrote:
| but if you do some data pre-processing or post-processing in
| python, that would be affected by 3.11 speed improvements...
| or if you have a pytorch based model integrated into a bigger
| application as just one of many features, there are still
| some devs who prefer monoliths over microservices....
| ansk wrote:
| So this looks like a further convergence of the tensorflow and
| pytorch APIs (the lower-level APIs at least). Tensorflow was
| designed with compilable graphs as the primary execution model
| and as part of their 2.0 release, they redesigned the APIs to
| encompass eager execution as well. Pytorch is coming from the
| other end, with eager execution being the default and now
| emphasizing improved tools for graph compilation in their 2.0
| release. The key differentiator going forward seems to be that
| tensorflow is using XLA as their compiler and pytorch is
| developing their own toolset for compilation. As someone who
| cares far more about performance than API ergonomics, the quality
| of the compiler is the main selling point for me and I'll gladly
| switch over to whatever framework is winning in the compiler
| race. Does anyone know of any benchmarks comparing the
| performance of pytorch's compilers with XLA?
| amelius wrote:
| How much performance can be squeezed from going from the plain
| python API to the graph-based solution, typically?
| ansk wrote:
| This varies quite a bit based on the type of model. The
| graph-based approach has two benefits: (1) removing overhead
| from executing python between operations and (2) enabling
| compilers to make optimizations based on the graph structure.
| The benefit from (1) is relatively modest for models which
| run a few large ops in series (e.g. image classifiers and
| most feedforward models) but can be significant for models
| with many ops that are smaller and not necessarily wired up
| sequentially (e.g. RNNs). In my experience, I've had RNN
| models run several times faster in tensorflow's graph mode
| than in its eager mode. The benefit from (2) is significant
| in almost any model since the typical "layer" building block
| (matmul/conv/einsum->bias->activation) can be fused together
| which improves throughput on GPUs. In my experience
| compilation can offer performance increases from 1.5x to 3x,
| but I don't know if this holds generally. Also note that the
| distinction between graph and eager execution can be somewhat
| blurry, as even an "eager" API could be calling a fused layer
| under the hood.
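The fusion benefit in (2) can be sketched in pure Python: the unfused version makes three passes over the data and materializes two intermediate buffers (analogous to launching three separate GPU kernels), while the fused version makes one pass. The numbers are toy data:

```python
# Unfused: three passes, two intermediate buffers.
def layer_unfused(xs, w, b):
    t1 = [x * w for x in xs]           # matmul-like step
    t2 = [t + b for t in t1]           # bias add
    return [max(0.0, t) for t in t2]   # activation (ReLU)

# Fused: one pass, no intermediates -- the form a graph compiler
# can emit once it sees the whole matmul->bias->activation pattern.
def layer_fused(xs, w, b):
    return [max(0.0, x * w + b) for x in xs]

out = layer_fused([1.0, -2.0, 3.0], 2.0, -1.0)
```

On a GPU the win comes from avoiding round-trips to memory for the intermediates, not from saving Python overhead, but the structure of the transformation is the same.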
| PartiallyTyped wrote:
| It depends.. Using Jax to compile down to XLA, I often saw >2
| orders of magnitude improvements. This however was roughly 6
| months ago.
| chazeon wrote:
  | How does PyTorch compare to JAX and its stack?
| PartiallyTyped wrote:
| I find that Jax tends to result in messy code unless you build
| good abstractions. I personally don't like Flax and Haiku, I
| prefer stax and Equinox as they are more transparent on what is
| happening, feel a lot less like magic, and more pythonic
| (explicit is better than implicit etc).
|
| PyTorch is far more friendly for deep learning stuff, but
| sometimes all you want is pure numerical computations that can
| be vmapped across tensors, and this is where jax shines imho.
|
| Personal Example: I needed to sample a bunch of datapoints,
| make distributions out of them, sample, and then compute the
| density of each sample across distributions. Doing this with
| pytorch was rather slow, I was probably doing something wrong
| with vectorization and broadcasting, but I didn't have the time
| to figure it out.
|
| With jax, I wrote a function that produces the samples, then I
| vmapped the evaluation of a sample across all distributions,
| then vmapped over all samples. Took a couple of minutes to
| implement and seconds to execute.
|
| PyTorch also has the advantage of a far more mature ecosystem,
| libraries like Lightning, Accelerate, Transformers, Evaluate,
| and so on make building models a breeze.
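The double-vmap pattern described in the example above, written as plain Python loops (in JAX, each comprehension would become a `jax.vmap` and both loops vanish into one vectorized computation). The distributions and samples here are made up:

```python
import math

def normal_logpdf(x, mu, sigma):
    return (-0.5 * ((x - mu) / sigma) ** 2
            - math.log(sigma * math.sqrt(2.0 * math.pi)))

mus, sigmas = [0.0, 1.0, 2.0], [1.0, 0.5, 2.0]  # three distributions
samples = [0.3, 1.7, -0.2, 2.5]

# Inner comprehension = "vmap over distributions";
# outer comprehension = "vmap over samples".
log_densities = [[normal_logpdf(x, m, s) for m, s in zip(mus, sigmas)]
                 for x in samples]
```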
| zone411 wrote:
| > Personal Example: I needed to sample a bunch of datapoints,
| make distributions out of them, sample, and then compute the
| density of each sample across distributions. Doing this with
| pytorch was rather slow, I was probably doing something wrong
| with vectorization and broadcasting, but I didn't have the
| time to figure it out.
|
| You probably were not doing anything wrong. I spent a lot of
| time trying to be clever in order to parallelize things like
| this and it just wasn't possible without doing CUDA
| extensions. But it is now! PyTorch now has vmap through
| functorch and it works.
| staunch wrote:
| PyTorch and JAX are both open-source libraries for developing
| machine learning models, but they have some important
| differences. PyTorch is a more general-purpose library that
| provides a wide range of functionalities for developing and
| training machine learning models. It also has strong support
| for deep learning and is used by many researchers and companies
| in production environments.
|
| JAX, on the other hand, is designed specifically for high-
| performance machine learning research. It is built on top of
| the popular NumPy library and provides a set of tools for
| creating, optimizing, and executing machine learning algorithms
| with high performance. JAX also integrates with the popular
| Autograd library, which allows users to automatically
| differentiate functions for training machine learning models.
|
| Overall, the choice between PyTorch and JAX will depend on the
| specific requirements and goals of the project. PyTorch is a
| good choice for general-purpose machine learning development
| and is widely used in industry, while JAX is a better choice
| for high-performance research and experimentation.
|
| https://chat.openai.com/chat
| satvikpendem wrote:
| It seems to use the same type of template for comparisons:
|
| React and Vue are both JavaScript libraries for building user
| interfaces. The main difference between the two is that React
| is developed and maintained by Facebook, while Vue is an
| independent open-source project.
|
| React uses a virtual DOM (Document Object Model) to update
| the rendered components efficiently, while Vue uses a more
| intuitive and straightforward approach to rendering
| components. This makes Vue easier to learn and use,
| especially for developers who are new to front-end
| development.
|
| React also has a larger community and ecosystem, with a wider
| range of available libraries and tools. This can make it a
| better choice for larger, more complex projects, while Vue
| may be a better fit for smaller projects or teams that prefer
| a more lightweight and flexible approach.
|
| Overall, the choice between React and Vue will depend on your
| specific project requirements and personal preferences. It's
| worth trying out both to see which one works better for you.
| whimsicalism wrote:
| I was reading this and thinking it was a pretty terrible
| answer - glad it is just generated by an AI and not you
| personally so I'm not insulting you.
|
| JAX is basically numpy on steroids and lets you do a lot of
| non-standard things (like a differentiable physics simulation
| or something) that would be harder with Pytorch.
|
| They are both "high-performance."
|
| Pytorch is more geared towards traditional deep learning and
| has the utilities and idioms to support it.
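A minimal example of the "numpy on steroids" workflow being described: an ordinary numpy-style function that JAX can differentiate and JIT-compile unchanged (the function itself is arbitrary; requires `jax` installed):

```python
import jax
import jax.numpy as jnp

# A plain numpy-style function...
def f(x):
    return jnp.sum(x ** 2)

# ...differentiated and compiled to XLA without modification.
grad_f = jax.jit(jax.grad(f))
g = grad_f(jnp.array([1.0, 2.0, 3.0]))  # d/dx sum(x^2) = 2x
```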
| brap wrote:
| I'm not sure why, but I realized it was AI from the very
| first sentence, not exaggerating. It's just not something
| someone on HN would write.
| eastWestMath wrote:
| It reminded me of the sort of lazy Wikipedia
| regurgitation that a lot of undergrads used to give when
| I was teaching. So it is a bit jarring to see a response
| like that in a non-compulsory setting.
| windsignaling wrote:
| Yup. Reminds me of an article you'd find in the top 10
| Google search results...
| dekhn wrote:
| jax is not numpy on steroids. jax is "use python
| idiomatically to generate optimized XLA code for evaluating
| functions both forward and backward."
| whimsicalism wrote:
| Probably the primary use of jax is `jax.numpy` which is
| XLA accelerated and differentiable numpy.
|
| I'll admit that saying "basically numpy on steroids"
| might have been an overreduction. It is a system for
| function transformations that is built on XLA and
| oriented towards science & ML applications.
|
| It's not just me saying stuff like this.
|
| Francois Chollet (creator of Keras): "[jax is] basically
| Numpy with gradients. And it can compile to XLA, for
| strong GPU/TPU acceleration. It's an ideal fit for
| researchers who want maximum flexibility when
| implementing new ideas from scratch."
| dekhn wrote:
| Yes- and that gradient part is a key detail that makes it
| more than "numpy on steroids". numpy on steroids would be
| a hardware accelerator that took numpy calls and made
| them return more quickly, but without the command-and-
| control and compile-python-to-xla aspects.
| whimsicalism wrote:
| Well clearly I meant steroids of the gradient-developing
| variety.
|
| I think you are being far too pedantic about what a
| biological compound would analogously do to a software
| library, especially given that I mention the
| differentiability property in the same sentence you are
| taking issue with.
| dekhn wrote:
| OK, actually as long as it's gradient-developing
| steroids, I'll allow it.
| uoaei wrote:
| Can someone comment more on what makes JAX that much better
| for differentiable simulations than PyTorch?
|
| I'm working on a new module for work and none of my
| colleagues have much experience developing ML per se. I'm
| trying to decide whether to force their hand by
| implementing v1 in PyTorch or JAX and differentiable
| physics simulations is a likely future use case. Why is
| PyTorch harder?
| patrickkidger wrote:
| At least prior to this announcement: JAX was much faster
| than PyTorch for differentiable physics. (Better JIT
| compiler; reduced Python-level overhead.)
|
| E.g for numerical ODE simulation, I've found that Diffrax
| (https://github.com/patrick-kidger/diffrax) is ~100 times
| faster than torchdiffeq on the forward pass. The backward
| pass is much closer, and for this Diffrax is about 1.5
| times faster.
|
| It remains to be seen how PyTorch 2.0 will compare, of
| course!
|
| Right now my job is actually building out the scientific
| computing ecosystem in JAX, so feel free to ping me with
| any other questions.
| adgjlsfhk1 wrote:
| If you care about performance of differential physics you
| shouldn't use python. Diffrax is almost OKish, but is
| missing a ton of features (e.g. good stiff solvers,
| arbitrary precision support, events for anything other
| than stopping the simulation, ability to control the
| linear solve which are needed for large problems). For
| simple cases it can come close to the C++/Julia solvers,
| but for anything complicated, you either won't be able to
| formulate the model, or you won't be able to solve it
| efficiently.
| patrickkidger wrote:
| > If you care about performance
|
| This definitely isn't true. On any benchmark I've tried,
| JAX and Julia basically match each other. Usually I find
| JAX to be a bit faster, but that might just be that I'm a
| bit more skilled at optimising that framework.
|
| Anyway I'm not going to try and debunk things point-by-
| point, I'd rather avoid yet another unpleasant Julia
| flame-war.
| chazeon wrote:
| I have seen JAX-MD[1] but not sure about "much better".
| On the other hand, there is just no MD implemented with
| PyTorch.
|
| [1]: https://github.com/jax-md/jax-md
| whimsicalism wrote:
| Because the `jax.numpy` operations & primitives are
| almost 1:1 with numpy, many working scientists who
| already have experience working with numpy will be able
| to figure out jax faster.
|
| It is also easier to rewrite existing code/snippets (say
| you were working on a non-differentiable simulator
| before) into jax if you already have them in numpy then
| to do the whole rewrite in pytorch.
|
| I will say that I think pytorch has improved its numpy
| compatability a lot in recent years, functions that I was
| convinced didn't exist with pytorch (like eigh)
| apparently actually do.
| cube2222 wrote:
      | It's funny, because already after the first sentence it
      | felt like ChatGPT, probably because I've played with it a
      | lot these past few days, and as expected I found a
      | disclaimer at the end.
|
| That said, the answer isn't really useful, as it's very
| generic, without anything concrete (other than the mention of
| Autograd) imo.
|
| Though a follow up question might improve on that.
| singularity2001 wrote:
| Getting Started
|
| ...
|
| and zero words on how to get started.
|
| pip3 install torch2?
|
| pip3 install torch==2.0? nope
| SekstiNi wrote:
| https://pytorch.org/get-started/pytorch-2.0/#requirements
| quietbritishjim wrote:
| > Today, we announce torch.compile, a feature that pushes PyTorch
| performance to new heights and starts the move for parts of
| PyTorch from C++ back into Python.
|
| I'll admit I don't know enough about PyTorch to know what
| torch.compile is exactly. But does this means some features of
| PyTorch will no longer be available in the core C++ library? One
| of the nice things about PyTorch had been that you could do your
| training in Python then deploy with a pure C++ application.
| fddr wrote:
| The `torch.compile` API itself will not be available from C++.
| That means that you won't get the pytorch 2.0 performance gains
| if you use it via C++ API.
|
| There's no plan to deprecate the existing C++ API, it should
| keep working as it is. However, a common theme of all the
| changes is implementing more of pytorch in python (explicitly
| the goal of primtorch), so if this plan works it could happen
| in the long run.
| danieldk wrote:
| _One of the nice things about PyTorch had been that you could
| do your training in Python then deploy with a pure C++
| application._
|
| Or even train in C++ or Rust without much loss in
| functionality.
| synergy20 wrote:
      | Rust really has no presence in AI training engines yet;
      | it's probably 100% C++.
| danieldk wrote:
| I was referring to the libtorch library, which you can use
| through the tch crate. It is possible to make such rich
| bindings because so much of Torch is exposed through the
| C++ API. When more new functionality is moved to Python, it
| makes it harder to use functionality from the C++ interface
| and downstream bindings.
| synergy20 wrote:
    | Facebook did a similar thing with its original PHP codebase:
    | it uses HHVM to "compile" PHP (now called Hacklang) to gain
    | performance. It seems to be doing a similar thing with Python
    | here.
| algon33 wrote:
  | The FAQ re-states the content of point 14 in point 13. Point 14
  | is about why your code might be slower when using 2.0; point 13
  | should be about how to keep up with PT 2.0 developments.
  | Someone should change that.
| belval wrote:
| > We believe that this is a substantial new direction for PyTorch
| - hence we call it 2.0. torch.compile is a fully additive (and
| optional) feature and hence 2.0 is 100% backward compatible by
| definition.
|
| How about just calling it PyTorch 1.14 if it's backward
| compatible? Version numbering shouldn't be used as a marketing
| gimmick.
| posharma wrote:
| Is this really the biggest problem that needs to be solved in
| AI?
| whimsicalism wrote:
| No? What would have given you that impression?
|
| Oh, I see. You were trying to be dismissive.
| robertlagrant wrote:
| Quite the non sequitur you have there.
| belval wrote:
| Not sure I understand that question, is versioning the
| biggest problem no, but it costs nothing to keep semver and
| prevent production headaches later.
|
| If you meant inference speed then yeah it's a very big
| problem so it's good that they are addressing it.
| mi_lk wrote:
        | What exact production headaches are you expecting from
        | bumping the number from 1.13 -> 2.0, while all existing
        | code keeps working as before?
|
| And how is it different from bumping 1.13 to 1.14, even if
| they named it 1.14?
| belval wrote:
| The soft kind. Major versions are deeply ingrained as
| "possible backward-compatibility issues" in most
| engineers' brain. If you handle model development,
          | evaluation and deployment yourself, then sure, you won't
| have any issues, but in a bigger organization you have to
| get people to switch and that version number will mean
| that everyone will ask the same "hang on this is a major
| version change?!" question every step of the way.
| pdntspa wrote:
| They're saying it represents a change in direction and is a
| pretty big feature, traditionally that's been a good reason to
| increment a major version number.
| js2 wrote:
| Dismissive comments like this make me not want to read HN
| anymore and in addition it's against the HN guidelines:
|
| It's snarky. It's incurious. It's neither thoughtful nor
| substantive. It's flame bait. It's a shallow dismissal. It
| doesn't teach anything. It's the most provocative thing to
| complain about.
|
| https://news.ycombinator.com/newsguidelines.html
|
| I'm sorry I had to leave this comment, so let me also try to
| respond thoughtfully:
|
    | Assuming PyTorch uses semantic versioning: semver requires
    | that the major version MUST change when making a
    | backwards-incompatible API change:
|
| > Major version X (X.y.z | X > 0) MUST be incremented if any
| backwards incompatible changes are introduced to the public
| API. It MAY also include minor and patch level changes. Patch
| and minor versions MUST be reset to 0 when major version is
| incremented.
|
| This requirement does NOT preclude changing the major version
| when making backwards-compatible changes.
|
| PyTorch has not violated semver here. It is absolutely
| compatible with semver to bump the major version for marketing
| reasons.
|
| https://semver.org/
| belval wrote:
| Personal attack aside, from your own link:
|
| > Given a version number MAJOR.MINOR.PATCH, increment the:
|
| > MAJOR version when you make incompatible API changes
|
| > MINOR version when you add functionality in a backwards
| compatible manner
|
| > PATCH version when you make backwards compatible bug fixes
|
| > Additional labels for pre-release and build metadata are
| available as extensions to the MAJOR.MINOR.PATCH format.
|
| You can point towards some other details, but it doesn't
| change the fact that for the overwhelming majority of people,
| the quote above is what semver is. Besides, my original
| comment does not say "They broke semver", it says they
| shouldn't bump the major version if they don't make backward
| incompatible change because afterwards the mental model of
| "Can I use version X.Y.Z?" is broken.
|
| When TensorFlow moved to 2.0 it's because they were changing
| from graphs and session definition to eager mode. That makes
| sense, that means the underlying API and how the downstream
| users interact with it changed. These are just newer features
| that, while very useful, have limited bearing on downstream
| users.
___________________________________________________________________
(page generated 2022-12-02 23:01 UTC)