[HN Gopher] Accelerated PyTorch Training on M1 Mac
___________________________________________________________________
Accelerated PyTorch Training on M1 Mac
Author : tgymnich
Score : 335 points
Date : 2022-05-18 15:33 UTC (7 hours ago)
(HTM) web link (pytorch.org)
(TXT) w3m dump (pytorch.org)
| buildbot wrote:
| This is very interesting since the M1 Mac Studio supports 128GB
| of unified memory - training a large, memory-heavy model slowly
| on a single device could be interesting, or running inference on
| a very large model.
| zdw wrote:
| Everything old is new again - the M1 Studio's unified memory
| echoes the SGI O2, which had similar unified CPU/GPU memory back
| in the 90's.
|
| In both cases the unified memory machines outperformed much
| larger machines in specific use cases.
| smoldesu wrote:
| ... _specific use cases_ being the key phrase here. Unified
| memory is cool, but there are reasons we don't use it at
| scale:
|
| - It needs extremely high-bandwidth controllers, which
| severely limits the amount of memory you can use (Intel Macs
| could be configured with an order of magnitude more RAM in
| their server chips)
|
| - ECC is still off-the-table on M1 apparently
|
| - _Most_ workloads aren't really constrained by memory
| access in modern programs/kernels/compilers. Problems only
| show up when you want to run a GPU off the same memory, which
| is what these new Macs account for.
|
| - _Most_ of the so-called "specific workloads" that you're
| outlining aren't very general applications. So far I've only
| seen ARM outrun x86 in some low-precision physics demos,
| which is... fine, I guess? I still don't foresee
| meteorologists dropping their Intel rigs to buy a Mac Studio
| anytime soon.
| my123 wrote:
| > - It needs extremely high-bandwidth controllers, which
| severely limits the amount of memory you can use (Intel
| Macs could be configured with an order of magnitude more
| RAM in their server chips)
|
| In the first half of 2023, the NVIDIA Grace Superchip will ship
| with a 1TB memory config (930GB usable because of ECC bits)
| on a 1024-bit wide LPDDR5X-8533 config (same width as M1
| Ultra, with LPDDR5-6400).
|
| So it's going to become much less of an issue really soon.
| zdw wrote:
| > So it's going to become much less of an issue really
| soon.
|
| The main issue would be trying to purchase one of those,
| which is likely going to be both very rare and orders of
| magnitude more expensive than a Mac Studio.
|
| The Mac Studio isn't some crazy exotic hardware like
| datacenter class GPUs, but definitely has some exotic
| capabilities.
| my123 wrote:
| > The Mac Studio isn't some crazy exotic hardware like
| datacenter class GPUs, but definitely has some exotic
| capabilities.
|
| Datacenter-class GPUs are expensive, yeah, but they're quite
| easy to buy, even in single-unit quantities.
|
| example: https://www.dell.com/en-us/work/shop/nvidia-
| ampere-a100-pcie... for the first random link, but there
| are other stores selling them for significantly cheaper.
|
| I wonder what their CPU pricing will be though... we'll
| see I guess.
| Q6T46nT668w6i3m wrote:
| > Most workloads aren't really constrained by memory access
| in modern programs/kernels/compilers. Problems only show up
| when you want to run a GPU off the same memory, which is
| what these new Macs account for.
|
| For sure, but I expect this is different for the apps Apple
| _wants_ to write. It's easy to imagine the next version of
| Logic or whatever doing fine-tuning everywhere.
| smoldesu wrote:
| What is there to fine-tune, in a program like Logic? I've
| often heard that word associated with using extended
| instruction sets and leveraging accelerators, but where
| would the M1 have "untapped power" so-to-speak? I don't
| think the "upgrade" from a CISC architecture to a RISC
| one can yield much opportunity for optimization, at least
| not besides what the compiler already does for you.
| sbeckeriv wrote:
| What is the * in the chart referencing?
| mrchucklepants wrote:
| Probably supposed to be referencing the text under the plot
| stating the specific configuration of the hardware and
| software.
| sbeckeriv wrote:
| looks like the website was updated after I posted. I used
| page search to look for the *.
| munro wrote:
| yess! This is important for me, because I don't have any $$$ to
| rent GPUs for personal projects. Now we just need M1 support for
| JAX.
|
| Since there are no hard benchmarks against other GPUs, here's a
| Geekbench against an RTX 3080 Mobile laptop I have [1]. Looks
| like it's about 2x slower--the RTX laptop absolutely rips for
| gaming, I love it.
|
| [1]
| https://browser.geekbench.com/v5/compute/compare/4140651?bas...
| jph00 wrote:
| You can use GPUs for free on Paperspace Gradient, Google Colab,
| and Kaggle.
| mkaic wrote:
| This is really cool for a number of reasons:
|
| 1.) Apple Silicon _currently_ can't compete with Nvidia GPUs in
| terms of raw compute power, but they're already way ahead on
| energy efficiency. Training a small deep learning model on
| battery power on a laptop could actually be a thing now.
|
| Edit: I've been informed that for matrix math, Apple Silicon
| isn't actually ahead in efficiency
|
| 2.) Apple Silicon probably _will_ compete directly with Nvidia
| GPUs in the near future in terms of raw compute power in future
| generations of products like the Mac Studio and Mac Pro, which is
| very exciting. Competition in this space is incredibly good for
| consumers.
|
| 3.) At $4800, an M1 Ultra Mac Studio appears to be far and away
| the cheapest machine you can buy with 128GB of GPU memory. With
| proper PyTorch support, we'll actually be able to use this memory
| for training big models or using big batch sizes. For the kind of
| DL work I do where dataloading is much more of a bottleneck than
| actual raw compute power, Mac Studio is now looking _very_
| enticing.
| smoldesu wrote:
| There's definitely competition, and it's going to be _really
| interesting_ to watch Nvidia and Apple duke it out over the
| next few years:
|
| - Apple undoubtedly _owns_ the densest nodes, and will fight
| TSMC tooth-and-nail over first dibs on whatever silicon they
| have coming next.
|
| - Apple's current GPU design philosophy relies on horizontally
| scaling the tech they already use, whereas Nvidia has been
| scaling vertically, albeit slowly.
|
| - Nvidia has _insane_ engineers. Despite the fact that they're
| using silicon that's more than twice as large by area when
| compared to Apple, they're still doubling their numbers across
| the board. And that's their last-gen tech too; the comparison
| once they're on 5nm later this summer is going to be _insane_.
|
| I expect things to be very heated by the end of this year, with
| new Nvidia, Intel and _potentially_ new Apple GPUs.
| my123 wrote:
| > but they're already way ahead on energy efficiency
|
| 1) Nope. For neural network training, that's not the case:
| https://tlkh.dev/benchmarking-the-apple-m1-max
|
| And that's with the 3090 set at a very high 400W power limit;
| it can get far more efficient when clocked lower.
|
| (which is expected, since Apple's GPU notably has no dedicated
| matrix math accelerators)
|
| 2) We'll see, hopefully Apple thinks that the market is worth
| bothering with... (which would be great)
|
| 3) Indeed, if you need a giant pool of VRAM above everything
| else at a relatively low price tag, Apple is a quite enticing
| option. If you can stand Metal for your use case, of course.
| hedgehog wrote:
| To me the cool thing is that working through a PyTorch-based
| course like FastAI on a local Mac may now be above the
| tolerably-fast threshold.
| [deleted]
| mhh__ wrote:
| The thing with the efficiency (which I'm not sure of) and
| the competition (probably possible) is that the current Nvidia
| lineup is pretty old and on an even older process. They have a
| big moat.
| dekhn wrote:
| I remain skeptical that Apple's best GPU silicon will match
| Nvidia's premier products (either the top-end desktop card, or
| a server monster) for training.
|
| It seems like this is ideal as an accelerator for
| already-trained models; one can imagine Photoshop utilizing it
| for deep-learning-based inpainting.
|
| I was doing training on battery with a laptop that had a 1080;
| I have trained models on an airplane while totally unplugged
| and still had enough power to web-surf afterwards.
| sudosysgen wrote:
| Apple Silicon is not ahead at all on energy efficiency for
| desktop workloads. If they were ahead on energy efficiency,
| they would simply be ahead on performance, since GPUs are
| massively parallel architectures that are generally limited by
| the transistor and power budget (and memory, of course).
|
| Apple is simply behind in the GPU space.
|
| > At $4800, an M1 Ultra Mac Studio appears to be far and away
| the cheapest machine you can buy with 128GB of GPU memory. With
| proper PyTorch support, we'll actually be able to use this
| memory for training big models or using big batch sizes. For
| the kind of DL work I do where dataloading is much more of a
| bottleneck than actual raw compute power, Mac Studio is now
| looking very enticing.
|
| The reason why it's cheaper is that its memory has a _fraction_
| (around 20-35%) of the memory bandwidth of an equivalent 128GB
| GPU setup, and it also has to be split with the CPU.
| This is an unavoidable bottleneck of shared memory systems, and
| for a great many applications this is a terminal performance
| bottleneck.
|
| That's the reason you don't have a GPU with 128GB of normal
| DDR5. It would just be quite limited. Perhaps for some cases it
| can be useful.
| p1esk wrote:
| _its memory is at a fraction (around 30-40%) of the memory
| bandwidth of a 128GB equivalent GPU setup_
|
| Here's some info about M1 memory bandwidth:
| https://www.anandtech.com/show/17024/apple-m1-max-
| performanc...
| sudosysgen wrote:
| Yes. And the M1 Ultra has even more memory bandwidth than
| the M1 Max. But a 128GB system made of three NVIDIA A6000s has
| 3x768GB/s of memory bandwidth, and a setup of more common
| AI-grade cards has 2x2TB/s of memory bandwidth, which simply
| dwarfs the M1 Ultra.
| matthew-wegner wrote:
| For researchers, sure, but it's still quite an apples-to-
| oranges comparison.
|
| A6000 is ~$5k per card. I guess you're referring to
| something like an A100 on that other spec, which is
| $10k/card (for 40GB of memory).
|
| I do a fair bit of neural/AI art experimentation, where
| memory on the execution side is sometimes a limiting
| factor for me. I'm not training models, I'm not a
| hardcore researcher--those folks will absolutely be using
| NVIDIA's high-end stuff or TPU pods.
|
| 128GB in a Studio is super compelling if it means I can
| up-res some of my pieces without needing to use high-
| memory-but-super-slow CPU cloud VMs, or hope I get lucky
| with an A100 on Colab (or just pay for a GPU VM).
|
| I have a 128GB/Ultra Studio in my office now. It's a
| great piece of kit, and a big reason I splurged on it--
| okay, maybe "excuse"--was that I expect it'll be useful
| for a lot of my side project workloads over the next
| couple of years...
| sudosysgen wrote:
| Hmm, that's interesting. What kind of inference workload
| requires more than the 48GB of memory you'd get from 2
| 3090s, for example? I'm genuinely curious because I
| haven't run across them, and it sounds interesting.
| mkaic wrote:
| Not sure about inference but for training, 128GB is big
| enough to fit a decent-sized dataset entirely into
| memory, which causes a massive speedup. It's also
| probably cheaper to get a 128GB Mac Studio than a
| dual-3090 rig unless you're willing to build the rig
| yourself and pay the bare minimum for every component
| except the GPUs themselves.
|
| As for models that need 128GB of memory _at inference_ that a
| consumer would be interested in, I got nothing, though it
| certainly seems like it would be fun to mess around with
| haha
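| For context, the "dataset entirely in memory" trick mentioned
| above is as simple as something like this (a sketch with
| hypothetical preprocessed tensor files; the point is that every
| epoch then reads from unified memory instead of going through
| disk reads and JPEG decoding):
|
|     import torch
|     from torch.utils.data import DataLoader, TensorDataset
|
|     # Hypothetical preprocessed tensors, e.g. a [N, 3, 224, 224]
|     # uint8 image tensor saved once with torch.save().
|     images = torch.load("train_images.pt")
|     labels = torch.load("train_labels.pt")
|
|     loader = DataLoader(TensorDataset(images, labels),
|                         batch_size=256, shuffle=True,
|                         num_workers=0)  # no worker processes needed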
| matthew-wegner wrote:
| Mostly it's old-school style transfer! Well,
| "old" in the sense that it's pre-CLIP. I've played with
| CLIP-guided stuff too, but I've been tinkering with a
| custom style transfer workflow for a few years. The
| pipeline here is fractal IFS images (Chaotica/JWildfire)
| -> misc processing -> style transfer -> photo editing,
| basically.
|
| Only the workflow is the custom part--the core here is
| literally the original jcjohnson implementation.
| Occasionally I look around at recent work in the area,
| but most seems focused on fast (video-speed) inference or
| pre-baked style models. I've never seen something that
| retains artistic flexibility.
|
| My original gut feeling on style transfer was that it
| would be possible to mold it into a neat tool, but most
| people bumped into it, ran their profile photo against
| Starry Night, said "cool" and bounced off. And I get that
| --parameter tuning can be a sloooow process. When I
| really explore a series with a particular style I start
| to feed it custom content images made just for how it's
| reacting with various inputs.
|
| Here's a piece that just finished a few minutes ago:
| https://mwegner.com/misc/styled_render-
| BMrHXWz_2RBaUq8pAYKfL...
|
| That's from a local server in my garage with a K80. At
| some point I had two K80s in there (so basically four
| K40s with how they work), but dialed it back for power
| consumption reasons.
|
| I do have a 3090 in the house, and a decent amount of
| cloud infra that I sometimes tap. The jcjohnson
| implementation is so far back that it doesn't even run
| against modern hardware. At some point I need to sort
| that out, or figure out how to wrangle a more modern
| implementation into behaving in the way that I like.
|
| I don't really post these anywhere, although I do throw
| them over the wall on Twitter if anyone is curious to see
| more. These are a mix of things, although the
| CLIP/Midjourney/etc stuff is pretty easy to spot:
| https://twitter.com/mwegner/media
| visarga wrote:
| GPT-3 sized models need that kind of memory for inference
| sudosysgen wrote:
| GPT-3 is more like 300GB iirc
| mkaic wrote:
| Interesting, I wasn't aware of the memory bandwidth point,
| though it makes sense. TIL!
| ActorNightly wrote:
| > but they're already way ahead on energy efficiency.
|
| For raw compute like you need for ML training, the M1's
| efficiency doesn't matter. Under the hood, at the hardware
| level, there is a direct mapping from power consumption to
| compute circuit activation that you really can't get around.
|
| The general efficiency of the M1 is due to its architecture and
| how it fits together with normal consumer use: less work in
| instruction decode, more efficient reordering, less energy
| wasted moving data around thanks to the shared memory
| architecture, etc.
| ribit wrote:
| And yet somehow Apple's GPU ALUs are more efficient at 3.8
| watts per TFLOP. Mind, I am not talking about specialized
| matrix multiplication units that have a different internal
| organization and can do things like matrix multiplication
| much more efficiently, but about basic general-purpose GPU
| ALUs.
|
| The comparison of efficiency between Apple and Nvidia here is
| a bit misleading because one compares Apple's general-purpose
| ALUs to Nvidia's specialized ALUs. For a more direct
| efficiency comparison, one would need to compare the Tensor
| Cores against the AMX or ANE coprocessors.
|
| As to how Apple achieves such high efficiency, nobody knows.
| The fact that they are on 5nm node might help, but there must
| be something special about the ALU design as well. My
| speculation is that they are wider and much simpler than
| in other GPUs, which directly translates to efficiency wins.
| [deleted]
| arecurrence wrote:
| These are much nicer ergonomics than what I had to do for
| TensorFlow. It's ostensibly out-of-the-box support as just a
| different torch device.
| mark_l_watson wrote:
| I agree. I appreciated the M1/Metal TensorFlow support, but
| that was not as easy to set up.
| alfalfasprout wrote:
| I mean, building tensorflow is generally an awful experience.
| dangrie158 wrote:
| lekevicius wrote:
| Curiously, neither PyTorch nor TensorFlow currently uses the
| M1's Neural Engine. Is it too limited? Too hard to interact
| with? Not worth the effort?
| why_only_15 wrote:
| The ANE only supports calculations in fp16, int16 and int8,
| all of which are too small to train with (too much
| instability). A common approach is to train in fp32 to capture
| the small differences in the gradients, and then, once the
| model is frozen, do inference in fp16 or bf16.
| jph00 wrote:
| Using mixed precision training you can do most operations in
| fp16 and just a few in fp32 where it's needed. This is the
| norm for NVIDIA GPU training nowadays. For instance, using
| fastai, add `.to_fp16()` after your learner call and that
| happens automatically.
| omegalulw wrote:
| How is the choice between fp16 and fp32 made? Is it like if
| any gradients in the tensor need the extra range you use
| fp32?
| h-jones wrote:
| The PyTorch docs give a pretty good overview of AMP here
| https://pytorch.org/tutorials/recipes/recipes/amp_recipe.
| htm... and an overview of which operations cast to which
| dtype can be found here
| https://pytorch.org/docs/stable/amp.html#autocast-op-
| referen....
|
| Edit: Fixed second link.
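| For reference, the CUDA-flavoured pattern from that recipe looks
| roughly like this (a sketch with a toy model; it assumes an
| NVIDIA GPU, and whether the new MPS backend handles autocast the
| same way is a separate question):
|
|     import torch
|     import torch.nn as nn
|
|     model = nn.Linear(1024, 10).cuda()   # toy model
|     opt = torch.optim.SGD(model.parameters(), lr=0.01)
|     scaler = torch.cuda.amp.GradScaler()
|
|     for _ in range(10):
|         x = torch.randn(64, 1024, device="cuda")
|         y = torch.randint(0, 10, (64,), device="cuda")
|         opt.zero_grad()
|         # ops run in fp16 where it's safe, fp32 elsewhere
|         with torch.cuda.amp.autocast():
|             loss = nn.functional.cross_entropy(model(x), y)
|         # loss scaling avoids fp16 gradient underflow
|         scaler.scale(loss).backward()
|         scaler.step(opt)
|         scaler.update()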
| RicoElectrico wrote:
| Most probably the Neural Engine is optimized for inference,
| not training.
| sillyinseattle wrote:
| Question about terminology (no background in AI). In
| econometrics, estimation is model fitting (training, I
| guess), and inference refers to hypothesis testing (e.g. t or
| F tests). What does inference mean here?
| malshe wrote:
| I have background in both and it's very confusing to me.
| Inference in DL is running a trained model to
| predict/classify. Inference in stats and econometrics is
| totally different as you noted.
| mattkrause wrote:
| Prediction.
|
| The model is literally "inferring" something about its
| inputs: e.g., these pixels denote a hot dog, those don't.
| iamaaditya wrote:
| In machine learning (especially deep learning or neural
| networks), the 'training' is done by using Stochastic
| Gradient Descent. These gradients are computed using
| Backpropagation. Backpropagation requires you to do a
| backward pass of your model (typically many layers of
| neural weights) and thus requires you to keep in memory a
| lot of intermediate values (called activations). However,
| if you are doing "inference" that is if the goal is only to
| get the result but not improve the model, then you don't
| have to do the backpropagation and thus you don't need to
| store/save the intermediate values. As the layers and
| number of parameters in deep learning models grow, this
| difference in computation between training and inference
| becomes significant. In most modern applications of ML, you
| train
| once but infer many times, and thus it makes sense to have
| specialized hardware that is optimized for "inference" at
| the cost of its inability to do "training".
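| A minimal PyTorch sketch of that difference, with a toy model: a
| training step keeps the intermediate activations around so that
| backward() can compute the gradients, while wrapping inference
| in no_grad() lets them be freed immediately.
|
|     import torch
|     import torch.nn as nn
|
|     model = nn.Sequential(nn.Linear(784, 256), nn.ReLU(),
|                           nn.Linear(256, 10))
|     opt = torch.optim.SGD(model.parameters(), lr=0.1)
|     x, y = torch.randn(32, 784), torch.randint(0, 10, (32,))
|
|     # Training step: the forward pass records the graph and keeps
|     # activations so loss.backward() can compute every gradient.
|     loss = nn.functional.cross_entropy(model(x), y)
|     loss.backward()
|     opt.step()
|     opt.zero_grad()
|
|     # Inference: no graph is recorded, so activations can be
|     # discarded as soon as each layer has been consumed.
|     with torch.no_grad():
|         preds = model(x).argmax(dim=1)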
| eklitzke wrote:
| Just to add to this, the reason these inference
| accelerators have become big recently (see also the
| "neural core" in Pixel phones) is because they help doing
| inference tasks in real time (lower model latency) with
| better power usage than a GPU.
|
| As a concrete example, on a camera you might want to run
| a facial detector so the camera can automatically adjust
| its focus when it sees a human face. Or you might want a
| person detector that can detect the outline of the person
| in the shot, so that you can blur/change their background
| in something like a Zoom call. All of these applications
| are going to work better if you can run your model at,
| say, 60 Hz instead of 20 Hz. Optimizing hardware to do
| inference tasks like this as fast as possible with the
| least possible power usage is pretty different from
| optimizing for all the things a GPU needs to do, so you
| might end up with hardware that has both and uses them
| for different tasks.
| sillyinseattle wrote:
| Thank you @iamaaditya and @eklitzke. Very informative!
| dataexporter wrote:
| This sounds really fascinating. Are there any resources
| that you'd recommend for someone who's starting out in
| learning all this? I'm a complete beginner when it comes
| to Machine Learning.
| dr_zoidberg wrote:
| Deep Learning with Python (2nd ed), by Francois Chollet.
|
| Even if you don't care about the programming part, it's got a
| lot of beginner/intermediate concepts clearly explained. If
| you do dive into the programming examples, you get to play
| around with a few architectures and ideas, and you're left
| ready to dive into the more advanced material knowing what
| you're doing.
| upwardbound wrote:
| Inference here means "running" the model. So maybe it has a
| similar meaning as in econometrics?
|
| Training is learning the weights (millions or billions of
| parameters) that control the model's behavior, vs inference
| is "running" the trained model on user data.
| Q6T46nT668w6i3m wrote:
| I'm surprised nobody has provided the basic explanation:
| inference, here, means matrix-matrix or matrix-scalar
| multiplication.
| abm53 wrote:
| It is confusing that the ML community have come to use
| "inference" to mean prediction, whereas statisticians have
| long used it to refer to training/fitting, or hypothesis
| testing.
|
| I'm not sure when or why this started.
| munro wrote:
| That /sounds/ right, but training still has a forward part,
| so OP does raise a really great question. And looking at the
| silicon, the neural engine is almost the size of the GPU.
| Really need someone educated in this area to chime in :)
| dgacmu wrote:
| You have to stash more information from the forward pass in
| order to calculate the gradients during backprop. You can't
| just naively use an inference accelerator as part of
| training - inference-only gets to discard intermediate
| activations immediately.
|
| (Also, many inference accelerators use lower precision than
| you do when training)
|
| There are tricks you can do to use inference to accelerate
| training, such as one we developed to focus on likely-
| poorly-performing examples:
| https://arxiv.org/abs/1910.00762
| my123 wrote:
| The neural engine is only exposed through a CoreML
| inference API.
|
| You can't even poke the ANE hardware directly from a
| regular process. The interface for accessing the neural
| engine is not hardened (you can easily crash the machine
| from it).
|
| So the matter is essentially moot in practice as you'd need
| your users to run with SIP off...
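| In practice the route to the ANE from PyTorch is therefore to
| export a trained model with coremltools and let Core ML schedule
| it (a sketch with a toy model; whether a given layer actually
| lands on the ANE is decided by Core ML, not by you):
|
|     import coremltools as ct
|     import torch
|     import torch.nn as nn
|
|     model = nn.Sequential(nn.Linear(784, 256), nn.ReLU(),
|                           nn.Linear(256, 10)).eval()
|     traced = torch.jit.trace(model, torch.randn(1, 784))
|
|     # Let Core ML pick between CPU, GPU and Neural Engine.
|     mlmodel = ct.convert(traced,
|                          inputs=[ct.TensorType(shape=(1, 784))],
|                          compute_units=ct.ComputeUnit.ALL)
|     mlmodel.save("model.mlmodel")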
| [deleted]
| singularity2001 wrote:
| Anyone else getting "illegal hardware instruction"?
|
| (pytorch_env) ~/dev/ai/ python -c "import torch"
| zimpenfish wrote:
| IIRC, when I had that problem, it was because it was loading
| the wrong arch for Python.
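| A quick sketch for checking which architecture the interpreter
| is running as, and whether the nightly build sees the MPS
| device:
|
|     python -c "import platform; print(platform.machine())"
|     # arm64 = native Apple Silicon Python; x86_64 = under Rosetta
|     python -c "import torch; print(torch.backends.mps.is_available())"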
| Scene_Cast2 wrote:
| I'm curious about the performance compared to something like,
| say, the RTX 3070.
| ivstitia wrote:
| Here are some comparison numbers I've come across:
| https://wandb.ai/tcapelle/apple_m1_pro/reports/Deep-Learning...
|
| It is not really comparable on a steps-per-second level, but
| the power consumption and now the GPU memory make it pretty
| enticing.
| apohn wrote:
| I wrote a comment about a TensorFlow-on-M1 comparison to some
| cloud providers. I imagine PyTorch on M1 would give similar
| results. I think the gist would be that the 3070 is going to be
| a better investment.
|
| https://news.ycombinator.com/item?id=30608125
| my123 wrote:
| Low. Apple doesn't have matrix math accelerators in their
| current GPUs.
|
| The neural engine is small and inference only. It's also only
| exposed by a far higher level interface, CoreML.
|
| Where it could still make sense is if you have a small VRAM
| pool on the dGPU and a big one on the M1, but with the price of
| a Mac, not sure that makes a lot of sense either in most
| scenarios compared to paying for a big dGPU.
| LeanderK wrote:
| > The neural engine is small and inference only
|
| Why is it inference only? At least the operations are the
| same...just a bunch of linear algebra
| londons_explore wrote:
| Inference is often done in fixed point, whereas training is
| (usually) done in floating point.
|
| Inference also prefers different IO patterns, because you
| don't need to keep the activations for every layer ready
| for backpropagation.
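| As an illustration of the fixed-point side, PyTorch's
| post-training dynamic quantization converts the weights of
| supported layers to int8 for inference (a sketch with a toy
| model; which layer types are covered depends on the backend):
|
|     import torch
|     import torch.nn as nn
|
|     model = nn.Sequential(nn.Linear(784, 256), nn.ReLU(),
|                           nn.Linear(256, 10)).eval()
|
|     # Linear weights are stored as int8; activations are
|     # quantized on the fly at inference time.
|     qmodel = torch.quantization.quantize_dynamic(
|         model, {nn.Linear}, dtype=torch.qint8)
|
|     with torch.no_grad():
|         out = qmodel(torch.randn(1, 784))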
| Kon-Peki wrote:
| > Apple doesn't have matrix math accelerators in their
| current GPUs.
|
| That's because the M1 has a dedicated matrix math accelerator
| called AMX [1]. I've used it with both Swift and pure C.
|
| https://medium.com/swlh/apples-m1-secret-
| coprocessor-6599492...
| my123 wrote:
| AMX is indeed very nice for FP64, where consumer GPUs aren't
| an alternative at all.
|
| However, for lower precisions (which is what deep learning
| uses), you're much better off with a GPU.
| brrrrrm wrote:
| have you actually benchmarked that? I think (someone
| please correct me if I'm way off here) the AMX
| instructions can hit ~2.8tflops (fp16) per co-processor
| and there are 2 on the 7-core M1. That's 5.6tflops vs the
| 4.6tflops the GPU can hit.
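| One rough way to check peak matmul throughput from PyTorch (a
| sketch; whether the CPU path actually hits the AMX blocks
| depends on the BLAS backend the build uses, so treat the numbers
| as indicative only):
|
|     import time
|     import torch
|
|     def matmul_tflops(device, dtype, n=4096, iters=10):
|         a = torch.randn(n, n, device=device, dtype=dtype)
|         b = torch.randn(n, n, device=device, dtype=dtype)
|         (a @ b).sum().item()   # warm-up, forces the work to finish
|         start = time.perf_counter()
|         for _ in range(iters):
|             c = a @ b
|         c.sum().item()         # wait for all queued work
|         elapsed = time.perf_counter() - start
|         return 2 * n**3 * iters / elapsed / 1e12
|
|     print("mps fp16:", matmul_tflops("mps", torch.float16))
|     print("cpu fp32:", matmul_tflops("cpu", torch.float32))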
| johndough wrote:
| Often the limiting factor is memory bandwidth instead of
| raw FLOPS, so dealing with 4 times larger data types
| (FP64 vs FP16) is a disadvantage.
| brrrrrm wrote:
| to clarify: I am comparing FP16 performance, which both
| the GPU and AMX have native support for.
|
| FP64 is _also_ supported by AMX, making it quite an
| impressive region of silicon.
| my123 wrote:
| Yeah, that's within the M1 family, but put it next to dGPUs
| and it doesn't even come close.
|
| 30 TFLOPS for a 3080 for vector FP32, but 119 TFLOPS FP16
| dense with FP16 accumulate, 59.5 with FP32 accumulate, and if
| you exploit sparsity then that can go even higher.
| brrrrrm wrote:
| Ah yes, I misunderstood your original comment
| Kalanos wrote:
| Anyone care to comment on how this is better than Metal's
| TensorFlow support?
| ekelsen wrote:
| Nice results! But why are people still reporting benchmark
| results on VGG? Does anybody actually use this network anymore?
|
| Better would be MobileNets or EfficientNets or NFNets or vision
| transformers or almost anything that's come out in the 8 years
| since VGG was published (great work though it was at the time!).
| p1esk wrote:
| _why are people still reporting benchmark results on VGG?_
|
| Probably because it makes the hardware look good.
| DSingularity wrote:
| No. Because it is a way to compare performance. That's all.
| Just convenience.
| 0-_-0 wrote:
| This is the right answer. Efficient networks like
| EfficientNet are much harder to accelerate in HW.
| jorgemf wrote:
| Probably because it would be impossible to compare with old
| results. If every year the community chooses a different
| model, how are you going to compare results year over year?
| learndeeply wrote:
| ResNets have been around for 7 years...
| jorgemf wrote:
| It doesn't matter. Deep learning has been mainstream for
| only 10 years. MNIST is a dataset from 1998 and it is still
| being used in research papers. The most important thing is
| to have a constant baseline, and ResNets are a baseline.
|
| Think about changing the model every other year:
|
| - 2015: ResNet trained on an Nvidia K80
| - 2017: Inception trained on an Nvidia 1080 Ti
| - 2019: Transformer trained on an Nvidia V100
| - 2021: GPT-3 trained on a cluster
|
| Now you have your new fancy algorithm X and an Nvidia 4090.
| How much better is your algorithm compared to the state of
| the art, and how much have you improved compared to the
| algorithms from 5 years ago? Now you are in a nightmare and
| you have to run all the past algorithms in order to compare.
| Or how fast is the new Nvidia card, which no one has yet and
| for which Nvidia has decided to give numbers based on their
| own model?
| 6gvONxR4sf7o wrote:
| > But why are people still reporting benchmark results on VGG?
|
| It makes me feel like I'm missing something! Is it still used
| as a backbone in the same way as legacy code is everywhere, or
| is it something else entirely?
| plonk wrote:
| > Does anybody actually use this network anymore?
|
| Why not? It's still good for simple classification tasks. We
| use it as an encoder for a segmentation model in some cases.
| Most ResNet variants are much heavier.
| jph00 wrote:
| I don't think that's true - have a look at this analysis
| here:
|
| https://www.kaggle.com/code/jhoward/which-image-models-
| are-b...
|
| Those slow and inaccurate models at the bottom of the graph
| are the VGG models. A resnet34 is faster and more accurate
| than any VGG model. And there are better options now -- for
| example resnet34d is as fast as resnet34, and more accurate.
| And then convnext is dramatically better still.
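| For anyone who wants to reproduce a rough version of that
| comparison locally, a sketch using the timm library (an
| assumption: these model names exist in your timm version; the
| timing here is CPU-only and single-image, so it's indicative
| only):
|
|     import time
|     import timm
|     import torch
|
|     for name in ["vgg16", "resnet34d", "convnext_tiny"]:
|         model = timm.create_model(name, pretrained=False).eval()
|         x = torch.randn(1, 3, 224, 224)
|         with torch.no_grad():
|             model(x)                      # warm-up
|             t0 = time.perf_counter()
|             for _ in range(10):
|                 model(x)
|         ms = (time.perf_counter() - t0) / 10 * 1000
|         params = sum(p.numel() for p in model.parameters()) / 1e6
|         print(f"{name}: {params:.1f}M params, {ms:.1f} ms/image")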
| YetAnotherNick wrote:
| > ResNet > VGG: ResNet-50 is faster than VGG-16 and more
| accurate than VGG-19 (7.02 vs 9.0); ResNet-101 is about the
| same speed as VGG-19 but much more accurate than VGG-16 (6.21
| vs 9.0).
|
| https://github.com/jcjohnson/cnn-
| benchmarks#:~:text=ResNet%2....
| toppy wrote:
| Does the speedup refer to an absolute value or a percentage?
| dagmx wrote:
| At least for the charts, it looks like a multiplier (or divisor
| I guess) since the CPU baseline looks to be at 1
| toppy wrote:
| You're right! I missed this.
| alexfromapex wrote:
| Since it's tangentially relevant: if you have an M1 Mac, I've
| created some boilerplate for working with the latest TensorFlow
| with GPU acceleration as well:
| https://github.com/alexfromapex/tensorexperiments. I'm thinking
| of adding a branch for PyTorch now.
| masklinn wrote:
| Did you compare that to Apple's tf plugin to see what was what?
| galoisscobi wrote:
| This is great! Appreciate the note on H5Py troubleshooting as
| well.
| [deleted]
| cj8989 wrote:
| really hope to see some comparisons with nvidia gpus!
| amelius wrote:
| > Accelerated GPU training is enabled using Apple's Metal
| Performance Shaders (MPS) as a backend for PyTorch.
|
| What do shaders have to do with it? Deep learning is a mature
| field now, it shouldn't need to borrow compute architecture from
| the gaming/entertainment field. Anyone else find this
| disconcerting?
| my123 wrote:
| Apple doesn't have a separate API tailored towards compute
| only, but a single unified API that makes concessions to both.
|
| Concessions towards compute: a C++ programming language for
| device code (totally unlike what's done for most graphics
| APIs!)
|
| Concessions towards graphics: no single-source programming
| model at all for example...
| sudosysgen wrote:
| Many GPUs allow you to write device code in C++ via SYCL. It
| works well enough.
| dagmx wrote:
| Shaders are just the way compute is defined on the GPU.
|
| Why is that concerning to you?
| my123 wrote:
| That terminology isn't used at all in GPGPU compute APIs
| specifically tailored for that purpose, which use quite
| different programming models where you can mix host and
| device code in the same program.
|
| And there are "GPUs" today that can't do graphics at all (AMD
| MI100/MI200 generations) or only in a restricted way (Hopper
| GH100, which has the fixed-function pipeline on only two
| TPCs, for compatibility, but running very slowly because of
| that).
| alfalfasprout wrote:
| There's absolutely a lot of "graphics" terminology that
| spills into GPGPU. For example, texture memory in CUDA :)
| The reality is that GPUs, even the ones that can't output
| video, are ultimately still using hardware that largely is
| rooted in gaming. Obviously the underlying architectures
| for these ML cards are moving away from that (increasingly
| using more die space for ML related operations) but many of
| the core components like memory are still shared. It boils
| down to the fact that at the end of the day they're linear
| algebra processors.
| my123 wrote:
| I'd say that there has been quite a lot of sharing back and
| forth between the two. Evolutions in compute stacks shaped
| modern graphics APIs too.
|
| Texture units are indeed a part that is useful enough to
| be exposed to GPGPU compute APIs directly. The "shader"
| term itself disappeared quite early in those though, as
| did access to a good part of the FF pipeline including
| the rasterisers themselves.
| WhitneyLand wrote:
| It's not the greatest term even for graphics only.
|
| People new to CG are likely to intuit "shaders" as something
| related to, well, shading, but vertex shaders et al have
| nothing to do with the color of a pixel or a polygon.
| paulmd wrote:
| Wait until they learn a kernel has nothing to do with
| operating systems! And tensor operations have nothing to do
| with tensor objects! And texture memory often isn't even
| used for textures!
|
| It's an unfortunate set of terminology due to the way this
| space evolved from graphics programming - shader cores
| _used to do_ fixed-function shading! But then people wanted
| them to be able to run arbitrary shaders and not just
| fixed-function. And then hey, look at this neat processor,
| let's run a compute program on it. At first that was
| "compute shaders" running across graphics APIs, then came
| CUDA, and later OpenCL. But it is still running on the part
| of the hardware that provides shading to the graphics
| pipeline.
|
| Similarly, texture memory actually used to be used for
| textures, now it is a general-purpose binding that
| coalesces any type of memory access that has 1D/2D/3D
| locality.
|
| You kinda just get used to it. Lots of niches have their
| own lingo that takes some learning. Mathematics is
| incomprehensible without it, really.
| geertj wrote:
| Not sure if it's concerning but it caught my eye as well.
| MasterScrat wrote:
| Small code example in the PyTorch doc:
|
| https://pytorch.org/docs/master/notes/mps.html
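| For reference, the pattern is just another device string (a
| sketch; it assumes a nightly build with the MPS backend and
| macOS 12.3+):
|
|     import torch
|
|     device = torch.device(
|         "mps" if torch.backends.mps.is_available() else "cpu")
|
|     model = torch.nn.Linear(128, 10).to(device)  # model on the GPU
|     x = torch.randn(32, 128, device=device)      # tensors there too
|     print(model(x).sum().item())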
| ivstitia wrote:
| There was a report comparing M1 Pro with several other Nvidia
| GPUs from a few months ago:
| https://wandb.ai/tcapelle/apple_m1_pro/reports/Deep-Learning...
|
| I'm curious how the benchmarks change with this new
| release!
| nafizh wrote:
| Exciting!! But I don't see a comparison with any laptop Nvidia
| GPUs in terms of performance. That would be insightful.
| sudosysgen wrote:
| It compares unfavourably, but then again Nvidia GPUs in
| laptops are massive power hogs.
| smlacy wrote:
| Do Apple users really _require_ the ability to train large ML
| models while mobile and without access to A/C power? Is this
| a real-world use case for the target market?
| sudosysgen wrote:
| Indeed, I doubt anyone really needs that. And anyway, while
| training a model you'd be lucky to get an hour of battery
| life even on an M1 Max.
___________________________________________________________________
(page generated 2022-05-18 23:00 UTC)