[HN Gopher] Analyzing the performance of Tensorflow training on ...
       ___________________________________________________________________
        
       Analyzing the performance of Tensorflow training on M1 Mac Mini and
       Nvidia V100
        
       Author : briggers
       Score  : 219 points
       Date   : 2021-01-14 07:04 UTC (15 hours ago)
        
 (HTM) web link (wandb.ai)
 (TXT) w3m dump (wandb.ai)
        
       | lopuhin wrote:
       | > I chose MobileNetV2 to make iteration faster. When I tried
       | ResNet50 or other larger models the gap between the M1 and Nvidia
       | grew wider.
       | 
       | (and that's on CIFAR-10). But why not report these results and
        | also test on more realistic datasets? The internet is full of
        | M1 TF benchmarks on CIFAR or MNIST; has anyone seen anything
       | different?
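        | 
        | For reference, benchmarking ResNet50 alongside MobileNetV2 is a
        | tiny change - a rough sketch, assuming tf.keras (hyperparameters
        | are illustrative, not the article's):
        | 
        |     import tensorflow as tf
        | 
        |     (x, y), _ = tf.keras.datasets.cifar10.load_data()
        | 
        |     for arch in (tf.keras.applications.MobileNetV2,
        |                  tf.keras.applications.ResNet50):
        |         # weights=None -> train from scratch on 10 classes
        |         model = arch(input_shape=(32, 32, 3), weights=None,
        |                      classes=10)
        |         model.compile(optimizer="adam",
        |                       loss="sparse_categorical_crossentropy",
        |                       metrics=["accuracy"])
        |         model.fit(x / 255.0, y, batch_size=128, epochs=1)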
        
         | sillysaurusx wrote:
         | Hehe. That criticism could be applied to ML itself. :)
         | 
         | I wish ML used more than CIFNISTNet, but unfortunately there's
         | not a lot of standard datasets yet. (Even Imagenet is an
         | absolute pain to set up.)
        
           | sdenton4 wrote:
           | Tensorflow Datasets includes a lot of the 'standard' datasets
           | in a way that's dead simple to call up and use (including ~10
           | variants of imagenet):
           | https://www.tensorflow.org/datasets/catalog/overview
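            | 
            | For example (a sketch; 'imagenette' is assumed to be the
            | catalog name of the 10-class ImageNet subset):
            | 
            |     import tensorflow as tf
            |     import tensorflow_datasets as tfds
            | 
            |     # First use downloads and caches the data; the result
            |     # is a tf.data pipeline of (image, label) pairs.
            |     ds = tfds.load("imagenette", split="train",
            |                    as_supervised=True, shuffle_files=True)
            |     ds = ds.map(lambda im, lab:
            |                 (tf.image.resize(im, (224, 224)), lab))
            |     ds = ds.batch(32).prefetch(tf.data.AUTOTUNE)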
        
       | JohnHaugeland wrote:
        | "Can Apple's M1 do a good job? We cut things down to unrealistic
       | sizes, turned off cores, and p-hacked as hard as we could until
       | we found a way to pretend the answer was yes"
        
       | whywhywhywhy wrote:
       | >We can see better performance gains with the m1 when there are
       | fewer weights to train likely due to the superior memory
       | architecture of the M1.
       | 
        | Wasn't this whole "M1 memory" thing decided to be a myth now
        | that some more technical people have dissected it?
        
         | rsynnott wrote:
         | As with many things, there isn't one "M1 memory" thing. It's a
         | combination of myth and real stuff. No, it isn't ultra-low
         | latency or high-bandwidth. But on the other hand, single core
         | achievable bandwidth is very high.
        
         | iforgotpassword wrote:
          | Myth or not, its memory bandwidth is amazing, so I guess that
         | helps.
        
         | vletal wrote:
         | Could you please provide some resources on how the unified
         | memory model supposedly works? Why is it a "myth"?
        
           | tyingq wrote:
           | I believe it's referring not to unified memory, but some
           | speculation that the memory being closer to the CPU makes
           | some notable difference. That line was in a fair amount of
           | the initial articles about the M1, and fits the "myth"
           | description.
        
         | gameswithgo wrote:
         | I think two different memory things are being talked about:
         | 
         | 1. There is an idea that M1 has RAM that is vastly higher
         | bandwidth than intel/amd machines. In reality it is the same
         | laptop ddr ram that other machines have, though at a very high
         | clock rate. Not higher than the best intel laptops though. So
         | the bandwidth is not any more amazing than a top end Intel
         | laptop, and latency is no different.
         | 
         | 2. But in this case I believe they are talking about the CPU
         | and GPU both being able to freely access the same ram, as
          | compared to a setup where you have a discrete GPU with its own
         | ram, where data must first be copied to the GPU ram for the GPU
         | to do something with it. In some workloads this can be an
         | inferior approach, in others it can be superior, as the GPU's
          | ram is faster. The M1 model again isn't unique, as it's similar
         | to how game consoles work, I believe.
        
           | dragontamer wrote:
           | > In reality it is the same laptop ddr ram that other
           | machines have
           | 
            | LPDDR4 is better known for cell phones than laptops,
            | actually. I think it shows the stagnation of the laptop
           | market (and DDR4) that LPDDR4 is really catching up (and then
           | some). Or maybe... because cell phones are more widespread
           | these days, cell phones just naturally get the better tech?
           | 
           | On the other hand, M1 is pretty wide. Apple clearly is
           | tackling the memory bottleneck very strongly in its design.
           | 
           | DDR5 is going to be the next major step forward for
           | desktops/laptops.
           | 
           | > 2. But in this case I believe they are talking about the
           | CPU and GPU both being able to freely access the same ram, as
            | compared to a setup where you have a discrete GPU with its
           | own ram, where data must first be copied to the GPU ram for
           | the GPU to do something with it. In some workloads this can
           | be an inferior approach, in others it can be superior, as the
            | GPU's ram is faster. The M1 model again isn't unique, as
            | it's similar to how game consoles work, I believe.
           | 
            | More than just the "same RAM": it probably even shares the
            | same last-level cache. Both AMD's chips and Intel's iGPUs
            | share the cache in their CPU/GPU hybrid architectures.
           | 
           | However: it seems like on-core SIMD units (aka: AVX or ARM
           | NEON / SVE) are even lower latency, since those share L1
           | cache.
           | 
           | Any situation where you need low latency but SIMD, it makes
           | more sense to use AVX / SVE than even waiting for L3 cache to
           | talk to the iGPU. Any situation where you need massive
           | parallelism, a dedicated 3090 is more useful.
           | 
            | It's going to be tough to figure out a good use for iGPUs:
            | they're being squeezed on the latency front (by things like
            | A64FX: 512-bit ARM SIMD, as well as AVX-512 on the Intel
            | side), and also squeezed on the bandwidth front (by classic
            | GPUs).
        
         | coldtea wrote:
         | No. Some technical people just gave their non-definitive two
         | cents.
        
       | helsinkiandrew wrote:
        | Can someone with more knowledge of Nvidia GPUs please say how
        | much the V100 costs ($5-10K?) compared with the $900 Mac mini?
        
         | fxtentacle wrote:
         | You would instead buy a used 1080 (no ti) for similar
         | performance.
         | 
          | The special thing about the V100 is that its driver EULA
         | allows data center usage. If you don't need that, there are
         | other much cheaper options.
        
           | littlestymaar wrote:
            | > The special thing about the V100 is that its driver EULA
           | allows data center usage.
           | 
           | Wait what? Is it the only thing?
           | 
           | That sounds hard to believe: if true, using the open driver
           | (Nouveau) instead of Nvidia's proprietary one would be a
            | massive money saver for datacenter operators (and even if
           | Nouveau doesn't support the features you'd want already,
           | supporting their development would be much cheaper for a
           | company like Amazon than paying a premium on every GPU they
           | buy)
        
             | fsh wrote:
              | Nouveau does not support CUDA and is therefore not usable
              | for GPU computing on Nvidia hardware.
        
             | YetAnotherNick wrote:
              | NVIDIA's driver EULA prevents data centre use of their
              | consumer hardware. Also, NVIDIA does not allow bulk buying
              | of the RTX series.
        
               | alickz wrote:
               | They barely allow single buying for the 30 series :(
               | 
               | Took me quite a while to get my hands on a 3080.
        
               | jklehm wrote:
               | What ended up working for you?
        
             | rrss wrote:
             | No, that's not the only thing.
             | 
             | Other characteristics of V100 that may be interesting to
             | people buying GPUs for data centers:
             | 
             | - higher capacity GPU memory. 1080 has 8 GB, V100 has 16 or
             | 32 GB.
             | 
             | - higher bandwidth GPU memory. V100 has HBM2 with a peak of
             | 900 GB/s, 1080 has G5X with a peak of ~300 GB/s.
             | 
             | - ECC support.
             | 
             | - data center certification + warranty
             | 
             | (The geforce warranty covers normal consumer usage, like
             | gaming, and does not cover datacenter use)
             | 
             | - availability of enterprise support contracts.
             | 
             | (If you are buying a ton of GPUs to put in a datacenter,
             | you probably don't want to end up on the normal consumer
             | support line when something goes wrong)
             | 
             | - fast fp64
             | 
             | There are probably others
        
               | littlestymaar wrote:
               | Thanks, that makes much more sense!
        
           | sillysaurusx wrote:
           | Don't buy hardware in general for AI work, IMO. It'll be out
           | of date in a year and you'll end up training in the cloud
           | anyway.
        
             | dx034 wrote:
             | If you properly utilize your hardware, on premise (or
             | colocation in an area with cheap electricity prices) is
             | vastly cheaper and will likely continue to be for a while.
             | I don't see how training models in the cloud makes
             | financial sense for organizations that can utilize their
             | hardware 24/7.
             | 
             | For all others with burst workloads training in the cloud
             | can make sense, but that has been the case for a while
             | already.
        
               | sillysaurusx wrote:
               | We're not talking about organizations, though. I don't
               | agree with your premise, either. People aren't training
               | models 24/7, so the idea that it's "vastly cheaper and
               | will continue to be for a while" isn't true.
        
               | king_magic wrote:
               | > People aren't training models 24/7
               | 
               | ... uh, you sure about that? Let me go check on the 3
               | models I have concurrently training for my organization
               | on 3 separate GPU servers (all 2 year old hardware to
               | boot) that have been running continuously for the past 36
               | hours. It pretty much works out to 24/7 training for the
               | past several months.
               | 
               | And BTW, this is _massively_ cheaper for us than training
               | in the cloud.
        
               | qayxc wrote:
               | Instead of arguing back and forth, how about a test case
               | instead?
               | 
               | Pretraining BERT takes 44 minutes on 1024 V100 GPUs [1]
               | 
               | This requires dedicated instances, since shared instances
               | won't be able to get to peak performance if only because
               | of the "noisy neighbour"-effect.
               | 
               | At GCP, a V100 costs $2.48/h [2], so Microsoft's
               | experiment would've cost $2,539.52.
               | 
               | Smaller providers offer the same GPU at just $1.375/h
               | [3], so a reasonable lower limit would be around $1,408.
               | 
                | For a single BERT pretraining, provided highly optimised
                | workflows and distributed training scripts are already
                | at hand, renting GPUs seems to be the way to go.
               | 
                | The cost of V100-equivalent end-user hardware (we don't
                | need to run in a datacentre; dedicated workstations will
                | do) is about $6,000 (e.g. a Quadro RTX 6000), provided
               | you don't need double precision. The card will have equal
               | FP32 performance, lower TGP and VRAM that sits between
               | the 16 GB and 32 GB version of the V100.
               | 
                | Workstation hardware to go with such a card will cost
                | about $2,000, so $8,000 is a reasonable cost estimate.
                | The cost of electricity varies between regions, but in
                | the EU the average non-household price is about
                | 0.13 EUR/kWh [4].
               | 
                | Pretraining BERT therefore costs an estimated 1024 h *
                | 0.13 EUR/kWh * 0.5 kW ≈ 67 EUR in electricity (power
               | consumption estimated from TGP + typical power
               | consumptions of an Intel Xeon workstation from my own
               | measurements when training models).
               | 
               | In order to get the break-even point we can use the
               | following equation: t * $1,408 = $8,000 + t * $69, which
               | results in t = 8,000/(1408-69) or t > 5.
               | 
                | In short, if you pretrain BERT 6 times, you save money
                | by BUYING a workstation and running it locally rather
                | than renting cloud GPUs from a reasonably cheap provider
                | (a short sketch of this calculation follows the
                | references below).
               | 
                | This example only concerns BERT, but you can use the
                | same reasoning for any model whose compute time and
                | VRAM requirements you know.
               | 
                | This only concerns training, too - inference is a
                | different can of worms entirely.
               | 
               | [1] https://www.deepspeed.ai/news/2020/05/27/fastest-
               | bert-traini...
               | 
               | [2] https://cloud.google.com/compute/gpus-pricing
               | 
               | [3] https://www.exoscale.com/syslog/new-tesla-v100-gpu-
               | offering/
               | 
               | [4] https://ec.europa.eu/eurostat/statistics-
               | explained/index.php...
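                | 
                | A minimal sketch of the break-even arithmetic above,
                | using the numbers estimated in this comment:
                | 
                |     # cost of renting vs. buying, per pretraining run
                |     RENT_PER_RUN  = 1408   # USD: 1024 GPU-h at $1.375/h
                |     HARDWARE_COST = 8000   # USD: workstation + RTX 6000
                |     POWER_PER_RUN = 69     # USD: ~512 kWh of electricity
                | 
                |     runs = HARDWARE_COST / (RENT_PER_RUN - POWER_PER_RUN)
                |     print(f"break-even after ~{runs:.1f} runs")  # ~6.0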
        
           | spi wrote:
           | "Similar performance" still means 30%-50% slower [1] and half
           | the RAM, not really that comparable.
           | 
           | For much closer performance you should get a 2080ti, which
           | should be roughly comparable in speed and have 11GB [edit:
           | wrongly wrote 14GB before] of memory (against the 16GB for
           | the V100). Price-wise you still save a lot of money, after
           | quickly googling around, roughly $1200 vs. $15k-$20k.
           | 
           | But you still lose something, e.g. if you use half precision
           | on V100 you get virtually double speed, if you do on a 1080 /
           | 2080 you get... nothing because it's not supported.
           | 
           | (and more importantly for companies, you can actually use
           | only V100-style stuff on servers [edit: as you mentioned
           | already, although I'm not 100% sure it's just drivers that
           | are the issue?])
           | 
           | [1] I've not used 1080 myself, but I've used 1080ti and V100
           | extensively, and the latter is about 30% faster. Hence my
           | estimate for comparison with 1080
        
             | FeepingCreature wrote:
             | How does AMD stuff like Radeon VII or MI100 hold up?
        
               | fxtentacle wrote:
                | Can't use it; most AI frameworks won't run on AMD
                | because suitable back-ends haven't been implemented
                | (yet).
        
               | breuleux wrote:
               | There's one for PyTorch, I tested it about a year ago.
                | You have to compile it from scratch and IIRC it
                | translates/compiles CUDA to ROCm at runtime which causes
               | noticeable pauses on the first run. There may be other
               | tweaks you have to do too. Once set up it performs
               | decently, though.
        
             | trott wrote:
             | > But you still lose something, e.g. if you use half
             | precision on V100 you get virtually double speed, if you do
             | on a 1080 / 2080 you get... nothing because it's not
             | supported.
             | 
             | That's not true. FP16 is supported and can be fast on 2080,
             | although some frameworks fail to see the speed-up. I filed
             | a bug report about this a year ago:
             | https://github.com/apache/incubator-mxnet/issues/17665
             | 
             | What consumer GPUs lack is ECC and fast FP64.
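              | 
              | For what it's worth, in TF the FP16 path usually has to be
              | enabled explicitly via the mixed-precision policy - a
              | sketch, assuming TF 2.4+:
              | 
              |     import tensorflow as tf
              | 
              |     # Compute in fp16, keep variables in fp32; this is
              |     # where tensor-core cards (V100, 20xx) get their
              |     # speed-up. Keras handles loss scaling in Model.fit.
              |     tf.keras.mixed_precision.set_global_policy(
              |         "mixed_float16")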
        
             | fxtentacle wrote:
             | For my workload (optical flow) I was honestly surprised to
             | see that the Google Cloud V100 was not faster than my local
             | GTX 1080. So I guess that varies a lot by how you're
             | training, too.
             | 
              | For many of my AI training workloads, the 1080 is already
              | "fast enough" and the CPU or SSDs are the bottleneck. In
             | that case, GPU doesn't really matter that much.
        
               | spi wrote:
                | Yes, that might be the case. In my case I mostly trained
                | big (tens to hundreds of millions of parameters) networks
                | made up of 3x3 convolutions, and I think the V100 has
               | dedicated hardware for that. Then as I mentioned you can
               | get a further 2x speedup by using half precision.
               | 
                | If you train smaller models, or RNNs, you probably lose
                | most of the gains of dedicated hardware. But I guess that
                | for this same reason the experiments in the article are
                | little more than a provocation; I don't know if you could
                | train a big network in finite time on M1 chips...
               | 
               | That said, of course, if the budget was mine, I wouldn't
               | buy a V100 :-)
        
       | tpoacher wrote:
       | Betteridge says no.
        
         | coldtea wrote:
         | And Betteridge is wrong.
        
       | ilikedthatone wrote:
        | Not reading it if it forces me to use JS. Your ideas don't even
        | matter if you require me to enable JS just to learn what they
        | are.
        
         | iforgotpassword wrote:
         | This is OT, but since this comment is already here: I usually
         | browse HN on WP8.1 with IE11. Now I don't expect people to care
         | for that platform anymore in 2021, but in this case it was
         | especially ridiculous since the page actually loaded the full
          | article, but then about 5 seconds later it was replaced by an
          | "oops something went wrong..." message.
        
       | 0x008 wrote:
       | Well, putting out a tl;dr and then a graph that does not mention
       | FP16/FP32 performance differences or anything related to TensorRT
        | cannot be taken seriously if we're talking about performance per
        | watt. We need to see a comparison that includes multiple
        | scenarios so we can determine something like a break-even point
        | between Nvidia GPUs and the Apple M1 GPU, possibly even for
        | several SotA models.
        
       | baxter001 wrote:
       | No, but it's pretty good at retraining the final layer of low
       | memory networks like MobileNet - weirdly a workload that the V100
       | is very poorly suited for...
        
         | xiphias2 wrote:
         | What about the M1X that will come with 64GB RAM? I'm thinking
         | of waiting for that to come out. Ah...I just see that the
         | article authors are waiting for it as well
        
         | enos_feedler wrote:
         | Not surprising since this is a training use case that Apple
         | very much focuses on with CreateML.
        
       | tbalsam wrote:
        | This is on a model designed to run faster on CPUs. It's like
        | dropping a bowling ball on your foot and acting excited that
        | you feel bruised after a few days.
       | 
        | Maybe there's something interesting there, but the overhype of
        | the title takes away any significant amount of credit I'd give
        | to the publishers for the research. If you find something
       | interesting, say it, and stop making vapid generalizations for
       | the sake of more clicks.
       | 
        | Remember, we only feed the AI hype bubble when we do this. It
       | might be good results, but we need to be at least realistic about
       | it, or there won't be an economy of innovation for people to
        | listen to in the future, because they'll have tuned it out
        | with all of the crap marketing that comes/came before it.
       | 
       | Thanks for coming to my TED Talk!
        
         | lukas wrote:
          | I don't think MobileNetV2 is designed to train on CPUs -
         | according to this https://azure.microsoft.com/en-us/blog/gpus-
         | vs-cpus-for-depl... MobileNetV2 gets bigger gains from GPUs vs
         | several CPUs than ResNet. You could argue the batch size
         | doesn't fully use the V100 but these comparisons are tricky and
         | this looks like fairly normal training to me.
         | 
         | It's pretty surprising to me that an M1 performs anywhere near
         | a V100 on model training and I guess the most striking thing is
         | the energy efficiency of the M1.
        
           | tbalsam wrote:
            | MV2 is memory-limited; the depthwise + grouped + 1x1 convs
            | have long kernel launch times on GPU. Shattered kernels are
            | fine for CPU, but not for GPU.
           | 
            | Though per your note on the scales, those are really
            | interesting empirical results. I'll have to look into that,
            | thanks for passing that along.
        
         | [deleted]
        
       | SloopJon wrote:
       | The first graph includes "Apple Intel", which is not mentioned
       | anywhere else in the post. Any idea what hardware that was, and
       | whether it used the accelerated TensorFlow?
        
         | vanpelt wrote:
         | My bad, this was using non-Accelerated TensorFlow on a 2.3GHz
         | 8-Core i9.
        
       | volta87 wrote:
       | When developing ML models, you rarely train "just one".
       | 
       | The article mentions that they explored a not-so-large hyper-
       | parameter space (i.e. they trained multiple models with different
       | parameters each).
       | 
        | It would be interesting to know how long the whole process
        | takes on the M1 vs the V100.
       | 
       | For the small models covered in the article, I'd guess that the
       | V100 can train them all concurrently using MPS (multi-process
       | service: multiple processes can concurrently use the GPU).
       | 
        | In particular it would be interesting to know whether the V100
       | trains all models in the same time that it trains one, and
       | whether the M1 does the same, or whether the M1 takes N times
       | more time to train N models.
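        | 
        | A rough way to test the V100 side of that question - a sketch
        | assuming NVIDIA's MPS daemon is installed and a hypothetical
        | train.py that takes its hyperparameters as flags:
        | 
        |     import subprocess
        | 
        |     # Start the CUDA MPS control daemon so several training
        |     # processes can share one V100.
        |     subprocess.run(["nvidia-cuda-mps-control", "-d"],
        |                    check=True)
        | 
        |     # One process per hyperparameter setting.
        |     procs = [subprocess.Popen(["python", "train.py",
        |                                "--lr", str(lr)])
        |              for lr in (1e-2, 1e-3, 1e-4)]
        |     for p in procs:
        |         p.wait()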
       | 
        | This could paint a completely different picture, particularly
        | from the user's perspective. When I go for lunch, coffee, or
        | home, I
       | usually spawn jobs training a large number of models, such that
       | when I get back, all these models are trained.
       | 
       | I only start training a small number of models at the latter
       | phases of development, when I have already explored a large part
       | of the model space.
       | 
       | ---
       | 
       | To make the analogy, what this article is doing is something
       | similar to benchmarking a 64 core CPU against a 1 core CPU using
       | a single threaded benchmark. The 64 core CPU happens to be
       | slightly beefier and faster than the 1 core CPU, but it is more
       | expensive and consumes more power because... it has 64x more
       | cores. So to put things in perspective, it would make sense to
        | also show a benchmark that can use all 64 cores, which is the
        | reason somebody would buy a 64-core CPU, and see how the
        | single-core one compares (typically 64x slower).
       | 
       | ---
       | 
       | To me, the only news here is that Apple GPU cores are not very
       | far behind NVIDIA's cores for ML training, but there is much more
       | to a GPGPU than just the perf that you get for small models in a
        | small number of cores. Apple would still need to (1) catch up,
        | and (2) scale their design up massively. They probably can do
        | both if they set their sights on it. Exciting times.
        
         | nightcracker wrote:
         | > When developing ML models, you rarely train "just one".
         | 
         | Depends on your field. In Reinforcement Learning you often
          | really do train _just one_, at least on the same data set
         | (since the data set often is dynamically generated based on the
         | behavior of the previous iteration of the model).
        
           | volta87 wrote:
            | Even in reinforcement learning you can train multiple models
           | with different data-sets concurrently and combine them for
           | the next iteration.
        
         | sdenton4 wrote:
          | The low GPU utilization rate in the first graph is kind of a
          | tell... Seems like the M1 is a little bit worse than 40% of a
          | V100?
        
         | lukas wrote:
         | Do you really train more than one model at the same time on a
         | single GPU? In my experience that's pretty unusual.
         | 
         | I completely agree with your conclusion here.
        
           | junipertea wrote:
            | I found training multiple models on the same GPU hits other
            | bottlenecks (mainly memory capacity/bandwidth) fast. I tend
           | to train one model per GPU and just scale the number of
           | computers. Also, if nothing else, we tend to push the models
           | to fit the GPU memory.
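            | 
            | For what it's worth, TF grabs the whole GPU by default; if
            | you do share a card between processes you'd at least want
            | memory growth enabled - a sketch:
            | 
            |     import tensorflow as tf
            | 
            |     # Allocate GPU memory on demand instead of reserving it
            |     # all up front, so several processes can coexist.
            |     for gpu in tf.config.list_physical_devices("GPU"):
            |         tf.config.experimental.set_memory_growth(gpu, True)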
        
       | jlouis wrote:
       | CPUs often outperform specialized hardware on small models. This
       | is nothing new. You'd need to go to a larger model, and then
       | power consumption curves change too.
        
       | StavrosK wrote:
        | I'm seeing a lot of M1 hype, and I suspect most of it is
       | unwarranted. I looked at comparisons between the M1 and the
       | latest Ryzens, and it looks like it's comparable? Does anyone
        | know details? I only had a cursory look.
        
         | ZeroCool2u wrote:
         | The main hype is that performance is similar, but the M1 does
         | it with a lot less power draw. The performance itself isn't too
         | crazy. It's just crazy that it does it with a somewhat similar
         | power draw to a high end phone.
        
       | sradman wrote:
       | I categorize this as an exploration of how to benchmark
       | desktop/workstation NPUs [1] similar to the exploration Daniel
       | Lemire started with SIMD. Mobile SoC NPUs are used to deploy
        | inference models on smartphones and IoT devices while discrete
       | NPUs like Nvidia A100/V100 target cloud clusters.
       | 
       | We don't have apples-to-apples benchmarks like SPECint/SPECfp for
       | the SoC accelerators in the M1 (GPU, NPU, etc.) so these early
       | attempts are both facile and critical as we try to categorize and
        | compare the trade-offs between the SoC/discrete and
       | performance/perf-per-watt options available.
       | 
       | Power efficient SoC for desktops is new and we are learning as we
       | go.
       | 
       | [1] https://en.m.wikipedia.org/wiki/AI_accelerator
        
         | volta87 wrote:
         | > We don't have apples-to-apples benchmarks
         | 
         | We do: https://mlperf.org/
         | 
         | Just run their benchmarks. Submitting your results there is a
         | bit more complicated, because all results there are "verified"
         | by independent entities.
         | 
         | If you feel like your AI use case is not well represented by
         | any of the MLPerf benchmarks, open a discussion thread about
         | it, propose a new benchmark, etc.
         | 
         | The set of benchmarks there increases all the time to cover new
         | applications. For example, on top of the MLPerf Training and
         | MLPerf Inference benchmark suites, we now have a new MLPerf HPC
         | suite to capture ML of very large models.
        
           | solidasparagus wrote:
           | Those benchmarks are absurdly tuned to the hardware. Just
           | look at the result Google gets with BERT on V100s vs the
           | result NVIDIA gets with V100s. It's an interesting
           | measurement of what experts can achieve when they modify
           | their code to run on the hardware they understand well, but
           | it isn't useful beyond that.
        
             | volta87 wrote:
             | > Just look at the result Google gets with BERT on V100s vs
             | the result NVIDIA gets with V100s.
             | 
             | These benchmarks measure the combination of
             | hardware+software to solve a problem.
             | 
             | Google and NVIDIA are using the same hardware, but their
             | software implementation is different.
             | 
             | ---
             | 
             | The reason mlperf.org exists is to have a meaningful set of
             | relevant practical ML problems that can be used to compare
             | and improve hardware and software for ML.
             | 
              | For any piece of hardware, you can create an ML benchmark
              | that's irrelevant in practice but performs much better on
             | that hardware than the competition. That's what we used to
             | have before mlperf.org was a thing.
             | 
             | We shouldn't go back there.
        
           | sradman wrote:
           | > on top of the MLPerf Training and MLPerf Inference
           | benchmark suites, we now have a new MLPerf HPC suite to
           | capture ML of very large models.
           | 
           | I think the challenge is selecting the tests that best
           | represent the typical ML/DL use cases for the M1 and
           | comparing it to an alternative such as the V100 using a
           | common toolchain like Tensorflow. One of the problems that I
           | see is that the optimizer/codegen of the toolchain is a key
           | component; the M1 has both GPU and Neural Engine and we don't
           | know which accelerator is targeted or even possibly both.
            | Should we benchmark Create ML on M1 vs A14 or A12X? Perhaps
            | it is my ignorance, but I don't think we are at a point
            | where our existing benchmarks can be applied meaningfully
            | to the M1. I'm sure we will get there soon, though.
        
       | procrastinatus wrote:
       | One thing I haven't seen much mention of is getting things to run
       | on the M1's neural engine instead of the GPU - it seems like the
       | neural engine has ~3x more compute capacity and is specifically
       | optimized for this type of computation.
       | 
       | Has anyone spotted any work allowing a mainstream tensor library
       | (e.g. jax, tf, pytorch) to run on the neural engine?
        
         | lldbg wrote:
          | George Hotz got his "for play" tensor library[a] to run on the
          | Apple Neural Engine (ANE). The results were somewhat
          | disappointing, however, and currently it only does ReLU.
         | 
         | [a]: https://github.com/geohot/tinygrad
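          | 
          | For inference, the more common route is Core ML: convert a
          | trained model with coremltools and let the runtime decide
          | whether it lands on the ANE, GPU or CPU. Training on the ANE
          | isn't exposed this way. A sketch, assuming coremltools 4+ and
          | a Keras model:
          | 
          |     import coremltools as ct
          |     import tensorflow as tf
          | 
          |     # Any trained tf.keras model works; MobileNetV2 is just
          |     # a stand-in here.
          |     keras_model = tf.keras.applications.MobileNetV2(
          |         weights="imagenet")
          | 
          |     # The unified converter accepts tf.keras models directly;
          |     # Core ML picks the compute unit (ANE/GPU/CPU) at runtime.
          |     mlmodel = ct.convert(keras_model)
          |     mlmodel.save("MobileNetV2.mlmodel")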
        
       | mark_l_watson wrote:
       | I had the same experience. My M1 system does well on smaller
        | models compared to an Nvidia 1070 with 10GB of memory. My MacBook
       | Pro only has 8GB total memory. Large models run slowly.
       | 
       | I found setting up Apple's M1 fork of TensorFlow to be fairly
       | easy, BTW.
       | 
       | I am writing a new book on using Swift for AI applications,
       | motivated by the "niceness" of the Swift language and Apple's
       | CoreML libraries.
        
         | iluxonchik wrote:
         | do you happen to have a draft version available somewhere? i'm
         | diving into ML with Swift soon
        
       | fxtentacle wrote:
       | "trainable_params 12,810"
       | 
       |  _laughs_
       | 
       | (for comparison, GPT3: 175,000,000,000 parameters)
       | 
       | Can Apple's M1 help you train tiny toy examples with no real-
       | world relevance? You bet it can!
       | 
       | Plus it looks like they are comparing Apples to Oranges ;) This
       | seems to be 16 bit precision on the M1 and 32 bit on the V100. So
       | the M1-trained model will most likely yield worse or unusable
       | results, due to lack of precision.
       | 
       | And lastly, they are plainly testing against the wrong target.
       | The V100 is great, but it is far from NVIDIA's flagship for
       | training small low-precision models. At the FP16 that the M1 is
       | using, the correct target would have been an RTX 3090 or the
       | like, which has 35 TFLOPS. The V100 only gets 14 TFLOPS because
       | it lacks the dedicated TensorRT accelerator hardware.
       | 
       | So they compare the M1 against an NVIDIA model from 2017 that
       | lacks the relevant hardware acceleration and, thus, is a whopping
       | 60% slower than what people actually use for such training
       | workloads.
       | 
       | I'm sure my bicycle will also compare very favorably against a
       | car that is lacking two wheels :p
        
         | JacobSuperslav wrote:
          | Thanks for the thorough comment. The article is, unfortunately,
          | just clickbait.
        
           | joseph_grobbles wrote:
           | Thorough? Their comment is noisy snark.
           | 
           | A huge number of models are "small". I'm currently training
           | game units for autonomous behaviors. The M1 is _massively_
           | oversized for my need.
           | 
           | Saying "Oh look, GPT-3" just stupidifies the conversation,
           | and is classic dismissive nonsense.
        
           | coldtea wrote:
           | The comment is bogus empty snark (and factually wrong).
           | 
           | The arguments made (and I use the word arguments loosely):
           | 
            | "Too few trainable_params compared to GPT-3".
           | 
            | GPT-3 is several orders of magnitude larger than what people
            | train, and so it's a useless comparison. It's like we're
           | comparing a bike to an e-bike, and someone says "yeah, but
           | can the e-bike run faster than a rocket?"
           | 
            | Second argument: "Sure, it's faster than a machine that
            | costs 3-4 times more, but you should instead compare it to a
           | machine that costs even more than that".
           | 
           | I can only take it as a troll comment.
        
           | nvarsj wrote:
           | It seems like a common trend with M1 articles on HN lately.
        
         | YetAnotherNick wrote:
          | For the first graph: trainable parameters: 2,236,682
        
           | qayxc wrote:
           | So it's a toy model...
        
         | coolness wrote:
         | This, not to mention one could get the GPU usage on the V100
         | way higher by training with larger batch sizes, which would
         | also make training much faster.
        
         | jbverschoor wrote:
         | Even the RTX 3090 is double the price of an M1 for just 1 card.
         | 
         | The V100 is almost 5-10x the price of an M1.
        
         | iaml wrote:
          | GPT-3 is so big it would take 355 years to train on an Nvidia
         | V100, so your example is also not really useful for comparison.
         | It would be interesting to see some mid-sized nn benchmarks
         | though.
        
         | Firadeoclus wrote:
         | > The V100 only gets 14 TFLOPS because it lacks the dedicated
         | TensorRT accelerator hardware.
         | 
         | V100 has both vec2 hfma (i.e. fp16 multiply-add is twice the
         | rate of fp32), getting ~30 TFLOPS, and tensor cores which can
         | achieve up to 4x that for matrix multiplications.
        
         | apl wrote:
         | Hard disagree. V100s are a perfectly valid comparison point.
         | They're usually what's available at scale (on AWS, in private
         | clusters, etc.) because nobody's rolled out enough A100s at
         | this point. If you look at any paper from OpenAI et al.
         | (basically: not Google), you'll see performance numbers for
         | large V100 clusters.
        
           | fxtentacle wrote:
            | Yes, and you'll see parameters tuned for the V100, not
            | parameters tuned for the M1 somehow limping along on a
            | V100 in emulation
           | mode.
           | 
            | I wouldn't complain about a benchmark executing any real-
            | world SOTA model on the M1 and V100, but those will most
            | likely not even run on the M1 due to memory constraints.
           | 
            | So this article is like using an iOS game to evaluate a Mac
            | Pro. You can do it, but it's not really useful.
        
             | YetAnotherNick wrote:
              | You can count the number of GPUs having more memory than
              | the M1 (16 GB) on one hand.
        
               | oblio wrote:
               | Isn't the M1 GPU memory shared with everything else? Can
                | the GPU realistically use that much? Won't the OS and
               | base apps use up at least 2-3GB?
        
               | qayxc wrote:
               | The M1 can only address 8 GB with its NPU/GPU.
        
       ___________________________________________________________________
       (page generated 2021-01-14 23:02 UTC)