[HN Gopher] Analyzing the performance of Tensorflow training on ...
___________________________________________________________________
Analyzing the performance of Tensorflow training on M1 Mac Mini and
Nvidia V100
Author : briggers
Score : 219 points
Date : 2021-01-14 07:04 UTC (15 hours ago)
(HTM) web link (wandb.ai)
(TXT) w3m dump (wandb.ai)
| lopuhin wrote:
| > I chose MobileNetV2 to make iteration faster. When I tried
| ResNet50 or other larger models the gap between the M1 and Nvidia
| grew wider.
|
| (and that's on CIFAR-10). But why not report these results and
| also test on more realistic datasets? The internet is full of
| M1 TF benchmarks on CIFAR or MNIST; has anyone seen something
| different?
| sillysaurusx wrote:
| Hehe. That criticism could be applied to ML itself. :)
|
| I wish ML used more than CIFNISTNet, but unfortunately there's
| not a lot of standard datasets yet. (Even Imagenet is an
| absolute pain to set up.)
| sdenton4 wrote:
| Tensorflow Datasets includes a lot of the 'standard' datasets
| in a way that's dead simple to call up and use (including ~10
| variants of imagenet):
| https://www.tensorflow.org/datasets/catalog/overview
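|
| For example, a minimal sketch of pulling CIFAR-10 from it
| (assuming TF 2.4+ and the tensorflow-datasets package are
| installed):
|
|     import tensorflow as tf
|     import tensorflow_datasets as tfds
|
|     # Download/cache CIFAR-10 as (image, label) pairs.
|     ds_train = tfds.load("cifar10", split="train",
|                          as_supervised=True)
|
|     # Standard input pipeline: normalize, shuffle, batch,
|     # prefetch.
|     def normalize(x, y):
|         return tf.cast(x, tf.float32) / 255.0, y
|
|     ds_train = (ds_train.map(normalize)
|                 .shuffle(10_000)
|                 .batch(128)
|                 .prefetch(tf.data.AUTOTUNE))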
| JohnHaugeland wrote:
| "Can Apple's M1 do a good job? We cut things down to unrealstic
| sizes, turned off cores, and p-hacked as hard as we could until
| we found a way to pretend the answer was yes"
| whywhywhywhy wrote:
| >We can see better performance gains with the m1 when there are
| fewer weights to train likely due to the superior memory
| architecture of the M1.
|
| Wasn't this whole "M1 memory" thing decided to be a myth now some
| more technical people have dissected it?
| rsynnott wrote:
| As with many things, there isn't one "M1 memory" thing. It's a
| combination of myth and real stuff. No, it isn't ultra-low
| latency or high-bandwidth. But on the other hand, single core
| achievable bandwidth is very high.
| iforgotpassword wrote:
| Myth or not, its memory bandwidth is amazing, so I guess that
| helps.
| vletal wrote:
| Could you please provide some resources on how the unified
| memory model supposedly works? Why is it a "myth"?
| tyingq wrote:
| I believe it's referring not to unified memory, but some
| speculation that the memory being closer to the CPU makes
| some notable difference. That line was in a fair amount of
| the initial articles about the M1, and fits the "myth"
| description.
| gameswithgo wrote:
| I think two different memory things are being talked about:
|
| 1. There is an idea that M1 has RAM that is vastly higher
| bandwidth than Intel/AMD machines. In reality it is the same
| laptop DDR RAM that other machines have, though at a very high
| clock rate. Not higher than the best Intel laptops though. So
| the bandwidth is not any more amazing than a top end Intel
| laptop, and latency is no different.
|
| 2. But in this case I believe they are talking about the CPU
| and GPU both being able to freely access the same RAM, as
| compared to a setup where you have a discrete GPU with its own
| RAM, where data must first be copied to the GPU RAM for the GPU
| to do something with it. In some workloads this can be an
| inferior approach, in others it can be superior, as the GPU's
| RAM is faster. The M1 model again isn't unique, as it's similar
| to how game consoles work, I believe.
| dragontamer wrote:
| > In reality it is the same laptop ddr ram that other
| machines have
|
| LPDDR4 is better known for cell phones than laptops,
| actually. I think it shows the stagnation of the laptop
| market (and DDR4) that LPDDR4 is really catching up (and then
| some). Or maybe... because cell phones are more widespread
| these days, cell phones just naturally get the better tech?
|
| On the other hand, M1 is pretty wide. Apple clearly is
| tackling the memory bottleneck very strongly in its design.
|
| DDR5 is going to be the next major step forward for
| desktops/laptops.
|
| > 2. But in this case I believe they are talking about the
| CPU and GPU both being able to freely access the same RAM, as
| compared to a setup where you have a discrete GPU with its
| own RAM, where data must first be copied to the GPU RAM for
| the GPU to do something with it. In some workloads this can
| be an inferior approach, in others it can be superior, as the
| GPU's RAM is faster. The M1 model again isn't unique, as it's
| similar to how game consoles work, I believe.
|
| More than just the "same RAM", but probably even shares the
| same last-level cache. Both AMD's chips and Intel's iGPUs
| share the cache in their CPU/GPU hybrid architectures.
|
| However: it seems like on-core SIMD units (aka: AVX or ARM
| NEON / SVE) are even lower latency, since those share L1
| cache.
|
| Any situation where you need low latency but SIMD, it makes
| more sense to use AVX / SVE than even waiting for L3 cache to
| talk to the iGPU. Any situation where you need massive
| parallelism, a dedicated 3090 is more useful.
|
| It's going to be tough to figure out a good use of iGPUs:
| they're being squeezed on the latency front (by things like
| A64FX: 512-bit ARM SIMD, as well as AVX-512 on the Intel
| side), and also on the bandwidth front (by classic GPUs).
| coldtea wrote:
| No. Some technical people just gave their non-definitive two
| cents.
| helsinkiandrew wrote:
| Can someone with more knowledge of Nvidia GPU's please say how
| much the V100 costs ($5-10K?) compared with the $900 mac mini.
| fxtentacle wrote:
| You would instead buy a used 1080 (no ti) for similar
| performance.
|
| The special thing about the V100 is that its driver EULA
| allows data center usage. If you don't need that, there are
| other much cheaper options.
| littlestymaar wrote:
| > The special thing about the V100 is that its driver EULA
| allows data center usage.
|
| Wait what? Is it the only thing?
|
| That sounds hard to believe: if true, using the open driver
| (Nouveau) instead of Nvidia's proprietary one would be a
| massive money saver for datacenter operators (and even if
| Nouveau doesn't support the features you'd want already,
| supporting their development would be much cheaper for a
| company like Amazon than paying a premium on every GPU they
| buy)
| fsh wrote:
| Nouveau does not support CUDA and is therefore not usable
| for GPU computing on Nvidia hardware.
| YetAnotherNick wrote:
| NVIDIA's driver EULA prevents data centre use of their
| consumer hardware. Also, NVIDIA does not allow bulk buying of
| the RTX series.
| alickz wrote:
| They barely allow single buying for the 30 series :(
|
| Took me quite a while to get my hands on a 3080.
| jklehm wrote:
| What ended up working for you?
| rrss wrote:
| No, that's not the only thing.
|
| Other characteristics of V100 that may be interesting to
| people buying GPUs for data centers:
|
| - higher capacity GPU memory. 1080 has 8 GB, V100 has 16 or
| 32 GB.
|
| - higher bandwidth GPU memory. V100 has HBM2 with a peak of
| 900 GB/s, 1080 has G5X with a peak of ~300 GB/s.
|
| - ECC support.
|
| - data center certification + warranty
|
| (The geforce warranty covers normal consumer usage, like
| gaming, and does not cover datacenter use)
|
| - availability of enterprise support contracts.
|
| (If you are buying a ton of GPUs to put in a datacenter,
| you probably don't want to end up on the normal consumer
| support line when something goes wrong)
|
| - fast fp64
|
| There are probably others
| littlestymaar wrote:
| Thanks, that makes much more sense!
| sillysaurusx wrote:
| Don't buy hardware in general for AI work, IMO. It'll be out
| of date in a year and you'll end up training in the cloud
| anyway.
| dx034 wrote:
| If you properly utilize your hardware, on premise (or
| colocation in an area with cheap electricity prices) is
| vastly cheaper and will likely continue to be for a while.
| I don't see how training models in the cloud makes
| financial sense for organizations that can utilize their
| hardware 24/7.
|
| For all others with burst workloads training in the cloud
| can make sense, but that has been the case for a while
| already.
| sillysaurusx wrote:
| We're not talking about organizations, though. I don't
| agree with your premise, either. People aren't training
| models 24/7, so the idea that it's "vastly cheaper and
| will continue to be for a while" isn't true.
| king_magic wrote:
| > People aren't training models 24/7
|
| ... uh, you sure about that? Let me go check on the 3
| models I have concurrently training for my organization
| on 3 separate GPU servers (all 2 year old hardware to
| boot) that have been running continuously for the past 36
| hours. It pretty much works out to 24/7 training for the
| past several months.
|
| And BTW, this is _massively_ cheaper for us than training
| in the cloud.
| qayxc wrote:
| Instead of arguing back and forth, how about a test case
| instead?
|
| Pretraining BERT takes 44 minutes on 1024 V100 GPUs [1]
|
| This requires dedicated instances, since shared instances
| won't be able to get to peak performance if only because
| of the "noisy neighbour"-effect.
|
| At GCP, a V100 costs $2.48/h [2], so Microsoft's
| experiment would've cost $2,539.52.
|
| Smaller providers offer the same GPU at just $1.375/h
| [3], so a reasonable lower limit would be around $1,408.
|
| For a single BERT pretraining, provided highly optimised
| workflows and distributed training scripts are already at
| hand, renting a GPU for single training tasks seems to be
| the way to go.
|
| The cost of V100-equivalent end-user hardware (we don't
| need to run in a datacentre, dedicated workstations will
| do) is about $6,000 (e.g. a Quadro RTX 6000), provided
| you don't need double precision. The card will have equal
| FP32 performance, lower TGP, and VRAM that sits between
| the 16 GB and 32 GB versions of the V100.
|
| Workstation hardware to go with such a card will cost
| about $2,000, so $8,000 is a reasonable cost estimate.
| The cost of electricity varies between regions, but in
| the EU the average non-household price is about
| 0.13 EUR/kWh [4].
|
| Pretraining BERT therefore costs an estimated 1024 h *
| 0.13 EUR/kWh * 0.5 kW ≈ 67 EUR (roughly $80) in
| electricity (power consumption estimated from TGP +
| typical power consumption of an Intel Xeon workstation
| from my own measurements when training models).
|
| In order to get the break-even point we can use the
| following equation: t * $1,408 = $8,000 + t * $80, which
| gives t = 8,000/(1,408 - 80), i.e. t ≈ 6.
|
| In short, if you pretrain BERT 6 times, you save money by
| BUYING a workstation and running it locally over renting
| cloud GPUs from a reasonably cheap provider.
|
| This example only concerns BERT, but you can use the same
| reasoning for any model that you know the required
| compute time and VRAM requirements of.
|
| This only concerns training, too - inference is a whole
| different can of worms.
|
| [1] https://www.deepspeed.ai/news/2020/05/27/fastest-
| bert-traini...
|
| [2] https://cloud.google.com/compute/gpus-pricing
|
| [3] https://www.exoscale.com/syslog/new-tesla-v100-gpu-
| offering/
|
| [4] https://ec.europa.eu/eurostat/statistics-
| explained/index.php...
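|
| For anyone who wants to plug in their own numbers, the
| break-even arithmetic above is just a few lines of Python
| (the prices and wattage are the figures quoted in this
| comment, not verified; the exchange rate is a rough
| early-2021 assumption):
|
|     # Break-even between renting cloud GPUs and buying a
|     # workstation, using the figures quoted above.
|     gpu_hours_per_run = 1024     # ~44 min on 1024 V100s
|     cloud_rate_usd = 1.375       # $/GPU-hour, cheap provider
|     workstation_usd = 8000.0     # RTX 6000 + workstation
|     power_kw = 0.5               # estimated draw while training
|     electricity_eur_kwh = 0.13   # EU non-household average
|     eur_to_usd = 1.2             # rough early-2021 rate
|
|     cloud_per_run = gpu_hours_per_run * cloud_rate_usd
|     local_per_run = (gpu_hours_per_run * power_kw
|                      * electricity_eur_kwh * eur_to_usd)
|
|     # t * cloud_per_run = workstation_usd + t * local_per_run
|     t = workstation_usd / (cloud_per_run - local_per_run)
|     print(f"cloud per run: ${cloud_per_run:,.0f}")
|     print(f"local per run: ~${local_per_run:,.0f} (electricity)")
|     print(f"break-even after ~{t:.1f} training runs")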
| spi wrote:
| "Similar performance" still means 30%-50% slower [1] and half
| the RAM, not really that comparable.
|
| For much closer performance you should get a 2080ti, which
| should be roughly comparable in speed and have 11GB [edit:
| wrongly wrote 14GB before] of memory (against the 16GB for
| the V100). Price-wise you still save a lot of money, after
| quickly googling around, roughly $1200 vs. $15k-$20k.
|
| But you still lose something, e.g. if you use half precision
| on V100 you get virtually double speed, if you do on a 1080 /
| 2080 you get... nothing because it's not supported.
|
| (and more importantly for companies, you can actually use
| only V100-style stuff on servers [edit: as you mentioned
| already, although I'm not 100% sure it's just drivers that
| are the issue?])
|
| [1] I've not used 1080 myself, but I've used 1080ti and V100
| extensively, and the latter is about 30% faster. Hence my
| estimate for comparison with 1080
| FeepingCreature wrote:
| How does AMD stuff like Radeon VII or MI100 hold up?
| fxtentacle wrote:
| Can't use it because most AI frameworks won't run on AMD
| because they did not implement suitable back-ends (yet).
| breuleux wrote:
| There's one for PyTorch, I tested it about a year ago.
| You have to compile it from scratch and IIRC it
| translates/compiles CUDA to ROCm at runtime, which causes
| noticeable pauses on the first run. There may be other
| tweaks you have to do too. Once set up it performs
| decently, though.
| trott wrote:
| > But you still lose something, e.g. if you use half
| precision on V100 you get virtually double speed, if you do
| on a 1080 / 2080 you get... nothing because it's not
| supported.
|
| That's not true. FP16 is supported and can be fast on 2080,
| although some frameworks fail to see the speed-up. I filed
| a bug report about this a year ago:
| https://github.com/apache/incubator-mxnet/issues/17665
|
| What consumer GPUs lack is ECC and fast FP64.
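|
| In TensorFlow, for example, opting into FP16 compute while
| keeping FP32 master weights is a one-line policy change; a
| minimal sketch, assuming TF 2.4+ (with this policy, Keras
| applies loss scaling for you when you use Model.fit):
|
|     import tensorflow as tf
|     from tensorflow.keras import layers
|
|     # Compute in float16, keep variables in float32.
|     tf.keras.mixed_precision.set_global_policy("mixed_float16")
|
|     model = tf.keras.Sequential([
|         layers.Dense(512, activation="relu", input_shape=(784,)),
|         # Keep the final softmax in float32 for stability.
|         layers.Dense(10, activation="softmax", dtype="float32"),
|     ])
|     model.compile(optimizer="adam",
|                   loss="sparse_categorical_crossentropy",
|                   metrics=["accuracy"])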
| fxtentacle wrote:
| For my workload (optical flow) I was honestly surprised to
| see that the Google Cloud V100 was not faster than my local
| GTX 1080. So I guess that varies a lot by how you're
| training, too.
|
| For many of my AI training workloads, the 1080 is already
| "fast enough" and the CPU or SSDs are the bottleneck. In
| that case, the GPU doesn't really matter that much.
| spi wrote:
| Yes, that might be the case. I mostly trained big (tens to
| hundreds of millions of parameters) networks made largely
| of 3x3 convolutions, and I think the V100 has
| dedicated hardware for that. Then as I mentioned you can
| get a further 2x speedup by using half precision.
|
| If you train smaller models, or RNN, you probably lose
| most of the gains of dedicated hardware. But I guess that
| for this same reason the experiments in the article are
| little more than a provocation, I don't know if you could
| train a big network in finite time on M1 chips...
|
| That said, of course, if the budget was mine, I wouldn't
| buy a V100 :-)
| tpoacher wrote:
| Betteridge says no.
| coldtea wrote:
| And Betteridge is wrong.
| ilikedthatone wrote:
| Not reading it if I'm forced to use JS. Your ideas don't even
| matter if you require me to enable JS just to learn what they
| are.
| iforgotpassword wrote:
| This is OT, but since this comment is already here: I usually
| browse HN on WP8.1 with IE11. Now I don't expect people to care
| for that platform anymore in 2021, but in this case it was
| especially ridiculous since the page actually loaded the full
| article, but then about 5 seconds later it was replaced by a
| "oops something went wrong..." message.
| 0x008 wrote:
| Well, putting out a tl;dr and then a graph that does not mention
| FP16/FP32 performance differences or anything related to TensorRT
| cannot be taken seriously if we talk about performance per watt.
| We need to see a comparison that includes multiple scenarios
| so we can determine something like a break-even point between
| Nvidia GPUs and Apple M1 GPU, possibly even for several SotA
| models.
| baxter001 wrote:
| No, but it's pretty good at retraining the final layer of low
| memory networks like MobileNet - weirdly a workload that the V100
| is very poorly suited for...
| xiphias2 wrote:
| What about the M1X that will come with 64GB RAM? I'm thinking
| of waiting for that to come out. Ah...I just see that the
| article authors are waiting for it as well
| enos_feedler wrote:
| Not surprising since this is a training use case that Apple
| very much focuses on with CreateML.
| tbalsam wrote:
| This is on a model designed to run faster on CPUs. It's like
| dropping a bowling ball on your foot and claiming excitement that
| you feel bruised after a few days.
|
| Maybe there's something interesting there, definitely, but the
| overhype of the title takes away any significant amount of clout
| I'd give to the publishers for research. If you find something
| interesting, say it, and stop making vapid generalizations for
| the sake of more clicks.
|
| Remember, we only feed the AI hype bubble when we do this. It
| might be good results, but we need to be at least realistic about
| it, or there won't be an economy of innovation for people to
| listen to in the future, because they've tuned it out with all of
| the crap marketing that comes/came before it.
|
| Thanks for coming to my TED Talk!
| lukas wrote:
| I don't think MobileNetV2 is designed to train on CPUs -
| according to this https://azure.microsoft.com/en-us/blog/gpus-
| vs-cpus-for-depl... MobileNetV2 gets bigger gains from GPUs vs
| several CPUs than ResNet. You could argue the batch size
| doesn't fully use the V100 but these comparisons are tricky and
| this looks like fairly normal training to me.
|
| It's pretty surprising to me that an M1 performs anywhere near
| a V100 on model training and I guess the most striking thing is
| the energy efficiency of the M1.
| tbalsam wrote:
| MV2 is memory-limited; the depthwise + groups + 1x1 convs have
| a long launch time on GPU. Shattered kernels are fine for
| CPU, but not for GPU.
|
| Though per your note on the scales, those are really interesting
| empirical results. I'll have to look into that, thanks for
| passing that along.
| [deleted]
| SloopJon wrote:
| The first graph includes "Apple Intel", which is not mentioned
| anywhere else in the post. Any idea what hardware that was, and
| whether it used the accelerated TensorFlow?
| vanpelt wrote:
| My bad, this was using non-Accelerated TensorFlow on a 2.3GHz
| 8-Core i9.
| volta87 wrote:
| When developing ML models, you rarely train "just one".
|
| The article mentions that they explored a not-so-large hyper-
| parameter space (i.e. they trained multiple models with different
| parameters each).
|
| It would be interesting to know how long the whole process
| takes on the M1 vs the V100.
|
| For the small models covered in the article, I'd guess that the
| V100 can train them all concurrently using MPS (multi-process
| service: multiple processes can concurrently use the GPU).
|
| In particular it would be interesting to know, whether the V100
| trains all models in the same time that it trains one, and
| whether the M1 does the same, or whether the M1 takes N times
| more time to train N models.
|
| This could paint a completely different picture, particularly
| from the user's perspective. When I go for lunch, coffee, or
| home, I usually spawn jobs training a large number of models,
| such that when I get back, all these models are trained.
|
| I only start training a small number of models at the latter
| phases of development, when I have already explored a large part
| of the model space.
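|
| As a rough sketch of that "spawn a batch of jobs and walk
| away" workflow (train.py is a hypothetical training script;
| whether the runs actually overlap well on one GPU depends
| on MPS and on memory, as noted above):
|
|     import itertools
|     import subprocess
|
|     # One training process per hyper-parameter configuration.
|     # On a V100 the processes can share the GPU via MPS; on
|     # the M1 they compete for the same cores and unified
|     # memory.
|     learning_rates = [1e-2, 1e-3, 1e-4]
|     batch_sizes = [32, 64]
|
|     procs = []
|     for lr, bs in itertools.product(learning_rates, batch_sizes):
|         cmd = ["python", "train.py",
|                "--lr", str(lr), "--batch-size", str(bs)]
|         procs.append(subprocess.Popen(cmd))
|
|     # Go get coffee; wait for every run to finish.
|     for p in procs:
|         p.wait()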
|
| ---
|
| To make the analogy, what this article is doing is something
| similar to benchmarking a 64 core CPU against a 1 core CPU using
| a single threaded benchmark. The 64 core CPU happens to be
| slightly beefier and faster than the 1 core CPU, but it is more
| expensive and consumes more power because... it has 64x more
| cores. So to put things in perspective, it would make sense to
| also show a benchmark that can use 64x cores, which is the reason
| somebody would buy a 64-core CPU, and see how the single-core one
| compares (typically 64x slower).
|
| ---
|
| To me, the only news here is that Apple GPU cores are not very
| far behind NVIDIA's cores for ML training, but there is much more
| to a GPGPU than just the perf that you get for small models in a
| small number of cores. Apple would still need to (1) catch up,
| and (2) extremely scale up their design. They probably can do
| both if they set their eyes on it. Exciting times.
| nightcracker wrote:
| > When developing ML models, you rarely train "just one".
|
| Depends on your field. In Reinforcement Learning you often
| really do train _just one_, at least on the same data set
| (since the data set often is dynamically generated based on the
| behavior of the previous iteration of the model).
| volta87 wrote:
| Even in reinforcement learning you can train multiple models
| with different data-sets concurrently and combine them for
| the next iteration.
| sdenton4 wrote:
| The low GPU utilization rate in the first graph is kind of a
| tell... Seems like the M1 is a little bit worse than 40% of a
| V100?
| lukas wrote:
| Do you really train more than one model at the same time on a
| single GPU? In my experience that's pretty unusual.
|
| I completely agree with your conclusion here.
| junipertea wrote:
| I found training multiple models on the same GPU hits other
| bottlenecks (mainly memory capacity/bandwidth) fast. I tend
| to train one model per GPU and just scale the number of
| computers. Also, if nothing else, we tend to push the models
| to fit the GPU memory.
| jlouis wrote:
| CPUs often outperform specialized hardware on small models. This
| is nothing new. You'd need to go to a larger model, and then
| power consumption curves change too.
| StavrosK wrote:
| I'm seeing a lot of M1 hype, and I suspect most of it is
| unwarranted. I looked at comparisons between the M1 and the
| latest Ryzens, and it looks like it's comparable? Does anyone
| know details? I only looked summarily.
| ZeroCool2u wrote:
| The main hype is that performance is similar, but the M1 does
| it with a lot less power draw. The performance itself isn't too
| crazy. It's just crazy that it does it with a somewhat similar
| power draw to a high end phone.
| sradman wrote:
| I categorize this as an exploration of how to benchmark
| desktop/workstation NPUs [1] similar to the exploration Daniel
| Lemire started with SIMD. Mobile SoC NPUs are used to deploy
| inference models on smartphones and IoT devices while discrete
| NPUs like Nvidia A100/V100 target cloud clusters.
|
| We don't have apples-to-apples benchmarks like SPECint/SPECfp for
| the SoC accelerators in the M1 (GPU, NPU, etc.) so these early
| attempts are both facile and critical as we try to categorize and
| compare the trade-offs between the SoC/discrete and
| performance/perf-per-watt options available.
|
| Power efficient SoC for desktops is new and we are learning as we
| go.
|
| [1] https://en.m.wikipedia.org/wiki/AI_accelerator
| volta87 wrote:
| > We don't have apples-to-apples benchmarks
|
| We do: https://mlperf.org/
|
| Just run their benchmarks. Submitting your results there is a
| bit more complicated, because all results there are "verified"
| by independent entities.
|
| If you feel like your AI use case is not well represented by
| any of the MLPerf benchmarks, open a discussion thread about
| it, propose a new benchmark, etc.
|
| The set of benchmarks there increases all the time to cover new
| applications. For example, on top of the MLPerf Training and
| MLPerf Inference benchmark suites, we now have a new MLPerf HPC
| suite to capture ML of very large models.
| solidasparagus wrote:
| Those benchmarks are absurdly tuned to the hardware. Just
| look at the result Google gets with BERT on V100s vs the
| result NVIDIA gets with V100s. It's an interesting
| measurement of what experts can achieve when they modify
| their code to run on the hardware they understand well, but
| it isn't useful beyond that.
| volta87 wrote:
| > Just look at the result Google gets with BERT on V100s vs
| the result NVIDIA gets with V100s.
|
| These benchmarks measure the combination of
| hardware+software to solve a problem.
|
| Google and NVIDIA are using the same hardware, but their
| software implementation is different.
|
| ---
|
| The reason mlperf.org exists is to have a meaningful set of
| relevant practical ML problems that can be used to compare
| and improve hardware and software for ML.
|
| For any piece of hardware, you can create an ML benchmark
| that's irrelevant in practice, but perform much better on
| that hardware than the competition. That's what we used to
| have before mlperf.org was a thing.
|
| We shouldn't go back there.
| sradman wrote:
| > on top of the MLPerf Training and MLPerf Inference
| benchmark suites, we now have a new MLPerf HPC suite to
| capture ML of very large models.
|
| I think the challenge is selecting the tests that best
| represent the typical ML/DL use cases for the M1 and
| comparing it to an alternative such as the V100 using a
| common toolchain like Tensorflow. One of the problems that I
| see is that the optimizer/codegen of the toolchain is a key
| component; the M1 has both GPU and Neural Engine and we don't
| know which accelerator is targeted or even possibly both.
| Should we benchmark Create ML on M1 vs A14 or A12X? Perhaps
| it is my ignorance but I don't think we are at a point where
| our existing benchmarks can be applied meaningfully with the
| M1 but I'm sure we will get there soon.
| procrastinatus wrote:
| One thing I haven't seen much mention of is getting things to run
| on the M1's neural engine instead of the GPU - it seems like the
| neural engine has ~3x more compute capacity and is specifically
| optimized for this type of computation.
|
| Has anyone spotted any work allowing a mainstream tensor library
| (e.g. jax, tf, pytorch) to run on the neural engine?
| lldbg wrote:
| George Hotz got his "for play" tensor library[a] to run on the
| Apple Neural Engine (ANE). The results were somewhat
| disappointing, however, and currently it only does relu.
|
| [a]: https://github.com/geohot/tinygrad
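|
| A more common (if indirect) route today is to convert a trained
| model with coremltools and let Core ML decide at runtime whether
| to run it on the CPU, GPU, or Neural Engine - inference only,
| not training. A minimal sketch, assuming coremltools 4+:
|
|     import coremltools as ct
|     import tensorflow as tf
|
|     # Any trained Keras model will do; MobileNetV2 as an example.
|     keras_model = tf.keras.applications.MobileNetV2(
|         weights="imagenet")
|
|     # Convert to Core ML; the runtime picks the compute unit,
|     # which may or may not end up being the ANE.
|     mlmodel = ct.convert(keras_model)
|     mlmodel.save("MobileNetV2.mlmodel")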
| mark_l_watson wrote:
| I had the same experience. My M1 system does well on smaller
| models compared to an Nvidia 1070 with 10GB of memory. My MacBook
| Pro only has 8GB total memory. Large models run slowly.
|
| I found setting up Apple's M1 fork of TensorFlow to be fairly
| easy, BTW.
|
| I am writing a new book on using Swift for AI applications,
| motivated by the "niceness" of the Swift language and Apple's
| CoreML libraries.
| iluxonchik wrote:
| do you happen to have a draft version available somewhere? i'm
| diving into ML with Swift soon
| fxtentacle wrote:
| "trainable_params 12,810"
|
| _laughs_
|
| (for comparison, GPT3: 175,000,000,000 parameters)
|
| Can Apple's M1 help you train tiny toy examples with no real-
| world relevance? You bet it can!
|
| Plus it looks like they are comparing Apples to Oranges ;) This
| seems to be 16 bit precision on the M1 and 32 bit on the V100. So
| the M1-trained model will most likely yield worse or unusable
| results, due to lack of precision.
|
| And lastly, they are plainly testing against the wrong target.
| The V100 is great, but it is far from NVIDIA's flagship for
| training small low-precision models. At the FP16 that the M1 is
| using, the correct target would have been an RTX 3090 or the
| like, which has 35 TFLOPS. The V100 only gets 14 TFLOPS because
| it lacks the dedicated TensorRT accelerator hardware.
|
| So they compare the M1 against an NVIDIA model from 2017 that
| lacks the relevant hardware acceleration and, thus, is a whopping
| 60% slower than what people actually use for such training
| workloads.
|
| I'm sure my bicycle will also compare very favorably against a
| car that is lacking two wheels :p
| JacobSuperslav wrote:
| thanks for the thorough comment. the article is, unfortunately,
| just clickbait.
| joseph_grobbles wrote:
| Thorough? Their comment is noisy snark.
|
| A huge number of models are "small". I'm currently training
| game units for autonomous behaviors. The M1 is _massively_
| oversized for my need.
|
| Saying "Oh look, GPT-3" just stupidifies the conversation,
| and is classic dismissive nonsense.
| coldtea wrote:
| The comment is bogus empty snark (and factually wrong).
|
| The arguments made (and I use the word arguments loosely):
|
| "Too few trainable_params compared to GTP3".
|
| GTP3 is several orders of magnitude higher than what people
| train, and so it's a useless comparison. It's like we're
| comparing a bike to an e-bike, and someone says "yeah, but
| can the e-bike run faster than a rocket?"
|
| Second argument "Sure, it's faster than a machine that costs
| 3-4 fives more, but you should instead compare it to a
| machine that costs even more than that".
|
| I can only take it as a troll comment.
| nvarsj wrote:
| It seems like a common trend with M1 articles on HN lately.
| YetAnotherNick wrote:
| For the first graph: trainable parameters: 2,236,682
| qayxc wrote:
| So it's a toy model...
| coolness wrote:
| This, not to mention one could get the GPU usage on the V100
| way higher by training with larger batch sizes, which would
| also make training much faster.
| jbverschoor wrote:
| Even the RTX 3090 is double the price of an M1 for just 1 card.
|
| The V100 is almost 5-10x the price of an M1.
| iaml wrote:
| GPT3 is so big it would take 355 years to train on an Nvidia
| V100, so your example is also not really useful for comparison.
| It would be interesting to see some mid-sized nn benchmarks
| though.
| Firadeoclus wrote:
| > The V100 only gets 14 TFLOPS because it lacks the dedicated
| TensorRT accelerator hardware.
|
| V100 has both vec2 hfma (i.e. fp16 multiply-add is twice the
| rate of fp32), getting ~30 TFLOPS, and tensor cores which can
| achieve up to 4x that for matrix multiplications.
| apl wrote:
| Hard disagree. V100s are a perfectly valid comparison point.
| They're usually what's available at scale (on AWS, in private
| clusters, etc.) because nobody's rolled out enough A100s at
| this point. If you look at any paper from OpenAI et al.
| (basically: not Google), you'll see performance numbers for
| large V100 clusters.
| fxtentacle wrote:
| Yes, and you'll see parameters tuned for V100, not parameters
| tuned for M1 somehow limping along on a V100 in emulation
| mode.
|
| I wouldn't complain about a benchmark executing any real
| world SOTA model on m1 and V100, but those will most likely
| not even run on the M1 due to memory constraints.
|
| So this article is like using an iOS game to evaluate a Mac
| Pro. You can do it, but it's not really useful.
| YetAnotherNick wrote:
| You can count the number of GPUs having more than the M1's
| memory (16 GB) on one hand.
| oblio wrote:
| Isn't the M1 GPU memory shared with everything else? Can
| the GPU realistically use that much? Won't the OS and
| base apps use up at least 2-3GB?
| qayxc wrote:
| The M1 can only address 8 GB with its NPU/GPU.
___________________________________________________________________
(page generated 2021-01-14 23:02 UTC)