[HN Gopher] Whisper: Nvidia RTX 4090 vs. M1 Pro with MLX
___________________________________________________________________
Whisper: Nvidia RTX 4090 vs. M1 Pro with MLX
Author : interpol_p
Score : 311 points
Date : 2023-12-13 14:52 UTC (8 hours ago)
(HTM) web link (owehrens.com)
(TXT) w3m dump (owehrens.com)
| whywhywhywhy wrote:
| I find these findings questionable unless Whisper was very
| poorly optimized in the way it was run on the 4090.
|
| I have a 3090 and an M1 Max 32GB, and although I haven't tried
| Whisper, the inference difference on Llama and Stable Diffusion
| between the two is staggering, especially with Stable Diffusion,
| where SDXL takes about 9 seconds on the 3090 and about 1 minute
| 10 seconds on the M1 Max.
| agloe_dreams wrote:
| It's all really messy; I would assume that almost any model is
| poorly optimized to run on Apple Silicon as well.
| kamranjon wrote:
| There has been a ton of optimization of Whisper with regard to
| Apple Silicon; whisper.cpp is a good example that takes
| advantage of this. Also, this article is specifically
| referencing the new Apple MLX framework, which I'm guessing your
| tests with Llama and Stable Diffusion weren't utilizing.
| sbrother wrote:
| I assume people are working on bringing a MLX backend to
| llama.cpp... Any idea what the state of that project is?
| tgtweak wrote:
| https://github.com/ml-explore/mlx-examples
|
| Several people are working on MLX-enabled backends for
| popular ML workloads, but it seems inference workloads are
| the most accelerated vs. generative/training.
| oceanplexian wrote:
| The M1 Max has 400GB/s of memory bandwidth and a 4090 has 1TB/s;
| the M1 Max has 32 GPU cores and a 4090 has 16,000. The
| difference is more about how well the software is
| optimized for the hardware platform than any performance
| difference between the two, which are frankly not comparable in
| any way.
| codedokode wrote:
| I think the 4090 has 16000 ALUs, not "cores" (let's call a
| component capable of executing instructions independently of
| the others a "core"). And the M1 Max probably has more than 1
| ALU in every core; otherwise it would resemble an ancient GPU.
| rsynnott wrote:
| Yeah; 'core' is a pretty meaningless term when it comes to
| GPUs, or at least it's meaningless outside the context of a
| particular architecture.
|
| We may just be thankful that this particular bit of
| marketing never caught on for CPUs.
| segfaultbuserr wrote:
| > _M1 Max has 32 GPU cores and a 4090 has 16,000._
|
| Apple M1 Max has 32 GPU _cores_ , each core contains 16
| _Execution Units_ , each EU has 8 _ALUs_ (also called
| _shaders_ ), so overall there are 4096 shaders. Nvidia RTX
| 4090 contains 12 _Graphics Processing Clusters_ , each GPC
| has 12 _Streaming Multi-Processors_ , and each SM has 128
| _ALUs_ , overall there are 18432 shaders.
|
| A single shader is somewhat similar to a single lane of a
| vector ALU in a CPU. One can say that a single-core CPU with
| AVX-512 has 8 shaders, because it can process 8 FP64s at the
| same time. Calling them "cores" (as in "CUDA core") is
| extremely misleading, which is why "shader" became the common
| name for a GPU's ALU. If Nvidia were in charge of marketing
| a 4-core x86-64 CPU, they would call it a CPU with 32 "AVX
| cores" because each core has 8-way SIMD.
| jrk wrote:
| Actually each of those x86 CPUs probably has at least two
| AVX FMA units, and can issue 16xFP32 FMAs per cycle - it's
| at least "64 AVX cores"! :)
| kimixa wrote:
| Doesn't Zen 4 have 2x 256-bit FADD and 2x 256-bit FMA, and
| with AVX-512 ops it double-pumps the ALUs (a good overview
| here [0])? If you count FADD as a single flop and FMA as
| 2, that's 48 "1 flop cores" per core.
|
| I think it's got the same total FP ALU resources as zen3,
| and shows how register width and ALU resources can be
| completely decoupled.
|
| [0] https://www.mersenneforum.org/showthread.php?p=614191
| stonemetal12 wrote:
| Nvidia switched to marketing speak a long time ago when it
| came to the word "core". If we go with Nvidia's definition
| then M1 Max has 4096 cores, still behind the 4090, but the
| gap isn't as big as 32 to 16k.
| ps wrote:
| I have 4090 and M1 Max 64GB. 4090 is far superior on Llama 2.
| astrodust wrote:
| On models < 24GB presumably. "Faster" depends on the model
| size.
| brucethemoose2 wrote:
| In this case, the 4090 is far more memory efficient thanks
| to ExLlamav2.
|
| 70B in particular is indeed a significant compromise on the
| 4090, but not as much as you'd think. 34B and down though,
| I think Nvidia is unquestionably king.
| michaelt wrote:
| Doesn't running 70B in 24GB need 2 bit quantisation?
|
| I'm no expert, but to me that sounds like a recipe for
| bad performance. Does a 70B model in 2-bit really
| outperform a smaller-but-less-quantised model?
| jb1991 wrote:
| But are you using the newly released Apple MLX optimizations?
| ps wrote:
| It's been approximately 2 months since I have tested it, so
| probably not.
| jb1991 wrote:
| But those optimizations are the subject of the article
| you are commenting on.
| woadwarrior01 wrote:
| You're taking benchmark numbers from a latent diffusion model's
| (SDXL) inference and extrapolating them to an encoder-decoder
| transformer model's (Whisper) inference. These two model
| architectures have little in common (except perhaps the fact
| that Stable Diffusion models use a pre-trained text encoder
| from CLIP, which again is very different from an encoder-
| decoder transformer).
| brucethemoose2 wrote:
| The point still stands though. Popular models tend to have
| massively hand-optimized Nvidia implementations.
|
| Whisper is no exception:
| https://github.com/Vaibhavs10/insanely-fast-whisper
|
| SDXL is actually an interesting exception _for Nvidia_
| because most users still tend to run it in PyTorch eager
| mode. There are super optimized Nvidia implementations, like
| stable-fast but their use is less common. Apple, on the other
| hand, took the odd step of hand writing a Metal
| implementation themselves, at least for SD 1.5.
| tgtweak wrote:
| Modest 30x speedup
| kkielhofner wrote:
| This will determine who has a shot at actually being Nvidia
| competitive.
|
| What I like to say is (generally speaking) other
| implementations like AMD (ROCm), Intel, Apple, etc are
| more-or-less at the "get it to work" stage. Due to their
| early lead and absolute market dominance Nvidia has been at
| the "wring every last penny of performance out of this"
| stage for years.
|
| Efforts like this are a good step but they still have a
| very long way to go to compete with multiple layers
| (throughout the stack) of insanely optimized Nvidia/CUDA
| implementations. Bonus points: nearly anything with Nvidia
| is a Docker command that just works on any chip they've
| made in the last half decade, from laptop to datacenter.
|
| This can be seen (dramatically) with ROCm. I recently took
| the significant effort (again) to get an LLM to run on an
| AMD GPU. The AMD GPU is "cheaper" in initial cost, but when
| the roughly dollar-equivalent (to within 10-30%) Nvidia GPU
| is 5-10x faster (or whatever), you're not saving anything.
|
| You're already at a loss just getting it to work (random
| patches, version hacks, etc.) unless your time is free, and
| then the performance just isn't even close, so the "value
| prop" of AMD currently doesn't make any sense whatsoever.
| The advantage for Apple is you likely spent whatever for
| the machine anyway, and when you have it just sitting in
| front of you for a variety of tasks the value prop
| increases significantly.
| woadwarrior01 wrote:
| Although LDM inference and encoder-decoder and decoder-only
| LLM inference are both fundamentally autoregressive in
| nature, LLM inference is memory bound and
| LDM inference is compute bound. In that light, it makes
| sense that the difference between a 4090 and M1 Pro isn't
| as pronounced as one would expect at first approximation.
|
| Also, as you hint whisper.cpp certainly isn't one of the
| fastest implementations of whisper inference out there.
| Perhaps a comparison between a pure PyTorch version running
| on the 4090 with an MLX version of Whisper running on the
| M1 Pro would be fairer. Or better yet, run the whisper
| encoder on ANE with CoreML and have the decoder running
| with Metal and Accelerate (which uses Apple's undocumented
| AMX ISA) using MLX, since MLX currently does not use the
| ANE. IIRC, whisper.cpp has a similar optimization on Apple
| hardware, where it optionally runs the encoder using CoreML
| and the decoder using Metal.
| KingOfCoders wrote:
| "I haven't tried Whisper"
|
| I haven't tried the hardware/software/framework/... of the
| article, but I have an opinion on this exact topic.
| xxs wrote:
| The topic is benchmarking some hardware and specific
| implementation of some tool.
|
| The provided context is an earlier version of the hardware,
| where known implementations perform drastically differently,
| by an order of magnitude.
|
| That leaves the question why that specific tool exhibits the
| behavior described in the article.
| tgtweak wrote:
| Reading through some (admittedly very early) MLX docs, it
| seems that convolutions (as used heavily in GANs and
| particularly Stable Diffusion) are not really seeing meaningful
| uplifts on MLX at all, and in some cases are slower than on the
| CPU.
|
| Not sure if this is a hardware limitation or just unoptimized
| MLX libraries, but I find it hard to believe they would have
| just ignored this very prominent use case. It's more likely
| that convolutions use high precision and much larger tile sets
| that require some expensive context switching when the entire
| transform can't fit in the GPU.
| liuliu wrote:
| Both your SDXL 3090 and M1 Max numbers should be faster (of
| course, it depends on how many steps). But the point stands:
| for SDXL, a 3090 should be 5x to 6x faster than an M1 Max and
| 2x to 2.5x faster than an M2 Ultra.
| stefan_ wrote:
| Having used whisper a ton, there are versions of it that have
| one or two orders of magnitude better performance at the same
| quality while using less memory, for reasons I don't fully
| understand.
|
| So I'd be very careful about your intuition on whisper
| performance unless it's literally the same software and same
| model (and then the comparison isn't very meaningful still,
| seeing how we want to optimize it for different platforms).
| mv4 wrote:
| Thank you for sharing this data. I've just been debating
| between M2 Mac Studio Max and a 64GB i9 10900x with RTX 3090
| for personal ML use. Glad I chose the 3090! Would love to learn
| more about your setup.
| Flux159 wrote:
| How does this compare to insanely-fast-whisper though?
| https://github.com/Vaibhavs10/insanely-fast-whisper
|
| I think that not using optimizations allows this to be a 1:1
| comparison, but if the optimizations are not ported to MLX, then
| it would still be better to use a 4090.
|
| Having looked at MLX recently, I think it's definitely going to
| get traction on Macs - and iOS when Swift bindings are released
| https://github.com/ml-explore/mlx/issues/15 (although there might
| be some C++20 compilation issue blocking right now).
| brucethemoose2 wrote:
| This is the thing about Nvidia. Even if some hardware beats
| them in a benchmark, if it's a popular model, there will be
| some massively hand-optimized CUDA implementation that blows
| anything else out of the water.
|
| There are some rare exceptions (like GPT-Fast on AMD thanks to
| PyTorch's hard work on torch.compile, and only in a narrow use
| case), but I can't think of a single one for Apple Silicon.
| MBCook wrote:
| I wouldn't be surprised if a $2k top-of-the-line GPU is a
| match for, or better than, the built-in accelerator on a Mac.
| Even if the Mac were slightly faster you could just stick
| multiple GPUs in a PC.
|
| To me the news here is how well the Mac runs without needing
| that additional hardware/large power draw on this benchmark.
| rfoo wrote:
| > but I can't think of a single one for Apple Silicon.
|
| The post here is exactly one for Apple Silicon. It compared a
| naive implementation in PyTorch, which may not even keep the
| 4090 busy (for smaller/not-that-compute-intensive models,
| having the entire computation driven by Python is... limiting,
| which is partly why torch.compile gives amazing improvements),
| to a purpose-optimized one (optimized for both CPU/GPU
| efficiency) for Apple Silicon.
| jeroenhd wrote:
| The one being benchmarked here is heavily optimised for Apple
| Silicon. I think there are a few algorithms that Apple uses
| (like the one tagging faces on iPhones) that are heavily
| optimised for Apple's own hardware.
|
| I think Apple's API would be as popular as CUDA if you could
| rent their chips at scale. They're quite efficient machines
| that don't need a lot of cooling, so I imagine the OPEX of
| keeping them running 24/7 in big cloud racks would be pretty
| low if they were optimised for server usage.
|
| Apple seems to focus their efforts on bringing purpose-built
| LLMs to Apple machines. I can see why it makes sense (just
| like Google's attempts to bring Tensor cores to mobile) but
| there's not much practical use in this technology right now.
| Whisper is the first usable technology like this, but even my
| Android phone can live-transcribe speech into text as an
| accessibility feature, so I don't think Apple can sell Whisper
| as a product to end users.
| jdminhbg wrote:
| > The one being benchmarked here is heavily optimised for
| Apple Silicon.
|
| I don't think so, in the sense of a hand-optimized CUDA
| implementation. This is just using the MLX API in the same way
| that you'd use CUDA via PyTorch or something.
| mac-mc wrote:
| Apple would need to make rackmount versions of the machines
| with replaceable storage and maybe RAM, and would need to
| really beef up their headless management systems for the
| machines, before they start becoming competitive.
|
| Otherwise you need a whole bunch of custom mac mini style
| racks and management software which really increases costs
| and lead times. If you don't believe me, look how expensive
| AWS macOS machines are compared to linux ones with
| equivalent performance.
| poyu wrote:
| They already make rack mount Mac Pros. But yeah they need
| to up their game on the management software
| claytonjy wrote:
| To have a good comparison I think we'd need to run the
| insanely-fast-whisper code on a 4090. I bet it handily beats
| both the benchmarks in OP, though you'll need a much smaller
| batch size than 24.
|
| You can beat these benchmarks on a CPU; 3-4x realtime is very
| slow for whisper these days!
| chrisbrandow wrote:
| he updated with insanely-fast
| bcatanzaro wrote:
| What precision is this running in? If 32-bit, it's not using the
| tensor cores in the 4090.
| DeathArrow wrote:
| Ok, OpenAI will ditch Nvidia and buy macs instead. :)
| baldeagle wrote:
| Only if Sam Altman is appointed to the Apple board. ;)
| tiffanyh wrote:
| Key to this article is understanding it's leveraging the newly
| released Apple MLX, and their code is using these Apple specific
| optimizations.
|
| https://news.ycombinator.com/item?id=38539153
| modeless wrote:
| Also, this is not comparing against an optimized Nvidia
| implementation. There are faster implementations of Whisper.
|
| Edit: OK I took the bait. I downloaded the 10 minute file he
| used and ran it on _my_ 4090 with insanely-fast-whisper, which
| took two commands to install. Using whisper-large-v3 the file
| is transcribed in less than eight seconds. Fifteen seconds if
| you include the model loading time before transcription starts
| (obviously this extra time does not depend on the length of the
| audio file).
|
| That makes the 4090 somewhere between 6 and 12 _times_ faster
| than Apple 's best. It's also _much_ cheaper than M2 Ultra if
| you already have a gaming PC to put it in, and _still_ cheaper
| even if you buy a whole prebuilt PC with it.
|
| This should not be surprising to people, but I see a lot of
| wishful thinking here from people who own high end Macs and
| want to believe they are good at everything. Yes, Apple's
| M-series chips are very impressive and the large RAM is great,
| but they are _not_ competitive with Nvidia at the high end for
| ML.
| isodev wrote:
| It also wasn't optimised for Apple Silicon. Given how the
| different platforms performed in this test, the conclusions
| seem pretty solid.
| jbellis wrote:
| He is literally comparing whisper.cpp on the 4090 with an
| optimized-for-apple-silicon-by-apple-engineers version on
| the M1.
|
| ETA: actually it's unclear from the article if the whisper
| optimizations were done by apple engineers, but it's
| definitely an optimized version.
| rowanG077 wrote:
| I don't think whisper was optimized for apple silicon.
| Doesn't it just use MLX? I mean if using an API for a
| platform counts as specifically optimized then the Nvidia
| version is "optimized" as well since it's probably using
| CUDA.
| isodev wrote:
| Maybe I'm not seeing it right, but comparing the source
| of Apple's Whisper to the Python Whisper, it seems there are
| minimal changes to redirect certain operations to using
| MLX.
|
| There is also whisper.cpp
| (https://github.com/ggerganov/whisper.cpp), which seems to
| have its own kind of optimizations for Apple Silicon - I
| don't think this was the one used with Nvidia during the
| test.
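|
| For context, MLX deliberately mirrors the NumPy API, so a port
| can mostly be a matter of swapping the array module. A minimal
| illustration of the flavor of such a change (not the actual
| Whisper port; assumes `pip install mlx` on Apple Silicon):
|
|     import numpy as np
|     import mlx.core as mx
|
|     x_np = np.ones((4, 4), dtype=np.float32)
|     x_mx = mx.ones((4, 4), dtype=mx.float32)
|
|     y_np = np.matmul(x_np, x_np) * 2.0   # NumPy: eager
|     y_mx = mx.matmul(x_mx, x_mx) * 2.0   # MLX: builds a lazy graph
|     mx.eval(y_mx)                        # force the MLX computation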
| swores wrote:
| Would you be so kind as to link to a guide for your method or
| share it in a comment yourself?
|
| I installed following the official docs and found it much,
| much slower, although I sadly don't have a 4090, instead a
| 3080 Ti 12GB (just big enough to load the large whisper model
| into GPU memory).
| modeless wrote:
| I'm running on Linux with a 13900k, 64 GB RAM, and I
| already have CUDA installed. Install commands directly from
| the README:
|
|     pipx install insanely-fast-whisper
|     pipx runpip insanely-fast-whisper install flash-attn --no-build-isolation
|
| To transcribe the file:
|
|     insanely-fast-whisper --flash True \
|       --file-name ~/Downloads/podcast_1652_was_jetzt_episode_1289963_update_warum_streiken_sie_schon_wieder_herr_zugchef.mp3 \
|       --language german --model-name openai/whisper-large-v3
|
| The file can be downloaded at: https://adswizz.podigee-
| cdn.net/version/1702050198/media/pod...
|
| I just ran it again and happened to get an even better
| time, under 7 seconds without loading and 13.08 seconds
| including loading. In case anyone is curious about the use
| of Flash Attention, I tried without it and transcription
| took under 10 seconds, 15.3 including loading.
| owehrens wrote:
| I just can't get it to work, it errors out with
| 'NotImplementedError: The model type whisper is not yet
| supported to be used with BetterTransformer.' Did you
| happen to run into this problem?
| modeless wrote:
| Sorry, I didn't encounter that error. It worked on the
| first try for me. I have wished many times that the ML
| community didn't settle on Python for this reason...
| owehrens wrote:
| Thanks so much again. Got it working. 8 seconds. Nvidia
| is the king. Updated the blog post.
| darkteflon wrote:
| I think insanely-fast-whisper uses batching, so faster-
| whisper (which doesn't) might be a fairer comparison for
| the purposes of your post.
| swores wrote:
| Thanks!
|
| Another question that's only slightly related, but while
| we're here...
|
| Using OAI's paid Whisper API, you can give a text prompt
| to a) set the tone/style of the transcription and b)
| teach it technical terms, names etc that it might not be
| familiar with and should expect in the audio to
| transcribe.
|
| Am I correct that this isn't possible with any released
| versions of Whisper, or is there a way to do it on my
| machine that I'm not aware of?
| modeless wrote:
| You can definitely do this with the open source version.
| Many transcription implementations use it to maintain
| context between the max-30-second chunks Whisper natively
| supports.
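|
| With the reference openai-whisper Python package it's the
| `initial_prompt` argument (faster-whisper exposes a parameter
| of the same name, I believe). A minimal sketch; the file name
| and prompt text here are made up:
|
|     import whisper
|
|     model = whisper.load_model("large-v3")
|     result = model.transcribe(
|         "meeting.mp3",
|         language="en",
|         # bias the decoder toward names/jargon it should expect
|         initial_prompt="Speakers: Priya, Dr. Okonkwo. Topics: MLX, CUDA, RTX 4090.",
|     )
|     print(result["text"])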
| swores wrote:
| I'll try to understand some of how stuff like faster-
| whisper works when I've got time over the weekend, but I
| fear it may be too complex for me...
|
| I was rather hoping for a guide of just how to either
| adapt classic whisper usage or adapt one of the optimised
| ones like faster-whisper (which I've just set up in a
| docker container but that's used up all the time I've got
| for playing around right now) to take a text prompt with
| the audio file.
| sundvor wrote:
| Cheers, I've been wanting to get into doing something
| else with my 4090 other than multi-monitor simulator
| gaming and quad-screen workstation work - and this will get
| me kicked off!
|
| The 4090 is an absolute beast, runs extremely quiet and
| simply powers through everything. DCS pushes it to the
| limit, but the resulting experience is simply stunning.
| Mine's coupled to a 7800x3d which uses hardly any power
| at all, absolutely love it.
| modeless wrote:
| If you're looking for something easy to try out, try my
| early demo that hooks Whisper to an LLM and TTS so you
| can have a real time speech conversation with your local
| GPU that feels like talking to a person! It's way faster
| than ChatGPT:
| https://apps.microsoft.com/detail/9NC624PBFGB7
| owehrens wrote:
| You mean 'https://github.com/Vaibhavs10/insanely-fast-
| whisper'? Did not know that until now. I've been running all of
| this for over ~10 months and just had it running. Happy to
| try that out. The GPU is fully utilized using whisper and
| PyTorch with CUDA and all. Thanks for the link.
| WhitneyLand wrote:
| I'm afraid the article as well as your benchmarks can be
| misleading because there are a lot of different whisper
| implementations out there.
|
| For ex ctranslate optimized whisper "implements a custom
| runtime that applies many performance optimization techniques
| such as weights quantization, layers fusion, batch
| reordering..."
|
| Intuitively, I would agree with your conclusion about Apple's
| M-Series being impressive for what they do but not generally
| competitive with Nvidia in ML.
|
| Objectively however, I don't see concluding much with what's
| on offer here. Once you start changing libraries, kernels,
| transformer code, etc you end up with an apples to oranges
| comparison.
| modeless wrote:
| I think it's fair to compare the fastest available
| implementation on both platforms. I suspect that the MLX
| version can be optimized further. However, it will not
| close a 10x gap.
| darkteflon wrote:
| Surely the fact that IFW uses batching makes it apples to
| oranges? The MLX-enabled version didn't batch, did it? That
| fundamentally changes the nature of the operation. Wouldn't
| the better comparison be faster-whisper?
| modeless wrote:
| I don't know exactly what the MLX version did, but you're
| probably right. I'd love to see the MLX side optimized to
| the max as well. I'm confident that it would not reach the
| performance of the 4090, but it might do better.
|
| That said, for practical purposes, the ready availability
| of Nvidia-optimized versions of every ML system is a big
| advantage in itself.
| darkteflon wrote:
| Yeah, I think everyone knows that Nvidia is doing a
| cracker job. But it is good to just be specific about
| these benchmarks because numbers get thrown around and it
| turns out people are testing different things. The other
| thing is that Apple is extracting this performance on a
| laptop, at ~1/8 the power draw of the desktop Nvidia
| card.
|
| In any event, it's super cool to see such huge leaps just
| in the past year on how easy it is to run this stuff
| locally. Certainly looking very promising.
| modeless wrote:
| The M2 Ultra that got the best numbers that I was
| comparing to is not in a laptop. Regardless, you're
| probably right that the power consumption is
| significantly lower per unit time. However, is it lower
| per unit work done? It would be interesting to see a
| benchmark optimized for power. Nvidia's power consumption
| can usually be significantly reduced without much
| performance cost.
|
| Also, the price difference between a prebuilt 4090 PC and
| a M2 Ultra Mac Studio can buy a lot of kilowatt hours.
| justinclift wrote:
| > Nvidia at the high end for ML.
|
| Wouldn't the high end for Nvidia be their dedicated gear
| rather than a 4090?
| modeless wrote:
| Yes, H100 would be faster still, and Grace Hopper perhaps
| even somewhat faster. But Apple doesn't have comparable
| datacenter-only products, so it's also interesting to see
| the comparison of consumer hardware. Also, 4090 is cheaper
| than Apple's best, but H100 is more than both (if you are
| even allowed to buy it).
| intrasight wrote:
| Honest question: Why would I (or most users) care? If I have a
| Mac, I'm going to get the performance of that machine. If I
| have a gaming PC, I'll get the performance of that machine. If
| I have both, I'm still likely to use whichever AI is running on
| my daily driver.
| tomatotomato31 wrote:
| Curiosity for example.
|
| Or having a better image of performance in your head when you
| buy new hardware.
|
| It's just a blog article. The time and effort to make it and
| to consume it is not in the range of millions there is no
| need for 'more'.
| winwang wrote:
| I have both. I drive both depending on context of my day and
| where I want to work. If I were interested in this, and the
| 4090 were faster, I'd just ssh into that machine from my Mac
| if I wanted to use my Mac -- and presumably, vice versa.
| shikon7 wrote:
| Suppose you're a serious hobbyist, wanting to buy the most
| powerful consumer device for deep learning. Then the choice
| of buying a RTX 4090 or a M3 Mac is probably quite
| interesting.
|
| Personally, I have already a RTX 3090, so for me it's
| interesting if a M3 would be a noticeable upgrade
| (considering RAM, speed, library support)
| light_hue_1 wrote:
| It's really not interesting. The Nvidia card is far
| superior.
|
| The super optimized code for a network like this is one
| test case. General networks are another.
| mi_lk wrote:
| > The Nvidia card is far superior.
|
| I mean, you're probably right, but MLX framework was
| released just a week ago so maybe we don't really know
| what it's capable of yet.
| mac-mc wrote:
| The "available VRAM" is the big difference. You get a lot
| more to use on an apple silicon machine for much cheaper
| than you get on an nvidia card on a $/GB basis,
| especially if you want to pass the 24GB threshold. You
| can't use NVIDIA consumer cards after that and there is a
| shortage of them.
|
| The power:perf ratios seem about equal though, so it's
| really up to Apple to release equivalently sized silicon
| on the desktop to really make this a 1:1 comparison.
| swores wrote:
| I wonder how complicated it would be to make happen / how
| likely it is to happen, for the Apple way of sharing
| memory between CPU and GPU to become a thing on PCs.
| Could NVIDIA do it unilaterally or would it need to be in
| partnership with other companies?
|
| Could anyone who understands this hardware better than me
| chime in on the complexities of bringing unified ram/vram
| to PC, or reasons it hasn't happened yet?
| Lalabadie wrote:
| There will be a lot of debate about which is the absolute best
| choice for X task, but what I love about this is the level of
| performance at such a low power consumption.
| SlavikCA wrote:
| It's easy to run Whisper on my Mac M1. But it's not using MLX out
| of the box.
|
| I spent an hour or two trying to figure out what I need to
| install / configure to enable it to use MLX. Was getting cryptic
| Python errors, Torch errors... Gave up on it.
|
| I rented a VM with a GPU and started Whisper on it within a few
| minutes.
| xd1936 wrote:
| I've really enjoyed this macOS Whisper GUI[1]. It doesn't use
| MLX, but does use Metal.
|
| 1. https://goodsnooze.gumroad.com/l/macwhisper
| tambourine_man wrote:
| It was released last week. Give it a month or two
| JCharante wrote:
| Hmm, I've been using this product for whisper
| https://betterdictation.com/
| jonnyreiss wrote:
| I was able to get it running on MLX on my M2 Max machine within
| a couple of minutes using their example:
| https://github.com/ml-explore/mlx-examples/tree/main/whisper
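|
| For anyone else trying it, the example boiled down to roughly
| this (a sketch from memory; run from inside the repo's whisper
| directory after installing its requirements, and check the
| README for the exact current API):
|
|     # inside mlx-examples/whisper, after `pip install -r requirements.txt`
|     import whisper  # the MLX port bundled in that directory, not openai-whisper
|
|     result = whisper.transcribe("speech.wav")
|     print(result["text"])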
| etchalon wrote:
| The shocking thing about these M series comparisons is never "the
| M series is as fast as the GIANT NVIDIA THING!" It's always "Man,
| the M series is 70% as fast with like 1/4 the power."
| kllrnohj wrote:
| It's not really that shocking. Power consumption is non-linear
| with respect to frequency and you see this all the time in high
| end CPU & GPU parts. Look at something like the eco modes on
| the Ryzen 7xxx for a great example. The 7950X stock pulls
| something like 260w on an all core workload at 5.1ghz. Yet
| enable the 105w eco mode and that power consumption _plummets_
| to 160w at 4.8ghz. That means the last 300mhz of performance,
| which is borderline inconsequential (~6% performance loss),
| costs _100W_. The 65W option then cuts that in half almost
| again down to 88w (at 4ghz now), for a "mere" 20% reduction in
| performance. Or phrased differently, for 1/3rd the power the
| 7950X will give you 75% of the performance of a 7950X.
|
| _Matching_ performance while using less power is impressive.
| Using less power while also being slower not so much, though.
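|
| To make the ratios explicit, a quick back-of-the-envelope
| calculation from the numbers above (clock speed as a crude
| proxy for all-core performance):
|
|     modes = {"stock": (260, 5.1), "105W eco": (160, 4.8), "65W eco": (88, 4.0)}
|     for name, (watts, ghz) in modes.items():
|         print(f"{name}: {ghz / watts * 1000:.1f} MHz per watt")
|     # stock: 19.6, 105W eco: 30.0, 65W eco: 45.5 -- efficiency more
|     # than doubles while performance drops ~20%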
| nottorp wrote:
| > Or phrased differently, for 1/3rd the power the 7950X will
| give you 75% of the performance of a 7950X.
|
| So where is the - presumably much cheaper - 65 W 7950X?
| dotnet00 wrote:
| Why would it be much cheaper? The chips are intentionally
| clocked higher than the most efficient point because the
| point of the CPU is raw speed, not power consumption,
| especially since 7950x is a desktop chip.
|
| His point is that M-CPUs being somewhat competitive but
| much more efficient is not as stunning since you're
| comparing a CPU tuned to be at the most efficient speed to
| a CPU tuned to be at the highest speed.
|
| Similarly a 4090's power consumption drops dramatically if
| you underclock or undervolt even slightly, but what's the
| point? You're almost definitely buying a 4090 for its raw
| speed.
| nottorp wrote:
| Because they would be able to sell some rejects.
|
| And because I don't want a software limit that may or may
| not work.
| wtallis wrote:
| AMD's chiplet-based design means they have plenty of
| other ways to make good use of parts that cannot hit the
| highest clock speeds. They have very little reason to do
| a 16-core low-clock part for their consumer desktop
| platform.
|
| And your concerns about "a software limit that may or may
| not work" are completely at odds with how their power
| management works.
| dotnet00 wrote:
| It isn't a "software limit" beyond just being controlled
| by writing to certain CPU registers via software. It's
| very much a feature of the hardware, the same feature
| that allows for overclocking the chips.
| kllrnohj wrote:
| That's not how binning works. Quality silicon is also the
| ones that are more efficient. A flagship 65W part would
| be just as expensive as a result, it's the same-ish
| quality of parts.
| coder543 wrote:
| _Every_ 7950X offers a 65W mode. It's not a separate SKU.
|
| It's a choice each user can make if they care more about
| efficiency. Tasks take longer to complete, but the total
| energy consumed for the completion of the task is
| dramatically less.
| nottorp wrote:
| You think? I think it's a choice that very few technical
| users even know about. And of those, 90% don't care about
| efficiency.
|
| The 250 W space heater vacuum cleaner soundtrack mode
| should be opt in rather than opt out. Same for video
| cards.
| coder543 wrote:
| Of course most users don't pick the 65W option. They want
| maximum performance, and the cost of electricity is
| largely negligible to most people buying a 7950X.
|
| AMD isn't going to offer a huge discount for a 65W 7950X
| for the reasons discussed elsewhere: they don't need to.
| lern_too_spel wrote:
| As others have pointed out, it is nowhere near even 1/4 as
| fast.
| etchalon wrote:
| > The result for a 10 Minute audio is 0:03:36.296329 (216
| seconds). Compare that to 0:03:06.707770 (186 seconds) on my
| Nvidia 4090. The 2000 EUR GPU is still 30 seconds or ~ 16%
| faster. All graphics core where fully utilized during the run
| and I quit all programs, disabled desktop picture or similar
| for that run.
| lern_too_spel wrote:
| What other people have mentioned is that there are multiple
| implementations for Nvidia GPUs that are many times faster
| than whisper.cpp.
| etchalon wrote:
| And my comment was about M1 comparison articles, of which
| this was one. And which exhibited the property I
| mentioned.
| lern_too_spel wrote:
| It does this poorly. They compared the currently most
| optimized whisper implementation for M1 against an
| implementation that is far from the best currently
| available for the Nvidia GPU. They cannot make a claim of
| reaching 70% speed at 25% power usage.
| tgtweak wrote:
| Does this translate to other models, or was Whisper cherry-picked
| due to its serial nature and integer math? Looking at
| https://github.com/ml-explore/mlx-examples/tree/main/stable_...
| seems to hint that this is the case:
|
| >At the time of writing this comparison convolutions are still
| some of the least optimized operations in MLX.
|
| I think the main thing at play is the fact that you can have
| 64+ GB of very fast RAM directly coupled to the CPU/GPU, and the
| benefits of that from a latency/co-accessibility point of view.
|
| These numbers are certainly impressive when you look at the power
| packages of these systems.
|
| Worth considering/noting that the cost of an M3 Max system with
| the minimum RAM config is ~2x the price of a 4090...
| densh wrote:
| Apple silicon's memory is fast only in comparison to consumer
| CPUs, which stagnated for ages with only 2 memory channels;
| that was fine in the 4-core era but makes no sense at all with
| modern core counts. Memory scaling on GPUs is much better, even
| on the consumer front.
| mightytravels wrote:
| Use this Whisper derivative repo instead - one hour of audio gets
| transcribed within a minute or less on most GPUs -
| https://github.com/Vaibhavs10/insanely-fast-whisper
| thrdbndndn wrote:
| Could someone elaborate on how this is accomplished and whether
| there is any quality disparity compared to the original?
|
| Repos like https://github.com/SYSTRAN/faster-whisper make
| immediate sense as to why they're faster than the original
| implementation, and lots of others get there by lowering
| quantization precision etc. (with worse results).
|
| But with this one it's not very clear how, especially
| considering it's even much faster.
| lern_too_spel wrote:
| The Acknowledgments section on the page that GP shared says
| it's using BetterTransformer. https://huggingface.co/docs/opt
| imum/bettertransformer/overvi...
| mightytravels wrote:
| From what I can see it is parallel batch processing - the default
| for that repo is 24. You can reduce the batch size, and if you
| use 1 it's as fast (or slow) as stock Whisper. Quality is the
| exact same (same large model used).
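|
| For reference, the repo is essentially a thin CLI over the
| Hugging Face Transformers ASR pipeline; the batching looks
| roughly like this (a sketch; the audio file name is a
| placeholder):
|
|     import torch
|     from transformers import pipeline
|
|     pipe = pipeline(
|         "automatic-speech-recognition",
|         model="openai/whisper-large-v3",
|         torch_dtype=torch.float16,
|         device="cuda:0",
|     )
|     # 30-second chunks transcribed in parallel batches of 24
|     out = pipe("audio.mp3", chunk_length_s=30, batch_size=24,
|                return_timestamps=True)
|     print(out["text"])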
| claytonjy wrote:
| Anecdotally I've found ctranslate2 to be even faster than
| insanely-fast-whisper. On an L4, using ctranslate2 with a batch
| size as low as 4 beats all their benchmarks except the A100
| with flash attention 2.
|
| It's a shame faster-whisper never landed batch mode, as I think
| that's preventing folks from trying ctranslate2 more easily.
| theschwa wrote:
| I feel like this is particularly interesting in light of their
| Vision Pro. Being able to run models in a power efficient manner
| may not mean much to everyone on a laptop, but it's a huge
| benefit for an already power hungry headset.
| darknoon wrote:
| Would be more interesting if Pytorch with MPS backend was also
| included.
| LiamMcCalloway wrote:
| I'll take this opportunity to ask for help: What's a good open
| source transcription and diarization app or work flow?
|
| I looked at https://github.com/thomasmol/cog-whisper-diarization
| and https://about.transcribee.net/ (from the people behind
| Audapolis) but neither work that well -- crashes, etc.
|
| Thank you!
| dvfjsdhgfv wrote:
| I developed my own solution, pretty rudimentary: it divides
| the MP3s into chunks that Whisper is able to handle and then
| sends them one by one to the API to transcribe. Works as
| expected so far; it's just a couple of lines of Python code.
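|
| The gist is something like this (a sketch, not my exact code;
| assumes pydub plus the openai package, and ~10-minute chunks to
| stay under the API's file-size limit):
|
|     from io import BytesIO
|     from openai import OpenAI
|     from pydub import AudioSegment
|
|     client = OpenAI()  # reads OPENAI_API_KEY from the environment
|     audio = AudioSegment.from_mp3("episode.mp3")
|     chunk_ms = 10 * 60 * 1000
|
|     parts = []
|     for start in range(0, len(audio), chunk_ms):
|         buf = BytesIO()
|         buf.name = "chunk.mp3"  # the SDK uses the name to infer the format
|         audio[start:start + chunk_ms].export(buf, format="mp3")
|         buf.seek(0)
|         resp = client.audio.transcriptions.create(model="whisper-1", file=buf)
|         parts.append(resp.text)
|
|     print(" ".join(parts))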
| mosselman wrote:
| I would like to know the same.
|
| It shouldn't be so hard since many apps have this. But what is
| the most reliable way right now?
| throwaw33333434 wrote:
| META: is the M3 Pro good enough to run Cyberpunk 2077 smoothly?
| Does the Max really make a difference?
| ed_balls wrote:
| 14 inch M3 Max may overheat.
| iAkashPaul wrote:
| There's a better parallel/batching approach that works on the
| 30s chunks, resulting in 40x. From HF at
| https://github.com/Vaibhavs10/insanely-fast-whisper
|
| This is again not native PyTorch, so there's still room for
| better RTFx numbers.
| bee_rider wrote:
| Hmm... this is a dumb question, but the cookie pop up appears to
| be in German on this site. Does anyone know which button to press
| to say "maximally anti-tracking?"
| layer8 wrote:
| If only we had a way to machine-translate text or to block such
| popups.
| bee_rider wrote:
| I think it is better not to block these sorts of pop-ups;
| they are part of the agreement to use the site, after all.
|
| Anyway the middle button is "refuse all" according to my
| phone, not sure how accurate the translation is or if they'll
| shuffle the buttons for other people.
|
| It is poor design to have what appear to be "accept" and
| "refuse" both in green.
| layer8 wrote:
| According to GDPR, the site is not allowed to track you as
| long as you haven't given your consent. There is no
| agreement until you actually agreed to something.
| jauntywundrkind wrote:
| I wonder how AMD's XDNA accelerator will fare.
|
| They just shipped 1.0 of the Ryzen AI Software and SDK. It
| claims ONNX, PyTorch, and TensorFlow support.
| https://www.anandtech.com/show/21178/amd-widens-availability...
|
| Interestingly, the upcoming XDNA2 supposedly is going to boost
| generative performance a lot? "3x". I'd kind of assumed these
| sort of devices would mainly be helping with inference. (I don't
| really know what characterizes the different workloads, just a
| naive grasp.)
| lars512 wrote:
| Is there a great speech generation model that runs on MacOS, to
| close the loop? Something more natural than the built in MacOS
| voices?
| treprinum wrote:
| You can try VALL-E; it takes around 5s to generate a sentence
| on a 3090 though.
| runjake wrote:
| Anyone have overall benchmarks or qualified speculation on how an
| optimized implementation for a 4070 compares against the M series
| -- especially the M3 Max?
|
| I'm trying to decide between the two. I figure the M3 Max would
| crush the 4070?
| sim7c00 wrote:
| looking at the comments perhaps the article could be more eptly
| titled. the author does stress these benchmarks, maybe better
| called test runs, are not of any scientific accuracy or worth,
| but simply to demonstrate what is being tested. i think its
| interesting though that apple and 4090s are even compared in any
| way since the devices are so vastly different. id expect the 4090
| to be more powerful, but apple optimized code runs really quick
| on apple silicon despite this seemingly obvious fact, and that i
| think is interesting. you dont need a 4090 to do things if you
| use the right libraries. is that what i can take from it?
| ex3ndr wrote:
| So running on an M2 Ultra would beat the 4090 by 30%? (Since it
| has 2x the GPU cores.)
| atty wrote:
| I think this is using the OpenAI Whisper repo? If they want a
| real comparison, they should be comparing MLX to faster-whisper
| or insanely-fast-whisper on the 4090. Faster whisper runs
| sequentially, insanely fast whisper batches the audio in 30
| second intervals.
|
| We use whisper in production and these are our findings: we use
| faster-whisper because we find the quality is better when you
| include the previous segment text. Just for comparison, we find
| that faster-whisper is generally 4-5x faster than OpenAI/whisper,
| and insanely-fast-whisper can be another 3-4x faster than faster-
| whisper.
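|
| For anyone wanting to reproduce this, plain faster-whisper usage
| looks roughly like the following (a sketch; the model size and
| file name are placeholders, not our production config):
|
|     from faster_whisper import WhisperModel
|
|     model = WhisperModel("large-v3", device="cuda", compute_type="float16")
|     segments, info = model.transcribe(
|         "call.mp3",
|         beam_size=5,
|         condition_on_previous_text=True,  # feed prior segment text back in
|     )
|     text = " ".join(seg.text for seg in segments)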
| moffkalast wrote:
| Is insanely-fast-whisper fast enough to actually run on the CPU
| and still transcribe in real time? I see that none of these are
| running quantized models; it's still fp16. Seems like there's
| more speed left to be found.
|
| Edit: I see it doesn't yet support CPU inference, should be
| interesting once it's added.
| atty wrote:
| Insanely fast whisper is mainly taking advantage of a GPU's
| parallelization capabilities by increasing the batch size
| from 1 to N. I doubt it would meaningfully improve CPU
| performance unless you're finding that running whisper
| sequentially is leaving a lot of your CPU cores
| idle/underutilized. It may be more complicated if you have a
| matrix co-processor available, I'm really not sure.
| youssefabdelm wrote:
| Does insanely-fast-whisper use beam size of 5 or 1? And what is
| the speed comparison when set to 5?
|
| Ideally it also exposes that parameter to the user.
|
| Speed comparisons seem moot to me when quality is sacrificed;
| I'm working with very poor audio quality, so transcription
| quality matters.
| atty wrote:
| Our comparisons were a little while ago so I apologize I
| can't remember if we used BS 1 or 5 - whichever we picked, we
| were consistent across models.
|
| Insanely fast whisper (god I hate the name) is really a CLI
| around Transformers' whisper pipeline, so you can just use
| that and use any of the settings Transformers exposes, which
| includes beam size.
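|
| Concretely, with the Transformers pipeline underneath, beam
| search would be something like this (a sketch; parameter names
| per the Transformers docs):
|
|     from transformers import pipeline
|
|     pipe = pipeline("automatic-speech-recognition",
|                     model="openai/whisper-large-v3", device="cuda:0")
|     out = pipe("audio.mp3", chunk_length_s=30, batch_size=4,
|                generate_kwargs={"num_beams": 5})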
|
| We also deal with very poor audio, which is one of the
| reasons we went with faster whisper. However, we have
| identified failure modes in faster whisper that are only
| present because of the conditioning on the previous segment,
| so everything is really a trade off.
| PH95VuimJjqBqy wrote:
| yeah well, I find that super-duper-insanely-fast-whisper is
| 3-4x faster than insanely-fast-whisper.
|
| /s
| atty wrote:
| Yes I am not a fan of the naming either :)
| accidbuddy wrote:
| About whisper: does anyone know of a project (GitHub) for using
| the model in real time? I'm studying a new language, and it
| seems like a good opportunity to check my pronunciation against
| the recognized words.
| brcmthrowaway wrote:
| Shocked that Apple hasn't released a high end compute chip
| competitive with NVIDIA
| atlas_hugged wrote:
| TL;DR
|
| If you compare Whisper on a Mac with a Mac-optimized build vs. on
| a PC with a non-optimized Nvidia build, the results are close!
| If an Nvidia-optimized build is compared, it's not even remotely
| close.
|
| Pfft
|
| I'll be picking up a Mac but I'm well aware it's not close to
| Nvidia at all. It's just the best portable setup I can find that
| I can run completely offline.
|
| Do people really need to make these disingenuous comparisons to
| validate their purchase?
|
| If a mac fits your overall use case better, get a Mac. If a pc
| with nvidia is the better choice, get it. Why all these articles
| of "look my choice wasn't that dumb"??
| 2lkj22kjoi wrote:
| 4090 -> 82 TFLOPS
|
| M3 MAX GPU -> 10 TFLOPS
|
| It is 8 times slower than 4090.
|
| But yeah, you can claim that a bike has faster acceleration
| than a Ferrari because it reaches a speed of 1 km/h
| sooner...
___________________________________________________________________
(page generated 2023-12-13 23:01 UTC)