[HN Gopher] Whisper: Nvidia RTX 4090 vs. M1 Pro with MLX
       ___________________________________________________________________
        
       Whisper: Nvidia RTX 4090 vs. M1 Pro with MLX
        
       Author : interpol_p
       Score  : 311 points
       Date   : 2023-12-13 14:52 UTC (8 hours ago)
        
 (HTM) web link (owehrens.com)
 (TXT) w3m dump (owehrens.com)
        
       | whywhywhywhy wrote:
        | Find these findings questionable unless Whisper was very poorly
        | optimized in the way it was run on the 4090.
        | 
        | I have a 3090 and an M1 Max 32GB, and although I haven't tried
        | Whisper, the inference difference on Llama and Stable Diffusion
        | between the two is staggering, especially with Stable Diffusion,
        | where SDXL takes about 9 seconds on the 3090 and 1:10 minutes on
        | the M1 Max.
        
         | agloe_dreams wrote:
          | It's all really messy; I would assume that almost any model is
          | poorly optimized to run on Apple Silicon as well.
        
         | kamranjon wrote:
          | There has been a ton of optimization around Whisper with
          | regard to Apple Silicon; whisper.cpp is a good example that
          | takes advantage of this. Also, this article is specifically
          | referencing the new Apple MLX framework, which I'm guessing
          | your tests with Llama and Stable Diffusion weren't utilizing.
        
           | sbrother wrote:
           | I assume people are working on bringing a MLX backend to
           | llama.cpp... Any idea what the state of that project is?
        
             | tgtweak wrote:
             | https://github.com/ml-explore/mlx-examples
             | 
              | Several people are working on MLX-enabled backends for
              | popular ML workloads, but it seems inference workloads see
              | the most acceleration vs. generative/training.
        
         | oceanplexian wrote:
          | The M1 Max has 400GB/s of memory bandwidth and a 4090 has
          | 1TB/s; the M1 Max has 32 GPU cores and a 4090 has 16,000. The
          | difference is more about how well the software is optimized
          | for the hardware platform than any performance difference
          | between the two, which are frankly not comparable in any way.
        
           | codedokode wrote:
            | I think that the 4090 has 16000 ALUs, not "cores" (let's
            | call a component capable of executing instructions
            | independently of the others a "core"). And the M1 Max
            | probably has more than 1 ALU in every core, otherwise it
            | would resemble an ancient GPU.
        
             | rsynnott wrote:
             | Yeah; 'core' is a pretty meaningless term when it comes to
             | GPUs, or at least it's meaningless outside the context of a
             | particular architecture.
             | 
             | We may just be thankful that this particular bit of
             | marketing never caught on for CPUs.
        
           | segfaultbuserr wrote:
           | > _M1 Max has 32 GPU cores and a 4090 has 16,000._
           | 
            | Apple M1 Max has 32 GPU _cores_, each core contains 16
            | _Execution Units_, each EU has 8 _ALUs_ (also called
            | _shaders_), so overall there are 4096 shaders. Nvidia RTX
            | 4090 contains 12 _Graphics Processing Clusters_, each GPC
            | has 12 _Streaming Multi-Processors_, and each SM has 128
            | _ALUs_, overall there are 18432 shaders.
            | 
            | A single shader is somewhat similar to a single lane of a
            | vector ALU in a CPU. One can say that a single-core CPU with
            | AVX-512 has 8 shaders, because it can process 8 FP64s at the
            | same time. Calling them "cores" (as in "CUDA core") is
            | extremely misleading, which is why "shader" became the
            | common name for a GPU's ALU. If Nvidia were in charge of
            | marketing a 4-core x86-64 CPU, they would call it a CPU with
            | 32 "AVX cores" because each core has 8-way SIMD.
        
             | jrk wrote:
             | Actually each of those x86 CPUs probably has at least two
             | AVX FMA units, and can issue 16xFP32 FMAs per cycle - it's
             | at least "64 AVX cores"! :)
        
               | kimixa wrote:
                | Doesn't Zen 4 have 2x 256-bit FADD and 2x 256-bit FMA,
                | with AVX-512 ops double-pumping the ALUs (a good
                | overview here [0])? If you count an FADD as a single
                | flop and an FMA as 2, that's 48 "1 flop cores" per core.
                | 
                | I think it's got the same total FP ALU resources as Zen
                | 3, which shows how register width and ALU resources can
                | be completely decoupled.
               | 
               | [0] https://www.mersenneforum.org/showthread.php?p=614191
        
           | stonemetal12 wrote:
           | Nvidia switched to marketing speak a long time ago when it
           | came to the word "core". If we go with Nvidia's definition
           | then M1 Max has 4096 cores, still behind the 4090, but the
           | gap isn't as big as 32 to 16k.
        
         | ps wrote:
          | I have a 4090 and an M1 Max 64GB. The 4090 is far superior on
          | Llama 2.
        
           | astrodust wrote:
           | On models < 24GB presumably. "Faster" depends on the model
           | size.
        
             | brucethemoose2 wrote:
             | In this case, the 4090 is far more memory efficient thanks
             | to ExLlamav2.
             | 
             | 70B in particular is indeed a significant compromise on the
             | 4090, but not as much as you'd think. 34B and down though,
             | I think Nvidia is unquestionably king.
        
               | michaelt wrote:
               | Doesn't running 70B in 24GB need 2 bit quantisation?
               | 
               | I'm no expert, but to me that sounds like a recipe for
               | bad performance. Does a 70B model in 2-bit really
               | outperform a smaller-but-less-quantised model?
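                | 
                | For reference, the back-of-the-envelope weight storage
                | for a 70B model (weights only, ignoring the KV cache and
                | activations), which is why ~2-2.5 bits per weight comes
                | up for a 24GB card:
                | 
                |     params = 70e9
                |     for bits in (16, 8, 4, 2.5, 2):
                |         gb = params * bits / 8 / 1e9  # GB of weights
                |         print(f"{bits:>4} bpw -> {gb:6.1f} GB")
                |     # 16 -> 140.0, 8 -> 70.0, 4 -> 35.0,
                |     # 2.5 -> 21.9, 2 -> 17.5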
        
           | jb1991 wrote:
           | But are you using the newly released Apple MLX optimizations?
        
             | ps wrote:
             | It's been approximately 2 months since I have tested it, so
             | probably not.
        
               | jb1991 wrote:
               | But those optimizations are the subject of the article
               | you are commenting on.
        
         | woadwarrior01 wrote:
          | You're taking benchmark numbers from a latent diffusion
          | model's (SDXL) inference and extrapolating them to an encoder-
          | decoder transformer model's (Whisper) inference. These two
          | model architectures have little in common (except perhaps the
          | fact that Stable Diffusion models use a pre-trained text
          | encoder from CLIP, which again is very different from an
          | encoder-decoder transformer).
        
           | brucethemoose2 wrote:
            | The point still stands though. Popular models tend to have
            | massively hand-optimized Nvidia implementations.
           | 
           | Whisper is no exception:
           | https://github.com/Vaibhavs10/insanely-fast-whisper
           | 
           | SDXL is actually an interesting exception _for Nvidia_
           | because most users still tend to run it in PyTorch eager
            | mode. There are super optimized Nvidia implementations, like
            | stable-fast, but their use is less common. Apple, on the
            | other hand, took the odd step of hand-writing a Metal
            | implementation themselves, at least for SD 1.5.
        
             | tgtweak wrote:
             | Modest 30x speedup
        
             | kkielhofner wrote:
             | This will determine who has a shot at actually being Nvidia
             | competitive.
             | 
             | What I like to say is (generally speaking) other
             | implementations like AMD (ROCm), Intel, Apple, etc are
             | more-or-less at the "get it to work" stage. Due to their
             | early lead and absolute market dominance Nvidia has been at
             | the "wring every last penny of performance out of this"
             | stage for years.
             | 
             | Efforts like this are a good step but they still have a
             | very long way to go to compete with multiple layers
             | (throughout the stack) of insanely optimized Nvidia/CUDA
              | implementations. Bonus points: nearly anything with Nvidia
              | is a Docker command that just works on any chip they've
              | made in the last half decade, from laptop to datacenter.
             | 
              | This can be seen (dramatically) with ROCm. I recently took
              | the significant effort (again) to get an LLM to run on an
              | AMD GPU. The AMD GPU is "cheaper" in initial cost, but when
              | the roughly dollar-equivalent (within 10-30%) Nvidia GPU is
              | 5-10x faster (or whatever), you're not saving anything.
             | 
              | You're already at a loss just getting it to work (random
              | patches, version hacks, etc.) unless your time is free,
              | and then the performance just isn't even close, so the
              | "value prop" of AMD currently doesn't make any sense
              | whatsoever. The advantage for Apple is you likely spent
              | whatever for the machine anyway, and when you have it just
              | sitting in front of you for a variety of tasks the value
              | prop increases significantly.
        
             | woadwarrior01 wrote:
              | Although LDM inference and encoder-decoder and decoder-
              | only LLM inference are both fundamentally autoregressive
              | in nature, LLM inference is memory bound and LDM inference
              | is compute bound. In that light, it makes sense that the
              | difference between a 4090 and M1 Pro isn't as pronounced
              | as one would expect at first approximation.
             | 
              | Also, as you hint, whisper.cpp certainly isn't one of the
             | fastest implementations of whisper inference out there.
             | Perhaps a comparison between a pure PyTorch version running
             | on the 4090 with an MLX version of Whisper running on the
             | M1 Pro would be fairer. Or better yet, run the whisper
             | encoder on ANE with CoreML and have the decoder running
             | with Metal and Accelerate (which uses Apple's undocumented
             | AMX ISA) using MLX, since MLX currently does not use the
             | ANE. IIRC, whisper.cpp has a similar optimization on Apple
             | hardware, where it optionally runs the encoder using CoreML
             | and the decoder using Metal.
        
         | KingOfCoders wrote:
         | "I haven't tried Whisper"
         | 
         | I haven't tried the hardware/software/framework/... of the
         | article, but I have an opinion on this exact topic.
        
           | xxs wrote:
           | The topic is benchmarking some hardware and specific
           | implementation of some tool.
           | 
            | The provided context is an earlier version of hardware where
            | known implementations perform drastically differently - an
            | order of magnitude differently.
           | 
           | That leaves the question why that specific tool exhibits the
           | behavior described in the article.
        
         | tgtweak wrote:
          | Reading through some (admittedly very early) MLX docs, it
          | seems that convolutions (as used heavily in GANs and
          | particularly Stable Diffusion) are not really seeing
          | meaningful uplifts on MLX at all, and in some cases are slower
          | than on the CPU.
         | 
          | Not sure if this is a hardware limitation or just unoptimized
          | MLX libraries, but I find it hard to believe they would have
          | just ignored this very prominent use case. It's more likely
          | that convolutions use high precision and much larger tile sets
          | that require some expensive context switching when the entire
          | transform can't fit in the GPU.
        
         | liuliu wrote:
          | Both of your SDXL numbers (3090 and M1 Max) should be faster
          | (of course, it depends on how many steps). But the point
          | stands: for SDXL, a 3090 should be 5x to 6x faster than an M1
          | Max, and 2x to 2.5x faster than an M2 Ultra.
        
         | stefan_ wrote:
          | Having used whisper a ton, there are versions of it that have
          | one or two orders of magnitude better performance at the same
          | quality while using less memory, for reasons I don't fully
          | understand.
         | 
          | So I'd be very careful about your intuition on whisper
          | performance unless it's literally the same software and same
          | model (and even then the comparison isn't very meaningful,
          | seeing how we want to optimize it for different platforms).
        
         | mv4 wrote:
         | Thank you for sharing this data. I've just been debating
         | between M2 Mac Studio Max and a 64GB i9 10900x with RTX 3090
         | for personal ML use. Glad I chose the 3090! Would love to learn
         | more about your setup.
        
       | Flux159 wrote:
       | How does this compare to insanely-fast-whisper though?
       | https://github.com/Vaibhavs10/insanely-fast-whisper
       | 
       | I think that not using optimizations allows this to be a 1:1
       | comparison, but if the optimizations are not ported to MLX, then
       | it would still be better to use a 4090.
       | 
       | Having looked at MLX recently, I think it's definitely going to
       | get traction on Macs - and iOS when Swift bindings are released
       | https://github.com/ml-explore/mlx/issues/15 (although there might
       | be some C++20 compilation issue blocking right now).
        
         | brucethemoose2 wrote:
          | This is the thing about Nvidia. Even if some hardware beats
          | them in a benchmark, if it's a popular model, there will be
          | some massively hand-optimized CUDA implementation that blows
          | anything else out of the water.
         | 
         | There are some rare exceptions (like GPT-Fast on AMD thanks to
         | PyTorch's hard work on torch.compile, and only in a narrow use
         | case), but I can't think of a single one for Apple Silicon.
        
           | MBCook wrote:
            | I wouldn't be surprised if a $2k top-of-the-line GPU is a
            | match for/better than the built-in accelerator on a Mac.
            | Even if the Mac was slightly faster, you could just stick
            | multiple GPUs in a PC.
           | 
           | To me the news here is how well the Mac runs without needing
           | that additional hardware/large power draw on this benchmark.
        
           | rfoo wrote:
           | > but I can't think of a single one for Apple Silicon.
           | 
            | The post here is exactly one for Apple Silicon. It compared
            | a naive implementation in PyTorch which may not even keep
            | the 4090 busy (for smaller/not-that-compute-intensive models
            | having the entire computation driven by Python is...
            | limiting, which is partly why torch.compile gives amazing
            | improvements) to a purposely optimized one (tuned for both
            | CPU and GPU efficiency) for Apple Silicon.
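            | 
            | (For reference, the torch.compile hand-off being alluded to
            | is a one-liner; a minimal sketch with a placeholder model,
            | not the code from the post:)
            | 
            |     import torch
            | 
            |     model = torch.nn.Sequential(
            |         torch.nn.Linear(1024, 4096),
            |         torch.nn.GELU(),
            |         torch.nn.Linear(4096, 1024),
            |     ).cuda().half()
            | 
            |     # Fuses the graph so the GPU isn't stalled waiting on
            |     # the Python interpreter between ops.
            |     compiled = torch.compile(model)
            | 
            |     x = torch.randn(8, 1024, device="cuda",
            |                     dtype=torch.half)
            |     with torch.no_grad():
            |         y = compiled(x)  # first call compiles, rest are fast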
        
           | jeroenhd wrote:
           | The one being benchmarked here is heavily optimised for Apple
           | Silicon. I think there are a few algorithms that Apple uses
           | (like the one tagging faces on iPhones) that are heavily
           | optimised for Apple's own hardware.
           | 
           | I think Apple's API would be as popular as CUDA if you could
           | rent their chips at scale. They're quite efficient machines
           | that don't need a lot of cooling, so I imagine the OPEX of
           | keeping them running 24/7 in big cloud racks would be pretty
           | low if they were optimised for server usage.
           | 
           | Apple seems to focus their efforts on bringing purpose-built
           | LLMs to Apple machines. I can see why it makes sense (just
           | like Google's attempts to bring Tensor cores to mobile) but
           | there's not much practical use in this technology right now.
            | Whisper is the first usable technology like this, but even
            | my Android phone can live-transcribe speech into text as an
            | accessibility feature, so I don't think Apple can sell
            | Whisper as a product to end users.
        
             | jdminhbg wrote:
             | > The one being benchmarked here is heavily optimised for
             | Apple Silicon.
             | 
              | I don't think so, in the sense of a hand-optimized CUDA
              | implementation. This is just using the MLX API in the same
              | way that you'd use CUDA via PyTorch or something.
        
             | mac-mc wrote:
              | Apple would need to make rackmount versions of the
              | machines with replaceable storage and maybe RAM, and would
              | really need to beef up their headless management of the
              | machines, before they start becoming competitive.
             | 
              | Otherwise you need a whole bunch of custom Mac mini-style
              | racks and management software, which really increases
              | costs and lead times. If you don't believe me, look how
              | expensive AWS macOS machines are compared to Linux ones
              | with equivalent performance.
        
               | poyu wrote:
                | They already make rack-mount Mac Pros. But yeah, they
                | need to up their game on the management software.
        
         | claytonjy wrote:
         | To have a good comparison I think we'd need to run the
         | insanely-fast-whisper code on a 4090. I bet it handily beats
         | both the benchmarks in OP, though you'll need a much smaller
         | batch size than 24.
         | 
         | You can beat these benchmarks on a CPU; 3-4x realtime is very
         | slow for whisper these days!
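          | 
          | To put numbers on "realtime factor" using the figures from
          | the post (10 minutes of audio, ~216 s on the M1 Pro with MLX,
          | ~186 s on the 4090 as originally tested):
          | 
          |     audio_s = 10 * 60
          |     runs = [("M1 Pro + MLX", 216), ("4090, original test", 186)]
          |     for label, wall_s in runs:
          |         print(f"{label}: {audio_s / wall_s:.1f}x realtime")
          |     # M1 Pro + MLX: 2.8x realtime
          |     # 4090, original test: 3.2x realtime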
        
         | chrisbrandow wrote:
          | He updated the post with insanely-fast-whisper results.
        
       | bcatanzaro wrote:
       | What precision is this running in? If 32-bit, it's not using the
       | tensor cores in the 4090.
        
       | DeathArrow wrote:
       | Ok, OpenAI will ditch Nvidia and buy macs instead. :)
        
         | baldeagle wrote:
         | Only if Sam Altman is appointed to the Apple board. ;)
        
       | tiffanyh wrote:
        | Key to this article is understanding that it's leveraging the
        | newly released Apple MLX, and that the code uses these Apple-
        | specific optimizations.
       | 
       | https://news.ycombinator.com/item?id=38539153
        
         | modeless wrote:
         | Also, this is not comparing against an optimized Nvidia
         | implementation. There are faster implementations of Whisper.
         | 
         | Edit: OK I took the bait. I downloaded the 10 minute file he
         | used and ran it on _my_ 4090 with insanely-fast-whisper, which
         | took two commands to install. Using whisper-large-v3 the file
         | is transcribed in less than eight seconds. Fifteen seconds if
         | you include the model loading time before transcription starts
         | (obviously this extra time does not depend on the length of the
         | audio file).
         | 
         | That makes the 4090 somewhere between 6 and 12 _times_ faster
            | than Apple's best. It's also _much_ cheaper than M2 Ultra if
         | you already have a gaming PC to put it in, and _still_ cheaper
         | even if you buy a whole prebuilt PC with it.
         | 
         | This should not be surprising to people, but I see a lot of
         | wishful thinking here from people who own high end Macs and
         | want to believe they are good at everything. Yes, Apple's
         | M-series chips are very impressive and the large RAM is great,
         | but they are _not_ competitive with Nvidia at the high end for
         | ML.
        
           | isodev wrote:
           | It also wasn't optimised for Apple Silicon. Given how the
           | different platforms performed in this test, the conclusions
           | seem pretty solid.
        
             | jbellis wrote:
             | He is literally comparing whisper.cpp on the 4090 with an
             | optimized-for-apple-silicon-by-apple-engineers version on
             | the M1.
             | 
             | ETA: actually it's unclear from the article if the whisper
             | optimizations were done by apple engineers, but it's
             | definitely an optimized version.
        
               | rowanG077 wrote:
               | I don't think whisper was optimized for apple silicon.
               | Doesn't it just use MLX? I mean if using an API for a
               | platform counts as specifically optimized then the Nvidia
               | version is "optimized" as well since it's probably using
               | CUDA.
        
               | isodev wrote:
                | Maybe I'm not seeing it right, but comparing the source
                | of Apple's Whisper to the Python Whisper, it seems there
                | are minimal changes to redirect certain operations to
                | using MLX.
               | 
                | There is also whisper.cpp
                | (https://github.com/ggerganov/whisper.cpp), which seems
                | to have its own kind of optimizations for Apple Silicon
                | - I don't think this was the one used with Nvidia during
                | the test.
        
           | swores wrote:
           | Would you be so kind as to link to a guide for your method or
           | share it in a comment yourself?
           | 
           | I installed following the official docs and found it much,
           | much slower, although I sadly don't have a 4090, instead a
           | 3080 Ti 12GB (just big enough to load the large whisper model
           | into GPU memory).
        
             | modeless wrote:
             | I'm running on Linux with a 13900k, 64 GB RAM, and I
             | already have CUDA installed. Install commands directly from
              | the README:
              | 
              |     pipx install insanely-fast-whisper
              |     pipx runpip insanely-fast-whisper install flash-attn --no-build-isolation
              | 
              | To transcribe the file:
              | 
              |     insanely-fast-whisper --flash True --file-name ~/Downloads/podcast_1652_was_jetzt_episode_1289963_update_warum_streiken_sie_schon_wieder_herr_zugchef.mp3 --language german --model-name openai/whisper-large-v3
             | 
              | The file can be downloaded at:
              | https://adswizz.podigee-cdn.net/version/1702050198/media/pod...
             | 
             | I just ran it again and happened to get an even better
             | time, under 7 seconds without loading and 13.08 seconds
             | including loading. In case anyone is curious about the use
             | of Flash Attention, I tried without it and transcription
             | took under 10 seconds, 15.3 including loading.
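              | 
              | (If you'd rather drive it from Python than the CLI: the
              | tool is essentially a wrapper around the Hugging Face
              | Transformers ASR pipeline, so a minimal sketch - with
              | batch size and dtype as illustrative values, not tuned -
              | would be:)
              | 
              |     import torch
              |     from transformers import pipeline
              | 
              |     # Assumes a CUDA GPU and the large-v3 checkpoint.
              |     asr = pipeline(
              |         "automatic-speech-recognition",
              |         model="openai/whisper-large-v3",
              |         torch_dtype=torch.float16,
              |         device="cuda:0",
              |     )
              |     out = asr("podcast.mp3", chunk_length_s=30,
              |               batch_size=24, return_timestamps=True)
              |     print(out["text"])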
        
               | owehrens wrote:
               | I just can't get it to work, it errors out with
               | 'NotImplementedError: The model type whisper is not yet
               | supported to be used with BetterTransformer.' Did you
               | happen to run into this problem?
        
               | modeless wrote:
               | Sorry, I didn't encounter that error. It worked on the
               | first try for me. I have wished many times that the ML
               | community didn't settle on Python for this reason...
        
               | owehrens wrote:
               | Thanks so much again. Got it working. 8 seconds. Nvidia
               | is the king. Updated the blog post.
        
               | darkteflon wrote:
                | I think insanely-fast-whisper uses batching, so faster-
                | whisper (which doesn't) might be a fairer comparison for
                | the purposes of your post.
        
               | swores wrote:
               | Thanks!
               | 
               | Another question that's only slightly related, but while
               | we're here...
               | 
               | Using OAI's paid Whisper API, you can give a text prompt
               | to a) set the tone/style of the transcription and b)
               | teach it technical terms, names etc that it might not be
               | familiar with and should expect in the audio to
               | transcribe.
               | 
               | Am I correct that this isn't possible with any released
               | versions of Whisper, or is there a way to do it on my
               | machine that I'm not aware of?
        
               | modeless wrote:
               | You can definitely do this with the open source version.
               | Many transcription implementations use it to maintain
               | context between the max-30-second chunks Whisper natively
               | supports.
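                | 
                | (For example, with the reference openai-whisper package
                | the prompt is exposed as initial_prompt; a minimal
                | sketch, with the glossary text made up:)
                | 
                |     import whisper
                | 
                |     model = whisper.load_model("large-v3")
                |     result = model.transcribe(
                |         "meeting.mp3",
                |         language="en",
                |         # Decoder context: nudges style and teaches
                |         # rare terms and names it should expect.
                |         initial_prompt="Glossary: MLX, ExLlamaV2, "
                |                        "CTranslate2, whisper.cpp.",
                |     )
                |     print(result["text"])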
        
               | swores wrote:
               | I'll try to understand some of how stuff like faster-
               | whisper works when I've got time over the weekend, but I
               | fear it may be too complex for me...
               | 
               | I was rather hoping for a guide of just how to either
               | adapt classic whisper usage or adapt one of the optimised
               | ones like faster-whisper (which I've just set up in a
               | docker container but that's used up all the time I've got
               | for playing around right now) to take a text prompt with
               | the audio file.
        
               | sundvor wrote:
                | Cheers, I've been wanting to get into doing something
                | else with my 4090 other than multi-monitor simulator
                | gaming and quad-screen workstation work - and this will
                | get me kicked off!
               | 
               | The 4090 is an absolute beast, runs extremely quiet and
               | simply powers through everything. DCS pushes it to the
               | limit, but the resulting experience is simply stunning.
               | Mine's coupled to a 7800x3d which uses hardly any power
               | at all, absolutely love it.
        
               | modeless wrote:
               | If you're looking for something easy to try out, try my
               | early demo that hooks Whisper to an LLM and TTS so you
               | can have a real time speech conversation with your local
               | GPU that feels like talking to a person! It's way faster
               | than ChatGPT:
               | https://apps.microsoft.com/detail/9NC624PBFGB7
        
           | owehrens wrote:
            | You mean
            | 'https://github.com/Vaibhavs10/insanely-fast-whisper'? Did
            | not know that until now. I've been running all of this for
            | over ~10 months and just had it running. Happy to try that
            | out. The GPU is fully utilized by using whisper and pytorch
            | with CUDA and all. Thanks for the link.
        
           | WhitneyLand wrote:
           | I'm afraid the article as well as your benchmarks can be
           | misleading because there are a lot of different whisper
           | implementations out there.
           | 
            | For example, the CTranslate2-optimized whisper "implements a
            | custom runtime that applies many performance optimization
            | techniques such as weights quantization, layers fusion,
            | batch reordering..."
           | 
           | Intuitively, I would agree with your conclusion about Apple's
           | M-Series being impressive for what they do but not generally
           | competitive with Nvidia in ML.
           | 
           | Objectively however, I don't see concluding much with what's
           | on offer here. Once you start changing libraries, kernels,
           | transformer code, etc you end up with an apples to oranges
           | comparison.
        
             | modeless wrote:
             | I think it's fair to compare the fastest available
             | implementation on both platforms. I suspect that the MLX
             | version can be optimized further. However, it will not
             | close a 10x gap.
        
           | darkteflon wrote:
           | Surely the fact that IFW uses batching makes it apples to
           | oranges? The MLX-enabled version didn't batch, did it? That
           | fundamentally changes the nature of the operation. Wouldn't
           | the better comparison be faster-whisper?
        
             | modeless wrote:
             | I don't know exactly what the MLX version did, but you're
             | probably right. I'd love to see the MLX side optimized to
             | the max as well. I'm confident that it would not reach the
             | performance of the 4090, but it might do better.
             | 
             | That said, for practical purposes, the ready availability
             | of Nvidia-optimized versions of every ML system is a big
             | advantage in itself.
        
               | darkteflon wrote:
               | Yeah, I think everyone knows that Nvidia is doing a
               | cracker job. But it is good to just be specific about
               | these benchmarks because numbers get thrown around and it
               | turns out people are testing different things. The other
               | thing is that Apple is extracting this performance on a
               | laptop, at ~1/8 the power draw of the desktop Nvidia
               | card.
               | 
               | In any event, it's super cool to see such huge leaps just
               | in the past year on how easy it is to run this stuff
               | locally. Certainly looking very promising.
        
               | modeless wrote:
               | The M2 Ultra that got the best numbers that I was
               | comparing to is not in a laptop. Regardless, you're
               | probably right that the power consumption is
               | significantly lower per unit time. However, is it lower
               | per unit work done? It would be interesting to see a
               | benchmark optimized for power. Nvidia's power consumption
               | can usually be significantly reduced without much
               | performance cost.
               | 
               | Also, the price difference between a prebuilt 4090 PC and
               | a M2 Ultra Mac Studio can buy a lot of kilowatt hours.
        
           | justinclift wrote:
           | > Nvidia at the high end for ML.
           | 
           | Wouldn't the high end for Nvidia be their dedicated gear
           | rather than a 4090?
        
             | modeless wrote:
             | Yes, H100 would be faster still, and Grace Hopper perhaps
             | even somewhat faster. But Apple doesn't have comparable
             | datacenter-only products, so it's also interesting to see
             | the comparison of consumer hardware. Also, 4090 is cheaper
             | than Apple's best, but H100 is more than both (if you are
             | even allowed to buy it).
        
         | intrasight wrote:
         | Honest question: Why would I (or most users) care? If I have a
         | Mac, I'm going to get the performance of that machine. If I
         | have a gaming PC, I'll get the performance of that machine. If
         | I have both, I'm still likely to use whichever AI is running on
         | my daily driver.
        
           | tomatotomato31 wrote:
           | Curiosity for example.
           | 
           | Or having a better image of performance in your head when you
           | buy new hardware.
           | 
            | It's just a blog article. The time and effort to make it and
            | to consume it is not in the range of millions; there is no
            | need for 'more'.
        
           | winwang wrote:
           | I have both. I drive both depending on context of my day and
           | where I want to work. If I were interested in this, and the
           | 4090 were faster, I'd just ssh into that machine from my Mac
           | if I wanted to use my Mac -- and presumably, vice versa.
        
           | shikon7 wrote:
            | Suppose you're a serious hobbyist wanting to buy the most
            | powerful consumer device for deep learning. Then the choice
            | between buying an RTX 4090 or an M3 Mac is probably quite
            | interesting.
            | 
            | Personally, I already have an RTX 3090, so for me it's
            | interesting whether an M3 would be a noticeable upgrade
            | (considering RAM, speed, library support).
        
             | light_hue_1 wrote:
             | It's really not interesting. The Nvidia card is far
             | superior.
             | 
             | The super optimized code for a network like this is one
             | test case. General networks are another.
        
               | mi_lk wrote:
               | > The Nvidia card is far superior.
               | 
               | I mean, you're probably right, but MLX framework was
               | released just a week ago so maybe we don't really know
               | what it's capable of yet.
        
               | mac-mc wrote:
               | The "available VRAM" is the big difference. You get a lot
               | more to use on an apple silicon machine for much cheaper
               | than you get on an nvidia card on a $/GB basis,
               | especially if you want to pass the 24GB threshold. You
               | can't use NVIDIA consumer cards after that and there is a
               | shortage of them.
               | 
                | The power:perf ratios seem about equal though, so it's
                | really up to Apple to release equivalently sized silicon
                | on the desktop for this to really be a 1:1 comparison.
        
               | swores wrote:
               | I wonder how complicated it would be to make happen / how
               | likely it is to happen, for the Apple way of sharing
               | memory between CPU and GPU to become a thing on PCs.
               | Could NVIDIA do it unilaterally or would it need to be in
               | partnership with other companies?
               | 
               | Could anyone who understands this hardware better than me
               | chime in on the complexities of bringing unified ram/vram
               | to PC, or reasons it hasn't happened yet?
        
       | Lalabadie wrote:
       | There will be a lot of debate about which is the absolute best
       | choice for X task, but what I love about this is the level of
       | performance at such a low power consumption.
        
       | SlavikCA wrote:
       | It's easy to run Whisper on my Mac M1. But it's not using MLX out
       | of the box.
       | 
        | I spent an hour or two trying to figure out what I need to
        | install / configure to enable it to use MLX. Was getting cryptic
        | Python errors, Torch errors... Gave up on it.
       | 
        | I rented a VM with a GPU, and started Whisper on it within a few
        | minutes.
        
         | xd1936 wrote:
         | I've really enjoyed this macOS Whisper GUI[1]. It doesn't use
         | MLX, but does use Metal.
         | 
         | 1. https://goodsnooze.gumroad.com/l/macwhisper
        
         | tambourine_man wrote:
          | It was released last week. Give it a month or two.
        
         | JCharante wrote:
         | Hmm, I've been using this product for whisper
         | https://betterdictation.com/
        
         | jonnyreiss wrote:
          | I was able to get it running on MLX on my M2 Max machine
          | within a couple minutes using their example:
          | https://github.com/ml-explore/mlx-examples/tree/main/whisper
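          | 
          | (For anyone else trying it: after installing the example's
          | requirements, usage is roughly the following - function names
          | from memory, so treat it as a sketch and check the repo:)
          | 
          |     # Run from the mlx-examples/whisper directory; this is the
          |     # MLX port, not the PyPI openai-whisper package.
          |     import whisper
          | 
          |     result = whisper.transcribe("audio.mp3")
          |     print(result["text"])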
        
       | etchalon wrote:
        | The shocking thing about these M series comparisons is never
        | "the M series is as fast as the GIANT NVIDIA THING!"; it's
        | always "Man, the M series is 70% as fast with like 1/4 the
        | power."
        
         | kllrnohj wrote:
         | It's not really that shocking. Power consumption is non-linear
         | with respect to frequency and you see this all the time in high
         | end CPU & GPU parts. Look at something like the eco modes on
         | the Ryzen 7xxx for a great example. The 7950X stock pulls
         | something like 260w on an all core workload at 5.1ghz. Yet
         | enable the 105w eco mode and that power consumption _plummets_
         | to 160w at 4.8ghz. That means the last 300mhz of performance,
         | which is borderline inconsequential (~6% performance loss),
         | costs _100W_. The 65W option then cuts that in half almost
         | again down to 88w (at 4ghz now), for a  "mere" 20% reduction in
         | performance. Or phrased differently, for 1/3rd the power the
         | 7950X will give you 75% of the performance of a 7950X.
         | 
         |  _Matching_ performance while using less power is impressive.
         | Using less power while also being slower not so much, though.
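          | 
          | Putting those 7950X numbers side by side (performance figures
          | are the rough ones quoted above, not measurements):
          | 
          |     # (label, watts, relative performance)
          |     modes = [("stock, 5.1 GHz",    260, 1.00),
          |              ("105W eco, 4.8 GHz", 160, 0.94),
          |              ("65W eco, 4.0 GHz",   88, 0.94 * 0.80)]
          |     for label, watts, perf in modes:
          |         print(f"{label}: {watts/260:.0%} power, "
          |               f"~{perf:.0%} performance")
          |     # stock: 100% / 100%, 105W eco: 62% / ~94%,
          |     # 65W eco: 34% / ~75%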
        
           | nottorp wrote:
           | > Or phrased differently, for 1/3rd the power the 7950X will
           | give you 75% of the performance of a 7950X.
           | 
           | So where is the - presumably much cheaper - 65 W 7950X?
        
             | dotnet00 wrote:
             | Why would it be much cheaper? The chips are intentionally
             | clocked higher than the most efficient point because the
             | point of the CPU is raw speed, not power consumption,
             | especially since 7950x is a desktop chip.
             | 
             | His point is that M-CPUs being somewhat competitive but
             | much more efficient is not as stunning since you're
             | comparing a CPU tuned to be at the most efficient speed to
             | a CPU tuned to be at the highest speed.
             | 
             | Similarly a 4090's power consumption drops dramatically if
             | you underclock or undervolt even slightly, but what's the
             | point? You're almost definitely buying a 4090 for its raw
             | speed.
        
               | nottorp wrote:
               | Because they would be able to sell some rejects.
               | 
               | And because I don't want a software limit that may or may
               | not work.
        
               | wtallis wrote:
               | AMD's chiplet-based design means they have plenty of
               | other ways to make good use of parts that cannot hit the
               | highest clock speeds. They have very little reason to do
               | a 16-core low-clock part for their consumer desktop
               | platform.
               | 
               | And your concerns about "a software limit that may or may
               | not work" are completely at odds with how their power
               | management works.
        
               | dotnet00 wrote:
               | It isn't a "software limit" beyond just being controlled
               | by writing to certain CPU registers via software. It's
               | very much a feature of the hardware, the same feature
               | that allows for overclocking the chips.
        
               | kllrnohj wrote:
                | That's not how binning works. The higher-quality silicon
                | is also the more efficient silicon. A flagship 65W part
                | would be just as expensive as a result; it's the
                | same-ish quality of parts.
        
             | coder543 wrote:
             | _Every_ 7950X offers a 65W mode. It's not a separate SKU.
             | 
             | It's a choice each user can make if they care more about
             | efficiency. Tasks take longer to complete, but the total
             | energy consumed for the completion of the task is
             | dramatically less.
        
               | nottorp wrote:
               | You think? I think it's a choice that very few technical
               | users even know about. And of those, 90% don't care about
               | efficiency.
               | 
               | The 250 W space heater vacuum cleaner soundtrack mode
               | should be opt in rather than opt out. Same for video
               | cards.
        
               | coder543 wrote:
               | Of course most users don't pick the 65W option. They want
               | maximum performance, and the cost of electricity is
               | largely negligible to most people buying a 7950X.
               | 
               | AMD isn't going to offer a huge discount for a 65W 7950X
               | for the reasons discussed elsewhere: they don't need to.
        
         | lern_too_spel wrote:
         | As others have pointed out, it is nowhere near even 1/4 as
         | fast.
        
           | etchalon wrote:
            | > The result for a 10 minute audio is 0:03:36.296329 (216
            | seconds). Compare that to 0:03:06.707770 (186 seconds) on my
            | Nvidia 4090. The 2000 EUR GPU is still 30 seconds or ~16%
            | faster. All graphics cores were fully utilized during the
            | run, and I quit all programs and disabled the desktop
            | picture or similar for that run.
        
             | lern_too_spel wrote:
             | What other people have mentioned is that there are multiple
             | implementations for Nvidia GPUs that are many times faster
             | than whisper.cpp.
        
               | etchalon wrote:
               | And my comment was about M1 comparison articles, of which
               | this was one. And which exhibited the property I
               | mentioned.
        
               | lern_too_spel wrote:
               | It does this poorly. They compared the currently most
               | optimized whisper implementation for M1 against an
               | implementation that is far from the best currently
               | available for the Nvidia GPU. They cannot make a claim of
               | reaching 70% speed at 25% power usage.
        
       | tgtweak wrote:
        | Does this translate to other models, or was whisper cherry-
        | picked due to its serial nature and integer math? Looking at
        | https://github.com/ml-explore/mlx-examples/tree/main/stable_...
        | seems to hint that this is the case:
       | 
       | >At the time of writing this comparison convolutions are still
       | some of the least optimized operations in MLX.
       | 
        | I think the main thing at play is the fact that you can have
        | 64+ GB of very fast RAM directly coupled to the CPU/GPU, and the
        | benefits of that from a latency/co-accessibility point of view.
       | 
        | These numbers are certainly impressive when you look at the
        | power envelopes of these systems.
       | 
        | Worth considering/noting that the cost of an M3 Max system with
        | the minimum RAM config is ~2x the price of a 4090...
        
         | densh wrote:
          | Apple silicon's memory is fast only in comparison to consumer
          | CPUs, which stagnated for ages with only 2 memory channels -
          | fine in the 4-core era but making no sense at all with modern
          | core counts. Memory scaling on GPUs is much better, even on
          | the consumer front.
        
       | mightytravels wrote:
       | Use this Whisper derivative repo instead - one hour of audio gets
       | transcribed within a minute or less on most GPUs -
       | https://github.com/Vaibhavs10/insanely-fast-whisper
        
         | thrdbndndn wrote:
          | Could someone elaborate on how this is accomplished and
          | whether there is any quality disparity compared to the
          | original?
          | 
          | Repos like https://github.com/SYSTRAN/faster-whisper make
          | immediate sense as to why they're faster than the original
          | implementation, and lots of others do so by lowering
          | quantization precision etc. (with worse results).
          | 
          | But for this one it's not very clear how, especially
          | considering it's even much faster.
        
           | lern_too_spel wrote:
            | The Acknowledgments section on the page that GP shared says
            | it's using BetterTransformer.
            | https://huggingface.co/docs/optimum/bettertransformer/overvi...
        
           | mightytravels wrote:
            | From what I can see it is parallel batch processing - the
            | default for that repo is 24. You can reduce the batch size,
            | and if you use 1 it's as fast (or slow) as Whisper. Quality
            | is the exact same (same large model used).
        
         | claytonjy wrote:
         | Anecdotally I've found ctranslate2 to be even faster than
         | insanely-fast-whisper. On an L4, using ctranslate2 with a batch
         | size as low as 4 beats all their benchmarks except the A100
         | with flash attention 2.
         | 
         | It's a shame faster-whisper never landed batch mode, as I think
         | that's preventing folks from trying ctranslate2 more easily.
        
       | theschwa wrote:
       | I feel like this is particularly interesting in light of their
       | Vision Pro. Being able to run models in a power efficient manner
       | may not mean much to everyone on a laptop, but it's a huge
       | benefit for an already power hungry headset.
        
       | darknoon wrote:
        | Would be more interesting if PyTorch with the MPS backend were
        | also included.
        
       | LiamMcCalloway wrote:
       | I'll take this opportunity to ask for help: What's a good open
       | source transcription and diarization app or work flow?
       | 
       | I looked at https://github.com/thomasmol/cog-whisper-diarization
       | and https://about.transcribee.net/ (from the people behind
       | Audapolis) but neither work that well -- crashes, etc.
       | 
       | Thank you!
        
         | dvfjsdhgfv wrote:
          | I developed my own solution, pretty rudimentary - it divides
          | the MP3s into chunks that Whisper is able to handle and then
          | sends them one by one to the API to transcribe. Works as
          | expected so far, and it's just a couple of lines of Python
          | code.
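          | 
          | (Roughly the shape of it, as a sketch - chunk length and file
          | names are arbitrary, and it assumes the openai and pydub
          | packages:)
          | 
          |     from openai import OpenAI
          |     from pydub import AudioSegment
          | 
          |     client = OpenAI()  # uses OPENAI_API_KEY from the env
          |     audio = AudioSegment.from_mp3("episode.mp3")
          | 
          |     # 10-minute chunks keep typical MP3s under the API's
          |     # 25 MB upload limit.
          |     chunk_ms = 10 * 60 * 1000
          |     pieces = []
          |     for i, start in enumerate(range(0, len(audio), chunk_ms)):
          |         path = f"chunk_{i}.mp3"
          |         audio[start:start + chunk_ms].export(path, format="mp3")
          |         with open(path, "rb") as f:
          |             resp = client.audio.transcriptions.create(
          |                 model="whisper-1", file=f)
          |         pieces.append(resp.text)
          | 
          |     print(" ".join(pieces))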
        
         | mosselman wrote:
         | I would like to know the same.
         | 
         | It shouldn't be so hard since many apps have this. But what is
         | the most reliable way right now?
        
       | throwaw33333434 wrote:
        | META: is the M3 Pro good enough to run Cyberpunk 2077 smoothly?
        | Does the Max really make a difference?
        
         | ed_balls wrote:
         | 14 inch M3 Max may overheat.
        
       | iAkashPaul wrote:
        | There's a better parallel/batching approach that works on the
        | 30s chunks, resulting in 40X. From HF at
        | https://github.com/Vaibhavs10/insanely-fast-whisper
        | 
        | This is again not native PyTorch, so there's still room to have
        | better RTFX numbers.
        
       | bee_rider wrote:
       | Hmm... this is a dumb question, but the cookie pop up appears to
       | be in German on this site. Does anyone know which button to press
       | to say "maximally anti-tracking?"
        
         | layer8 wrote:
         | If only we had a way to machine-translate text or to block such
         | popups.
        
           | bee_rider wrote:
           | I think it is better not to block these sorts of pop-ups,
           | they are part of the agreement to use the site after all.
           | 
           | Anyway the middle button is "refuse all" according to my
           | phone, not sure how accurate the translation is or if they'll
           | shuffle the buttons for other people.
           | 
           | It is poor design to have what appear to be "accept" and
           | "refuse" both in green.
        
             | layer8 wrote:
             | According to GDPR, the site is not allowed to track you as
             | long as you haven't given your consent. There is no
             | agreement until you actually agreed to something.
        
       | jauntywundrkind wrote:
        | I wonder how AMD's XDNA accelerator will fare.
       | 
       | They just shipped 1.0 of the Ryzen AI Software and SDK. Alleges
       | ONNX, PyTorch, and Tensorflow support.
       | https://www.anandtech.com/show/21178/amd-widens-availability...
       | 
        | Interestingly, the upcoming XDNA2 is supposedly going to boost
        | generative performance a lot ("3x"). I'd kind of assumed these
        | sorts of devices would mainly be helping with inference. (I
        | don't really know what characterizes the different workloads,
        | just a naive grasp.)
        
       | lars512 wrote:
       | Is there a great speech generation model that runs on MacOS, to
       | close the loop? Something more natural than the built in MacOS
       | voices?
        
         | treprinum wrote:
         | You can try VALL-E; it takes around 5s to generate a sentence
         | on a 3090 though.
        
       | runjake wrote:
       | Anyone have overall benchmarks or qualified speculation on how an
       | optimized implementation for a 4070 compares against the M series
       | -- especially the M3 Max?
       | 
       | I'm trying to decide between the two. I figure the M3 Max would
       | crush the 4070?
        
       | sim7c00 wrote:
        | Looking at the comments, perhaps the article could be more aptly
        | titled. The author does stress that these benchmarks, maybe
        | better called test runs, are not of any scientific accuracy or
        | worth, but simply demonstrate what is being tested. I think it's
        | interesting though that Apple and 4090s are even compared in any
        | way, since the devices are so vastly different. I'd expect the
        | 4090 to be more powerful, but Apple-optimized code runs really
        | quickly on Apple Silicon despite this seemingly obvious fact,
        | and that I think is interesting. You don't need a 4090 to do
        | things if you use the right libraries. Is that what I can take
        | from it?
        
       | ex3ndr wrote:
        | So running on an M2 Ultra would beat the 4090 by 30%? (Since it
        | has 2x the GPU cores.)
        
       | atty wrote:
       | I think this is using the OpenAI Whisper repo? If they want a
       | real comparison, they should be comparing MLX to faster-whisper
       | or insanely-fast-whisper on the 4090. Faster whisper runs
       | sequentially, insanely fast whisper batches the audio in 30
       | second intervals.
       | 
        | We use whisper in production and these are our findings: we use
        | faster-whisper because we find the quality is better when you
        | include the previous segment text. Just for comparison, we find
        | that faster-whisper is generally 4-5x faster than
        | OpenAI/whisper, and insanely-fast-whisper can be another 3-4x
        | faster than faster-whisper.
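        | 
        | (For reference, the relevant faster-whisper knobs, as a sketch
        | - model size and options are illustrative:)
        | 
        |     from faster_whisper import WhisperModel
        | 
        |     model = WhisperModel("large-v3", device="cuda",
        |                          compute_type="float16")
        |     segments, info = model.transcribe(
        |         "call.mp3",
        |         beam_size=5,
        |         # Feed the previous segment's text back in as context;
        |         # better quality, but also the source of the failure
        |         # modes mentioned later in the thread.
        |         condition_on_previous_text=True,
        |     )
        |     for seg in segments:
        |         print(f"[{seg.start:.1f} -> {seg.end:.1f}] {seg.text}")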
        
         | moffkalast wrote:
          | Is insanely-fast-whisper fast enough to actually run on the
          | CPU and still transcribe in realtime? I see that none of these
          | are running quantized models, it's still fp16. Seems like
          | there's more speed left to be found.
         | 
         | Edit: I see it doesn't yet support CPU inference, should be
         | interesting once it's added.
        
           | atty wrote:
           | Insanely fast whisper is mainly taking advantage of a GPU's
           | parallelization capabilities by increasing the batch size
           | from 1 to N. I doubt it would meaningfully improve CPU
           | performance unless you're finding that running whisper
           | sequentially is leaving a lot of your CPU cores
           | idle/underutilized. It may be more complicated if you have a
           | matrix co-processor available, I'm really not sure.
        
         | youssefabdelm wrote:
         | Does insanely-fast-whisper use beam size of 5 or 1? And what is
         | the speed comparison when set to 5?
         | 
         | Ideally it also exposes that parameter to the user.
         | 
          | Speed comparisons seem moot to me when quality is sacrificed;
          | I'm working with very poor audio, so transcription quality
          | matters.
        
           | atty wrote:
           | Our comparisons were a little while ago so I apologize I
           | can't remember if we used BS 1 or 5 - whichever we picked, we
           | were consistent across models.
           | 
           | Insanely fast whisper (god I hate the name) is really a CLI
           | around Transformers' whisper pipeline, so you can just use
           | that and use any of the settings Transformers exposes, which
           | includes beam size.
           | 
           | We also deal with very poor audio, which is one of the
           | reasons we went with faster whisper. However, we have
           | identified failure modes in faster whisper that are only
           | present because of the conditioning on the previous segment,
           | so everything is really a trade off.
        
         | PH95VuimJjqBqy wrote:
         | yeah well, I find that super-duper-insanely-fast-whisper is
         | 3-4x faster than insanely-fast-whisper.
         | 
         | /s
        
           | atty wrote:
           | Yes I am not a fan of the naming either :)
        
       | accidbuddy wrote:
        | About whisper: does anyone know of a project (GitHub) for using
        | the model in real time? I'm studying a new language, and it
        | seems like a good chance to use it for learning pronunciation
        | vs. the written word.
        
       | brcmthrowaway wrote:
       | Shocked that Apple hasn't released a high end compute chip
       | competitive with NVIDIA
        
       | atlas_hugged wrote:
       | TL;DR
       | 
        | If you compare Whisper on a Mac with a Mac-optimized build vs.
        | on a PC with a non-optimized Nvidia build, the results are
        | close! If an Nvidia-optimized build is compared, it's not even
        | remotely close.
       | 
       | Pfft
       | 
       | I'll be picking up a Mac but I'm well aware it's not close to
       | Nvidia at all. It's just the best portable setup I can find that
       | I can run completely offline.
       | 
       | Do people really need to make these disingenuous comparisons to
       | validate their purchase?
       | 
       | If a mac fits your overall use case better, get a Mac. If a pc
       | with nvidia is the better choice, get it. Why all these articles
       | of "look my choice wasn't that dumb"??
        
       | 2lkj22kjoi wrote:
       | 4090 -> 82 TFLOPS
       | 
       | M3 MAX GPU -> 10 TFLOPS
       | 
       | It is 8 times slower than 4090.
       | 
        | But yeah, you can claim that a bike has faster acceleration than
        | a Ferrari because it reaches a speed of 1 km per hour sooner...
        
       ___________________________________________________________________
       (page generated 2023-12-13 23:01 UTC)