[HN Gopher] Whispers of A.I.'s Modular Future
       ___________________________________________________________________
        
       Whispers of A.I.'s Modular Future
        
       Author : pgayed
       Score  : 143 points
       Date   : 2023-02-21 05:23 UTC (17 hours ago)
        
 (HTM) web link (www.newyorker.com)
 (TXT) w3m dump (www.newyorker.com)
        
       | ilaksh wrote:
       | Whisper is great, but Google's built-in speech-to-text thing in
       | Chrome with
       | https://developer.mozilla.org/en-US/docs/Web/API/Web_Speech_...
       | has been available for quite some time and works well. Not sure
       | if Firefox has an implementation, probably one that works locally
       | but not nearly as well.
       | 
       | Obviously not having to send all of the data to Google is a big
       | deal. But practically speaking the Google recognition seems to
       | perform extremely well. So not sure this is really a new
       | capability for people if they were willing to use Google's
       | servers.
       | 
       | It seems like Firefox and Chrome should now ship with Whisper
       | built in. Or they should work on that.. and if Chrome doesn't add
       | it then.. that's suspicious.
        
         | qwertox wrote:
         | You may want to read my comment which I posted to another
         | comment, it has a comparison to Google's ASR.
        
         | krisoft wrote:
         | > So not sure this is really a new capability for people if
         | they were willing to use Google's servers.
         | 
         | Of course it is. Google is providing the API at their largess.
         | One day they can decide they won't do it anymore, or they will
         | charge you huge piles of money, or they decide they don't like
         | something about your data and they won't process it even for
         | all the money. Or they will do it but only send back nonsense
         | to you. It is their endpoint and they can do whatever they
         | please with it.
         | 
         | On the other hand you can take Whisper and use it on your own
         | computer. Or rent a computer from anyone who does that kind of
         | service. If you don't like how it performs you can refine it on
         | your own datasets. If you have tens of thousands of hours of
         | sound files you want to convert you can just calculate what it
         | will cost you to convert with Whisper. With Google? Good luck.
         | 
         | The difference between running Whisper on your own hardware vs
         | using Google's endpoint is the difference between having your
         | own car vs the mercurial rich guy not minding if skaters hold
         | on his taillights to hitch a ride for now.
        
       | jbotz wrote:
       | The article concludes:
       | 
       | "Eventually, though, someone will release a program that's nearly
       | as capable as ChatGPT, and entirely open-source. An enterprising
       | amateur will find a way to make it run for free on your laptop."
       | 
       | Well not in the near future... Large language models like ChatGPT
       | are called "large" for a reason, and they're too big for your
       | laptop for the foreseeable future. I think you'd need a computer
       | with several high-end GPUs that each have 128+GB of RAM to be
       | able to run one of those. Maybe a laptop from 2030 will do.
       | 
       | Otherwise a very good article.
        
         | wyldfire wrote:
         | > to be able to run one of those
         | 
         | It's very important to distinguish the different use cases
         | between training and inference. The amount of memory required
         | to execute the ChatGPT model, once trained, is likely much,
         | much less than 128GB.
        
           | dxuh wrote:
           | Afaik you actually need much more. About 400GB just to load
           | the trained model according to this tweet thread:
           | https://twitter.com/tomgoldsteincs/status/160019698195510069...
           | I am not quite sure how reliable the source is, but it makes
           | sense that you at least need to store the 175 billion
           | parameters that define the model in VRAM. I know that
           | GPUDirect Storage has been a thing for a short-ish while, so
           | that could help for sure, but it would definitely impact
           | execution time as well.
        
             | moyix wrote:
             | That thread makes a bunch of assumptions that seem a bit
             | dubious to me. We've known since Chinchilla that you don't
             | need 175B parameters to get GPT-3 quality - a 70B model can
             | outperform GPT-3 [1]. And his numbers assume the model is
             | loaded into GPU memory in FP16 (175B*2 = 350GB), but people
             | have shown you can quantize down to 8-bit (and in some
             | cases 4-bit) with almost no performance loss. So in 8-bit
             | precision with a 70B model you need ~70GB of VRAM, which
             | you can get with two A6000s on a desktop (each 48GB).
             | 
             | And finally there are lots of other ways to get this down.
             | Aside from quantization, people have also shown that you
             | can do pruning - getting rid of many of the weights - again
             | without much perf loss. You can also offload the weights to
             | CPU RAM or an NVME and stream them in as needed [2]; it's
             | slower but if you arrange things right the performance is
             | not too bad. There are also ways to speed up inference
             | using techniques like early exit [3], where you can skip
             | running the whole model for some tokens that are easy to
             | predict.
             | 
             | Overall it feels like within a year or two a combination of
             | better quantization/pruning, improved understanding of how
             | to train smaller LLMs, and hardware improvements will put
             | inference for ChatGPT-style models within reach of the
             | average user.
             | 
             | [1] https://towardsdatascience.com/a-new-ai-trend-chinchilla-70b...
             | 
             | [2] https://github.com/FMInference/FlexGen
             | 
             | [3] https://ai.googleblog.com/2022/12/accelerating-text-generati...
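
The memory figures in this subthread come from one piece of arithmetic: parameter count times bytes per parameter. Below is a minimal sketch of that calculation. It assumes a weights-only estimate (activations, KV cache and framework overhead are ignored) and 1 GB = 1e9 bytes; the function names are hypothetical.

```python
# Weights-only memory estimate: parameters x bytes per parameter.
# Assumption: activations, KV cache and runtime overhead are ignored,
# so real requirements will be somewhat higher than these figures.

def weights_gb(params_billions, bits_per_param):
    """Return the weights-only footprint in GB (1 GB = 1e9 bytes)."""
    total_bytes = params_billions * 1e9 * bits_per_param / 8
    return total_bytes / 1e9


def fits(params_billions, bits_per_param, gpu_memory_gb):
    """True if the weights fit in the combined memory of the given GPUs."""
    return weights_gb(params_billions, bits_per_param) <= sum(gpu_memory_gb)


if __name__ == "__main__":
    cases = [
        ("175B at FP16", 175, 16),
        ("70B at FP16", 70, 16),
        ("70B at 8-bit", 70, 8),
        ("70B at 4-bit", 70, 4),
    ]
    for label, params, bits in cases:
        print(f"{label}: {weights_gb(params, bits):.0f} GB")
    # Two 48 GB cards, as in the comment above.
    print("70B at 8-bit on 2 x 48 GB:", fits(70, 8, [48, 48]))
```

This reproduces the 350GB (175B at FP16) and ~70GB (70B at 8-bit) figures quoted above. It says nothing about speed or about the quality loss from quantization.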
        
         | [deleted]
        
         | netruk44 wrote:
         | I'll preface by saying that Mac is _definitely_ not the
         | platform for deep learning currently. However! The M2 MacBook
         | Pros can optionally be equipped with 96 GB of RAM, all of which
         | can be accessed by the GPU.
         | 
         | Assuming that somebody, somewhere, is working on improving
         | things for Mac, we may very likely already have the hardware to
         | run at least a distilled version of ChatGPT locally on laptops.
         | (And if not the MBP, then the M1 Mac Studio would be a good
         | runner-up with 128 GB of memory, though that's obviously not a
         | laptop)
        
         | api wrote:
         | You can run giant models on a high-end laptop today... they'll
         | just be _slooooowww_ since you'll be doing things like
         | swapping data in/out and leveraging the CPU. If you don't mind
         | waiting an hour for a prompt response it can work.
         | 
         | It's the same as it's always been. Any general purpose computer
         | is Turing complete. Spending more gets you faster results.
        
         | sebzim4500 wrote:
         | FlexGen can already run GPT-3 size models on commodity
         | hardware, albeit with high latency and fairly slow throughput
         | (order of 1 token/s).
        
         | l33tman wrote:
         | You can run this today on a consumer GPU at slow speed, using
         | swapping and 4-bit weights (which works surprisingly well and
         | is the new hot topic now)
        
         | CuriouslyC wrote:
         | Few points:
         | 
         | 1. Chinchilla has demonstrated models are currently
         | unnecessarily large, and would benefit more from data scaling.
         | 
         | 2. Models can be brought down in size massively by a
         | combination of distillation and quantization.
         | 
         | A GPT-3 equivalent with 50B parameters is 200gig at fp32 (4
         | bytes per parameter), and it could probably be quantized and
         | distilled to half of that or less while still being functional
         | for the vast majority of prompts. That means ~100gig memory
         | will be a target for devices in the near future.
         | 
         | Once large language models are the main thing people buy GPUs
         | (and even new computers) for, architectures will be redesigned
         | to improve GPU -> memory bandwidth and latency. I wouldn't be
         | surprised to see GPU-integrated motherboards as a future
         | premium tier offering; we're already running into heat and
         | space issues with add-on cards and it should be possible to
         | build a low latency bus to a unified system memory.
        
           | sebzim4500 wrote:
           | >1. Chinchilla has demonstrated models are currently
           | unnecessarily large, and would benefit more from data
           | scaling.
           | 
           | Not convinced. It showed this for the original
           | self-supervised task, but it might be true that the spare
           | parameters end up being useful for the later finetuning/RLHF
           | stages.
        
           | moyix wrote:
           | Many of these models are already being trained in FP16, and
           | FP8 seems likely now that the H100s support it.
        
       | thimm wrote:
       | I love Whisper. It is so easy to use. I created a small pipeline
       | that transcribes podcasts within the domain that I'm working in.
       | It helps me and my colleagues to revisit and find podcast
       | episodes without having to listen to them again. You can check it
       | out on podcasts.farmonapp.com
        
         | losthobbies wrote:
         | This is really interesting. I am looking at doing something
         | similar. Do you mind me asking what the Back-end API call is
         | written in? I had looked at Deepgram and might try putting a
         | small project together.
        
           | thimm wrote:
           | It is all written in Python and it uses the original Python
           | bindings. I'm using mkdocs to convert the transcripts into a
           | website.
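
For readers wondering what a transcript-to-mkdocs step might look like, here is a minimal sketch. It is not thimm's actual pipeline, which is not shown in the thread. It assumes the result dictionary returned by the openai-whisper package's `transcribe` call, which contains a `segments` list with `start`, `end` and `text` fields; the helper names and the sample data are made up for illustration.

```python
# Turn a Whisper-style result dict into a Markdown page that a static
# site generator such as mkdocs can render.
# Assumption: `result` has the shape produced by openai-whisper, e.g.
#   import whisper
#   model = whisper.load_model("base")
#   result = model.transcribe("episode.mp3")

def format_timestamp(seconds):
    """Format a position in seconds as HH:MM:SS (fractions dropped)."""
    hours, remainder = divmod(int(seconds), 3600)
    minutes, secs = divmod(remainder, 60)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}"


def transcript_to_markdown(title, result):
    """Build one Markdown document with a timestamped line per segment."""
    lines = [f"# {title}", ""]
    for segment in result.get("segments", []):
        stamp = format_timestamp(segment["start"])
        lines.append(f"**{stamp}** {segment['text'].strip()}")
        lines.append("")
    return "\n".join(lines)


if __name__ == "__main__":
    # Hypothetical sample data standing in for a real transcription.
    example = {
        "segments": [
            {"start": 0.0, "end": 4.2, "text": " Welcome to the show."},
            {"start": 64.5, "end": 70.0, "text": " Today we talk about soil."},
        ]
    }
    print(transcript_to_markdown("Episode 1", example))
```

Writing one such file per episode into the mkdocs `docs/` directory would also make the transcripts searchable through the site's built-in search.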
        
       | rileyphone wrote:
       | Great article, we are only just beginning to see the impact of
       | Whisper. I hope at least that it will trickle into my Alexa
       | sooner rather than later, but I've been scheming other uses for
       | it too. Dictation to notes so I can think out loud. Make
       | transcripts of talks that I would rather read than listen to. The
       | possibilities are endless.
        
       | ur-whale wrote:
       | https://github.com/ggerganov/whisper.cpp
        
       | braindead_in wrote:
       | Whisper is awesome, but managing it in a production environment
       | is not easy. I am waiting for OpenAI (or someone else) to offer
       | an API with a Real Time Factor of < 1. RTF is inference
       | time/duration of the file. We could really use that.
        
         | garblegarble wrote:
         | Doesn't whisper.cpp already get you that? It takes ~6 seconds
         | per 30 second segment on an M1 Max with the Large model. Do you
         | mean you want snappy appearance of words shortly after you say
         | them, rather than having to recognise in 30 second segments?
        
         | qwertox wrote:
         | Getting a server running is easy if you use
         | https://github.com/ahmetoner/whisper-asr-webservice as a guide.
         | It's then a REST API which you post the file to and get the
         | transcription in return.
         | 
         | But I don't know what you consider being "in production". If
         | it's for internal use then it is enough.
         | 
         | Here are some comparisons of running it on GPU vs CPU:
         | according to
         | https://github.com/MiscellaneousStuff/openai-whisper-cpu the
         | medium model needs 1.7 seconds to transcribe 30 seconds of
         | audio when run on a GPU.
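
The Real Time Factor mentioned at the top of this subthread is inference time divided by audio duration, so the figures quoted here can be compared directly. A minimal sketch follows. The function name is hypothetical, and the two data points are only the numbers reported in the comments above, not new measurements.

```python
# Real Time Factor (RTF) = inference time / audio duration.
# RTF < 1 means transcription runs faster than real time.

def real_time_factor(inference_seconds, audio_seconds):
    """Return the RTF for one transcription run."""
    if audio_seconds <= 0:
        raise ValueError("audio duration must be positive")
    return inference_seconds / audio_seconds


if __name__ == "__main__":
    # (inference seconds, audio seconds) as reported in this thread.
    reports = {
        "whisper.cpp, large model, M1 Max": (6.0, 30.0),
        "medium model on a GPU": (1.7, 30.0),
    }
    for label, (spent, duration) in reports.items():
        rtf = real_time_factor(spent, duration)
        verdict = "faster" if rtf < 1 else "slower"
        print(f"{label}: RTF = {rtf:.2f} ({verdict} than real time)")
```

Both reported setups are well under 1. What an RTF below 1 does not capture is latency: a model that works on 30 second segments can still make the first words appear late.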
        
       | wyldfire wrote:
       | Aside - I'd love to see similar Rust and Zig implementations like
       | whisper.cpp.
       | 
       | I'll donate $150 USD to the Zig and Rust foundations as a bounty
       | for respective MIT-licensed implementations of these. Let's keep
       | it simple - scalar instructions, no need for
       | intrinsics/assembly. Ideally there would be some tests.
       | 
       | whisper.cpp looks like a simple-enough-but-very-practical
       | application and I think it would help promote these modern
       | languages to have a simple and portable demonstration like this.
        
         | [deleted]
        
         | [deleted]
        
         | fbdab103 wrote:
         | What does a Rust/Zig port buy that the current implementation
         | cannot do?
        
           | wyldfire wrote:
           | Nothing. I tried to address my desire for these items at the
           | end there - it would serve as an excellent demonstration of
           | the power of these languages (to take on a performance
           | critical task like audio transcription).
        
           | itake wrote:
           | Just taking a stab, but if your code is primarily written in
           | Rust/Zig, it's really annoying calling C/C++ libraries,
           | because you have to build them and keep the bindings in sync
           | after updates.
        
       | tikkun wrote:
       | What utilities related to Whisper do you wish existed? What have
       | you had to build yourself?
       | 
       | On the end user application side, I wish there was something that
       | let me pick a podcast of my choosing, get it fully transcribed,
       | and get an embeddings search plus Q&A answers on top of that
       | podcast or set of chosen podcasts. I've seen ones for specific
       | podcasts, but I'd like one where I can choose the podcast.
       | (Probably won't build it)
       | 
       | Also on the end user side, I wish there was an Otter alternative
       | (still paid $30/mo, but unlimited minutes per month) that had
       | longer transcription limits. (Started building this, not much
       | interest from users though)
       | 
       | Things I've seen on the dev tool side:
       | 
       | Gladia (API call version of Whisper)
       | 
       | Whisper.cpp
       | 
       | Whisper webservice (https://github.com/ahmetoner/whisper-asr-
       | webservice) - via this thread
       | 
       | Live microphone demo (not real time, it still does it in chunks)
       | https://github.com/mallorbc/whisper_mic
       | 
       | Streamlit UI https://github.com/hayabhay/whisper-ui
       | 
       | Whisper playground https://github.com/saharmor/whisper-playground
       | 
       | Real time whisper https://github.com/shirayu/whispering
       | 
       | Whisper as a service https://github.com/schibsted/WAAS
       | 
       | Improved timestamps and speaker identification
       | https://github.com/m-bain/whisperX
       | 
       | MacWhisper https://goodsnooze.gumroad.com/l/macwhisper
       | 
       | Cross-platform desktop Whisper that supports semi-realtime
       | https://github.com/chidiwilliams/buzz
        
         | gtirloni wrote:
         | I think AssemblyAI has tools for that.
        
         | thundergolfer wrote:
         | This demo lets you choose the podcast, and is open-source:
         | https://modal-labs--whisper-pod-transcriber-fastapi-app.moda...
         | 
         | https://github.com/modal-labs/modal-examples/tree/main/06_gp...
         | 
         | Transcribes 1hr of audio in roughly 1min, using parallelisation
         | across CPUs.
        
         | [deleted]
        
         | causi wrote:
         | Add sponsor-skipping into that. Give me a transcript, let me
         | select a series of words, then audio containing that series of
         | words gets skipped on all remaining episodes.
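
The sponsor-skipping idea above comes down to finding a chosen word sequence in a timestamped transcript and returning the time ranges to skip. Here is a minimal sketch of that matching step. It assumes word-level timestamps are already available as `(word, start_seconds, end_seconds)` tuples, which is an assumption about the input rather than something every Whisper setup provides. The function names and sample words are hypothetical.

```python
import re

# Find every occurrence of a selected phrase in a word-level transcript
# and return the (start, end) time ranges a player could skip.
# Assumption: `words` is a list of (text, start_seconds, end_seconds).

def normalize(word):
    """Lowercase a word and strip punctuation so 'Acme.' matches 'acme'."""
    return re.sub(r"[^\w']", "", word.lower())


def find_phrase_spans(words, phrase):
    """Return [(start, end), ...] for each exact occurrence of `phrase`."""
    target = [normalize(w) for w in phrase.split()]
    target = [w for w in target if w]
    if not target:
        return []
    tokens = [normalize(text) for text, _, _ in words]
    spans = []
    i = 0
    while i + len(target) <= len(tokens):
        if tokens[i:i + len(target)] == target:
            last = i + len(target) - 1
            spans.append((words[i][1], words[last][2]))
            i += len(target)  # continue after the match, no overlaps
        else:
            i += 1
    return spans


if __name__ == "__main__":
    # Made-up sample transcript for illustration only.
    sample = [
        ("This", 0.0, 0.2), ("episode", 0.2, 0.6), ("is", 0.6, 0.7),
        ("sponsored", 0.7, 1.2), ("by", 1.2, 1.3), ("Acme.", 1.3, 1.8),
        ("Welcome", 2.0, 2.4), ("back", 2.4, 2.7),
    ]
    print(find_phrase_spans(sample, "sponsored by acme"))
```

Exact matching is the simplest possible choice. Since transcripts contain errors and ad reads vary between episodes, a real tool would probably need fuzzy matching.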
        
       | aaron695 wrote:
       | [dead]
        
       | nmfisher wrote:
       | Whisper is great, but I also wouldn't overlook other next-
       | generation models (RNN-T/Zipformer/etc) trained on 50k+ hour
       | datasets. These also perform very well.
       | 
       | That being said, Whisper is clearly a far cry from
       | "intelligence". This should be clear when you feed it 5 seconds
       | of silence and get hallucinated garbage in return. It's much more
       | akin to compressing those huge datasets into something that can
       | feasibly be run on recent hardware. That's not to downplay how
       | impressive that is, just to draw a clear line between
       | "compression" and "intelligence".
        
         | wyldfire wrote:
         | Don't state-of-the-art commercial systems do something similar?
         | I assume there must be some automatic gain boosting the noise
         | at the frontend of most pipelines. I know I've gotten
         | transcribed voicemails that really just are silence but the
         | transcript shows lots and lots of hallucinated words.
         | 
         | Regardless of "intelligence" it's got real utility.
        
           | uncanneyvalley wrote:
           | It's not that your audio is being amplified, it's that the
           | VAD classifier is poorly tuned. The noise should never even
           | reach the recognition stage. Whisper's hallucinations are
           | pretty severe, but are improved by adding VAD to its
           | pipeline.
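
To illustrate where a VAD stage sits in such a pipeline, here is a minimal sketch that keeps only the regions loud enough to plausibly contain speech, so silence never reaches the recognizer. It is a plain energy gate, not a trained VAD classifier and not what Whisper or any commercial system uses. The function names, the 30 ms frame size and the 0.02 threshold are assumptions chosen for the example, and samples are assumed to be floats in [-1, 1].

```python
import math

# A crude energy gate standing in for a real voice activity detector.
# Assumption: `samples` is mono audio as floats in [-1, 1].

def frame_rms(frame):
    """Root-mean-square level of one frame (0.0 for an empty frame)."""
    if not frame:
        return 0.0
    return math.sqrt(sum(s * s for s in frame) / len(frame))


def speech_regions(samples, sample_rate, frame_ms=30, threshold=0.02):
    """Return [(start_s, end_s), ...] for runs of frames above threshold."""
    frame_len = max(1, int(sample_rate * frame_ms / 1000))
    regions = []
    start = None  # sample index where the current active run began
    for offset in range(0, len(samples), frame_len):
        active = frame_rms(samples[offset:offset + frame_len]) >= threshold
        if active and start is None:
            start = offset
        elif not active and start is not None:
            regions.append((start / sample_rate, offset / sample_rate))
            start = None
    if start is not None:  # audio ended while still active
        regions.append((start / sample_rate, len(samples) / sample_rate))
    return regions


if __name__ == "__main__":
    # Synthetic signal: 0.3 s silence, 0.3 s loud square wave, 0.3 s silence.
    rate = 1000
    silence = [0.0] * 300
    tone = [0.5 if i % 2 == 0 else -0.5 for i in range(300)]
    print(speech_regions(silence + tone + silence, rate))
```

Only the returned regions would be passed on for transcription. An energy gate cannot tell speech from loud non-speech such as wind or breathing, which is why real pipelines use a trained classifier for this step.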
        
           | qwertox wrote:
           | I have an app on my phone which creates a 1 minute audio file
           | when I press a button. I have a lavalier microphone connected
           | to the phone and use it to record notes while riding my bike.
           | It's always 1 minute because that is usually enough, and if I
           | see that I need more time, I record an overlapping second
           | file.
           | 
           | Last week I set up a Whisper instance on my server and have
           | been feeding it with these files. The result is pretty good.
           | I usually can remember what I was saying when I read the
           | transcription, which usually contains a couple of errors.
           | Then there are those added hallucinations which are entire
           | sentences, like:
           | 
           | ----
           | 
           | 00:00.000 --> 00:05.000 Also temperaturmäßig ist es recht
           | gut. [So temperature wise, it's pretty good.]
           | 
           | 00:05.000 --> 00:09.000 Der eine hat 12 Grad, der andere 10.
           | [One has 12 degrees, the other 10. (I have two temperature
           | sensors mounted on the bike, ESP32 streaming the data to the
           | phone via BLE)]
           | 
           | 00:09.000 --> 00:12.000 Also sagen wir mal, 10 Grad. [So
           | let's say 10 degrees.]
           | 
           | 00:14.000 --> 00:19.000 Es ist bewölkt und windig. [It's
           | cloudy and windy.]
           | 
           | 00:20.000 --> 00:24.000 Aber irgendwie vom Wetter her gut.
           | [But somehow from the weather it's good.]
           | 
           | 00:24.000 --> 00:31.000 Ich habe heute überhaupt nichts
           | gegessen und sehr wenig getrunken. [I ate nothing at all
           | today and drank very little.]
           | 
           | 00:54.000 --> 00:59.000 Vielen Dank für's Zuschauen! [Thanks
           | for watching!]
           | 
           | Transcribed in 77.2 seconds
           | 
           | ----
           | 
           | The last sentence, "Thanks for watching!" is a complete
           | hallucination. There were 30 seconds remaining which were me
           | breathing and the wind blowing into the microphone and it
           | came up with that comment.
           | 
           | I usually comment on the weather because I take note of what
           | I am wearing, and it allows me to better prepare for future
           | rides.
           | 
           | 77 seconds for the 60 second file because my server has no
           | GPU, so I'm running the large model on the CPU (in a VM which
           | has 8 cores assigned to it from a Ryzen 9 5950X). I've been
           | considering buying a small PC with an RTX 3060 only for
           | inferencing, but it may be too expensive. I tried Google
           | Speech-To-Text and it is nowhere near as good as Whisper under
           | these conditions (having the wind noise and the heavy
           | breathing).
           | 
           | This is Google's result:
           | 
           | ----
           | 
           | "Also temperaturmäßig es ist recht gut, der eine hat 12°
           | andere 10. Es ist angemalte 10 Grad. Es ist bewölkt und
           | windig, aber er hat sie vom Wetter her gut, ich wollte
           | überhaupt nichts gegessen und sehr wenig getrunken."
           | 
           | ["So temperature-wise it's pretty good, one has 12° other
           | 10. It's painted 10 degrees. It's cloudy and windy, but he
           | has it good from the weather, I did not want to eat anything
           | at all and drank very little."]
           | 
           | ----
           | 
           | Also, whisper.cpp doesn't seem to generate the same results,
           | and they appear to be not so good (in this case it was almost
           | just as good). I just tested the same file on whisper.cpp
           | with the large model and it's even funnier:
           | 
           | ----
           | 
           | [00:00:00.000 --> 00:00:05.000] also temperaturmäßig ist es
           | recht gut [...]
           | 
           | [00:00:05.000 --> 00:00:09.000] der eine hat 12 Grad, der
           | andere 10 [...]
           | 
           | [00:00:09.000 --> 00:00:12.000] also sagen wir mal so 10 Grad
           | [...]
           | 
           | [00:00:12.000 --> 00:00:19.000] es ist bewölkt und windig
           | [...]
           | 
           | [00:00:19.000 --> 00:00:24.000] aber irgendwie vom Wetter her
           | gut [...]
           | 
           | [00:00:27.000 --> 00:00:31.000] ich habe heute überhaupt
           | nichts gegessen und sehr wenig getrunken [...]
           | 
           | [00:00:31.000 --> 00:00:35.000] das ist der Grund, warum ich
           | so viel auf dem Knie gehe [this is the reason why I go so
           | much on the knee]
           | 
           | [00:00:35.000 --> 00:00:39.000] das war's, bis zum nächsten
           | Mal! [that's it, until next time!]
           | 
           | [00:00:39.000 --> 00:00:59.000] Danke fürs Zuschauen! [Thanks
           | for watching!]
           | 
           | `time` yields 567.63s user 1.99s system 755% cpu 1:15.36
           | total
           | 
           | ----
           | 
           | The first 30 seconds, where the text is clearly understood,
           | is inferenced within ~10-15 seconds. It's the "silence" which
           | makes the AI go crazy on the workload.
           | 
           | The idea behind this is to set up a system which then sends
           | me an email with a map and trail of the ride as well as the
           | transcriptions of the notes.
        
             | IanCal wrote:
             | Instead of setting up a machine for inference, try modal
             | labs (no affiliation):
             | https://modal.com/docs/guide/whisper-transcriber
             | 
             | Pay per second GPU processing, with an example of running
             | whisper over 10 GPUs in parallel.
        
               | qwertox wrote:
               | Interesting. I thought that with these offerings I had to
               | rent a VM with GPU and pay the hourly rate for as long as
               | a VM is running.
               | 
               | So this is really 0 USD when not in use? I'm also
               | intending to use this for transcribing my phone answering
               | machine recordings, so the transcription requests come in
               | at random times which means that the transcription
               | service should be constantly available.
        
               | IanCal wrote:
               | Most are, modal is a very different offering where it's
               | $0 when not in use. They have some other very interesting
               | ideas like charging you for CPU time rather than wall
               | time.
               | 
               | It's a newer business so I guess you should factor that
               | risk in though.
        
           | nmfisher wrote:
           | I occasionally get a single hallucinated word (more like a
           | mis-transcription) where the audio contains a
           | clunk/bang/cough/etc, but I've never had full hallucinated
           | phrases from clean silence.
           | 
           | There are a couple of GitHub discussions on the Whisper
           | repository with various fixes/hacks to deal with it:
           | https://github.com/openai/whisper/discussions/679
           | https://github.com/openai/whisper/discussions/813
           | 
           | If you get a chance, I encourage you to try out the other
           | newer models I mentioned, I think you'd be very impressed.
        
             | pixl97 wrote:
             | I don't see this much different than what commonly happens
             | with humans when we hear our named called when it was some
             | environmental noise.
             | 
             | As for the silence, I wonder why the the model even
             | receives it. I would think a lot of that would be
             | compressed out of existence to save bandwidth.
        
         | jasode wrote:
         | _> , just to draw a clear line between "compression" and
         | "intelligence"_
         | 
         | Not disagreeing with you but your sentence reminded me of
         | research looking at the link between "compression" and
         | "intelligence":
         | 
         | https://www.google.com/search?q=compression+is+a+form+of+int...
        
         | panarky wrote:
         | It's a tool that's free as in freedom, and it is incredibly
         | useful.
         | 
         | How does discovering that it doesn't handle some weird use case
         | diminish its utility as a tool?
         | 
         | Hammers are awesome at driving nails into wood. But if you
         | strike the wood directly without a nail, the hammer puts a dent
         | in the wood. Is that a defect in the hammer? Does it somehow
         | make the hammer any less useful as a tool?
        
           | pessimizer wrote:
           | 5 seconds of silence isn't a weird use case.
        
           | nmfisher wrote:
           | Not sure what makes you think I'm diminishing its utility as
           | a tool; like I said, it's an incredible tool and I lean on it
           | very heavily for various speech-processing pipelines.
           | 
           | I'm just pointing out that Whisper definitely hasn't "solved"
           | speech recognition, and there's still a lot more fertile
           | ground to cover from a research perspective.
        
             | pixl97 wrote:
             | In my personal belief, speech recognition, at least based on
             | any human model, is not solvable; at best we can get 'as
             | good as an average set of humans that speak the language'.
             | 
             | People mumble crap all the time to other humans and need to
             | repeat what they say with proper enunciation.
             | 
             | People have hearing problems, which would correlate to
             | microphone quality/placement issues when dealing with
             | computer systems.
             | 
             | Then there are issues where people say one thing
             | incorrectly, but the person following/listening to the
             | directions knows the procedure and does the correct thing.
             | If you asked the speaker what they said again, they'd say
             | they said the 'correct' thing in the first place.
             | 
             | And this is something that I've done before as an example.
             | 
             | Me: "Click the X button to start the process"
             | 
             | Person writing the notes: "Click the Y button to start the
             | process"
             | 
             | Person writing the notes: You meant click the Y button,
             | right?
             | 
             | Me: Yea, that's what I said.
             | 
             | Oops. It's when we get into things like this that we run
             | into the unsolvable speech recognition issues because we
             | don't generally understand our own error bars on what we
             | say. The speech quality between a public speaker and the
             | average joe I'm sure has a very wide range.
        
               | panarky wrote:
               | _> at best we can get as good as an average set of
               | humans_
               | 
               | Whisper is already superhuman, more accurate than
               | experienced human transcribers.
        
       | krmbzds wrote:
       | https://archive.is/FEGf1
        
       | mark_l_watson wrote:
       | I really like the Sutton quote. Manually written AI systems show
       | promising early results and then fail when compared to machine
       | learning approaches.
        
       | EZ-Cheeze wrote:
       | "There could be even larger changes--we talk a lot, and almost
       | all of it goes into the ether. What if people recorded
       | conversations as a matter of course, made transcripts, and
       | referred back to them the way we now look back to old texts or
       | e-mails?"
       | 
       | Minority Report-level awesome futurism
       | 
       | One thing that changes everything
        
         | cloudking wrote:
         | This exists already https://www.rewind.ai/
        
           | anonyfox wrote:
           | I had no idea. instantly subscribed. not joking.
        
       ___________________________________________________________________
       (page generated 2023-02-21 23:03 UTC)