[HN Gopher] Whispers of A.I.'s Modular Future
___________________________________________________________________
Whispers of A.I.'s Modular Future
Author : pgayed
Score : 143 points
Date : 2023-02-21 05:23 UTC (17 hours ago)
(HTM) web link (www.newyorker.com)
(TXT) w3m dump (www.newyorker.com)
| ilaksh wrote:
| Whisper is great, but Google's built-in speech-to-text thing in
| Chrome with
| https://developer.mozilla.org/en-US/docs/Web/API/Web_Speech_...
| has been available for quite some time and works well. Not sure if
| Firefox has an implementation, probably one that works locally but
| not nearly as well.
|
| Obviously not having to send all of the data to Google is a big
| deal. But practically speaking the Google recognition seems to
| perform extremely well. So not sure this is really a new
| capability for people if they were willing to use Google's
| servers.
|
| It seems like Firefox and Chrome should now ship with Whisper
| built in. Or they should work on that.. and if Chrome doesn't add
| it then.. that's suspicious.
| qwertox wrote:
| You may want to read my reply to another comment; it has a
| comparison to Google's ASR.
| krisoft wrote:
| > So not sure this is really a new capability for people if
| they were willing to use Google's servers.
|
| Of course it is. Google is providing the API at their largess.
| One day they can decide they won't do it anymore, or they will
| charge you huge piles of money, or they decide they don't like
| something about your data and they won't process it even for
| all the money. Or they will do it but only send back nonsense to
| you. It is their endpoint and they can do whatever they please
| with it.
|
| On the other hand you can take Whisper and use it on your own
| computer. Or rent a computer from anyone who does that kind of
| service. If you don't like how it performs you can refine it on
| your own datasets. If you have tens of thousands of hours of sound
| files you want to convert you can just calculate what it will
| cost you to convert with Whisper. With Google? Good luck.
|
| The difference between running Whisper on your own hardware vs
| using Google's endpoint is the difference between having your
| own car vs the mercurial rich guy not minding if skaters hold
| on his taillights to hitch a ride for now.
| jbotz wrote:
| The article concludes:
|
| "Eventually, though, someone will release a program that's nearly
| as capable as ChatGPT, and entirely open-source. An enterprising
| amateur will find a way to make it run for free on your laptop."
|
| Well not in the near future... Large language models like ChatGPT
| are called "large" for a reason, and they're too big for your
| laptop for the foreseeable future. I think you'd need a computer
| with several high-end GPUs that each have 128+GB of RAM to be
| able to run one of those. Maybe a laptop from 2030 will do.
|
| Otherwise a very good article.
| wyldfire wrote:
| > to be able to run one of those
|
| It's very important to distinguish the different use cases
| between training and inference. The amount of memory required
| to execute the ChatGPT model, once trained, is likely much,
| much less than 128GB.
| dxuh wrote:
| Afaik you actually need much more. About 400GB just to load
| the trained model according to this tweet thread:
| https://twitter.com/tomgoldsteincs/status/160019698195510069...
| I am not quite sure how reliable the source is, but it makes
| sense that you at least need to store the 175 billion parameters
| that define the model in VRAM. I know that GPUDirect Storage has
| been a thing for a short-ish while, so that could help for sure,
| but it would definitely impact execution time as well.
| moyix wrote:
| That thread makes a bunch of assumptions that seem a bit
| dubious to me. We've known since Chinchilla that you don't
| need 175B parameters to get GPT-3 quality - a 70B model can
| outperform GPT-3 [1]. And his numbers assume the model is
| loaded into GPU memory in FP16 (175B*2 = 350GB), but people
| have shown you can quantize down to 8-bit (and in some
| cases 4-bit) with almost no performance loss. So in 8-bit
| precision with a 70B model you need ~70GB of VRAM, which
| you can get with two A6000s on a desktop (each 48GB).
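|
| A rough sketch of the arithmetic above, counting weights only
| (activations, caches and framework overhead are ignored, and 1 GB
| is taken as 10^9 bytes). The helper name is made up for
| illustration.

```python
def weight_memory_gb(params_billion: float, bits_per_weight: int) -> float:
    """Approximate memory needed just to hold the weights, in GB (10^9 bytes)."""
    bytes_per_weight = bits_per_weight / 8
    # (params_billion * 1e9 weights) * bytes_per_weight / 1e9 bytes per GB
    # simplifies to the product below.
    return params_billion * bytes_per_weight


if __name__ == "__main__":
    # Model sizes and precisions mentioned in the comment above.
    for params, bits in [(175, 16), (70, 16), (70, 8), (70, 4)]:
        print(f"{params}B @ {bits}-bit: ~{weight_memory_gb(params, bits):.0f} GB")
```

| By this estimate the 8-bit 70B case fits in two 48GB cards with
| some headroom, which is the point being made.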
|
| And finally there are lots of other ways to get this down.
| Aside from quantization, people have also shown that you
| can do pruning - getting rid of many of the weights - again
| without much perf loss. You can also offload the weights to
| CPU RAM or an NVME and stream them in as needed [2]; it's
| slower but if you arrange things right the performance is
| not too bad. There are also ways to speed up inference
| using techniques like early exit [3], where you can skip
| running the whole model for some tokens that are easy to
| predict.
|
| Overall it feels like within a year or two a combination of
| better quantization/pruning, improved understanding of how
| to train smaller LLMs, and hardware improvements will put
| inference for ChatGPT-style models within reach of the
| average user.
|
| [1] https://towardsdatascience.com/a-new-ai-trend-chinchilla-70b...
|
| [2] https://github.com/FMInference/FlexGen
|
| [3] https://ai.googleblog.com/2022/12/accelerating-text-generati...
| [deleted]
| netruk44 wrote:
| I'll preface by saying that Mac is _definitely_ not the
| platform for deep learning currently. However! The M2 MacBook
| Pros can optionally be equipped with 96 GB of RAM, all of which
| can be accessed by the GPU.
|
| Assuming that somebody, somewhere, is working on improving
| things for Mac, we may very likely already have the hardware to
| run at least a distilled version of ChatGPT locally on laptops.
| (And if not the MBP, then the M1 Mac Studio would be a good
| runner-up with 128 GB of memory, though that's obviously not a
| laptop)
| api wrote:
| You can run giant models on a high-end laptop today... they'll
| just be _slooooowww_ since you'll be doing things like
| swapping data in/out and leveraging the CPU. If you don't mind
| waiting an hour for a prompt response it can work.
|
| It's the same as it's always been. Any general purpose computer
| is Turing complete. Spending more gets you faster results.
| sebzim4500 wrote:
| FlexGen can already run GPT-3 size models on commodity
| hardware, albeit with high latency and fairly slow throughput
| (order of 1 token/s).
| l33tman wrote:
| You can run this today on a consumer GPU at slow speed, using
| swapping and 4-bit weights (which works surprisingly well and
| is the new hot topic now)
| CuriouslyC wrote:
| Few points:
|
| 1. Chinchilla has demonstrated models are currently
| unnecessarily large, and would benefit more from data scaling.
|
| 2. Models can be brought down in size massively by a
| combination of distillation and quantization.
|
| A GPT-3 equivalent with 50B parameters quantized to fp4 is
| 200gig, and it could probably be distilled to half of that or
| less while still being functional for the vast majority of
| prompts. That means ~100gig memory will be a target for devices
| in the near future.
|
| Once large language models are the main thing people buy GPUs
| (and even new computers) for, architectures will be redesigned
| to improve gpu -> memory bandwidth and latency. I wouldn't be
| surprised to see GPU-integrated motherboards as a future
| premium tier offering; we're already running into heat and
| space issues with add-on cards and it should be possible to
| build a low latency bus to a unified system memory.
| sebzim4500 wrote:
| >1. Chinchilla has demonstrated models are currently
| unnecessarily large, and would benefit more from data
| scaling.
|
| Not convinced. It showed this for the original
| self-supervised task, but it might be true that the spare
| parameters end up being useful for the later finetuning/RLHF
| stages.
| moyix wrote:
| Many of these models are already being trained in FP16, and
| FP8 seems likely now that the H100s support it.
| thimm wrote:
| I love Whisper. It is so easy to use. I created a small pipeline
| that transcribes podcasts within the domain that I'm working in.
| It helps me and my colleagues to revisit and find podcast
| episodes without having to listen to them again. You can check it
| out on podcasts.farmonapp.com
| losthobbies wrote:
| This is really interesting. I am looking at doing something
| similar. Do you mind me asking what the Back-end API call is
| written in? I had looked at Deepgram and might try putting a
| small project together.
| thimm wrote:
| It is all written in Python and it uses the original Python
| bindings. I'm using mkdocs to convert the transcripts into a
| website.
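|
| A minimal sketch of that kind of step: turning transcript segments
| into a Markdown page that mkdocs can render. It assumes segments
| shaped like the "segments" list the reference Python package
| returns from transcribe() (dicts with "start", "end" and "text").
| The function names and page layout are hypothetical, not the
| actual pipeline described above.

```python
def format_timestamp(seconds: float) -> str:
    """Render seconds as MM:SS, or H:MM:SS for long episodes."""
    total = int(seconds)
    hours, rest = divmod(total, 3600)
    minutes, secs = divmod(rest, 60)
    if hours:
        return f"{hours}:{minutes:02d}:{secs:02d}"
    return f"{minutes:02d}:{secs:02d}"


def segments_to_markdown(title: str, segments: list[dict]) -> str:
    """Build a Markdown page with one timestamped paragraph per segment."""
    lines = [f"# {title}", ""]
    for seg in segments:
        text = seg["text"].strip()
        if text:  # skip empty segments
            lines.append(f"**[{format_timestamp(seg['start'])}]** {text}")
            lines.append("")
    return "\n".join(lines)
```

| Writing the returned string to a .md file under the docs directory
| would be enough for mkdocs to pick it up and make it searchable.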
| rileyphone wrote:
| Great article, we are only just beginning to see the impact of
| Whisper. I hope at least that it will trickle into my Alexa
| sooner rather than later, but I've been scheming other uses for
| it too. Dictation to notes so I can think out loud. Make
| transcripts of talks that I would rather read than listen to. The
| possibilities are endless.
| ur-whale wrote:
| https://github.com/ggerganov/whisper.cpp
| braindead_in wrote:
| Whisper is awesome, but managing it in a production environment
| is not easy. I am waiting for OpenAI (or someone else) to offer an
| API with a Real Time Factor of < 1. RTF is inference time/duration
| of the file. We could really use that.
| garblegarble wrote:
| Doesn't whisper.cpp already get you that? It takes ~6 seconds
| per 30 second segment on an M1 Max with the Large model. Do you
| mean you want snappy appearance of words shortly after you say
| them, rather than having to recognise in 30 second segments?
| qwertox wrote:
| Getting a server running is easy if you use
| https://github.com/ahmetoner/whisper-asr-webservice as a guide.
| It's then a REST API which you post the file to and get the
| transcription in return.
|
| But I don't know what you consider being "in production". If
| it's for internal use then it is enough.
|
| Here are some comparisons of running it on GPU vs CPU. According
| to https://github.com/MiscellaneousStuff/openai-whisper-cpu the
| medium model needs 1.7 seconds to transcribe 30 seconds of
| audio when run on a GPU.
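|
| Using the definition given upthread (RTF = inference time /
| duration of the file), the two figures quoted so far work out as
| below. The numbers are taken from the comments as stated, not
| measured independently, and the function name is made up for
| illustration.

```python
def real_time_factor(inference_seconds: float, audio_seconds: float) -> float:
    """RTF = inference time / audio duration; below 1 is faster than real time."""
    if audio_seconds <= 0:
        raise ValueError("audio duration must be positive")
    return inference_seconds / audio_seconds


if __name__ == "__main__":
    # (inference seconds, audio seconds) as quoted in this thread.
    quoted = {
        "whisper.cpp, Large model, M1 Max": (6.0, 30.0),
        "medium model on a GPU": (1.7, 30.0),
    }
    for label, (inference, audio) in quoted.items():
        print(f"{label}: RTF ~ {real_time_factor(inference, audio):.2f}")
```

| Both come out well below 1, so the open question is latency (when
| the words appear) rather than throughput.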
| wyldfire wrote:
| Aside - I'd love to see similar rust and zig implementations like
| whisper.cpp.
|
| I'll donate $150 USD to zig and rust foundations as a bounty for
| respective MIT-licensed implementations of these. Let's keep it
| simple - scalar instructions, no need for intrinsics/assembly.
| Ideally there would be some tests.
|
| whisper.cpp looks like a simple-enough-but-very practical
| application and I think it would help promote these modern
| languages to have a simple and portable demonstration like this.
| [deleted]
| [deleted]
| fbdab103 wrote:
| What does a Rust/Zig port buy that the current implementation
| cannot do?
| wyldfire wrote:
| Nothing. I tried to address my desire for these items at the
| end there - it would serve as an excellent demonstration of
| the power of these languages (to take on a performance
| critical task like audio transcription).
| itake wrote:
| Just taking a stab, but if your code is primarily written in
| rust/zig, it's really annoying calling c/cpp libraries,
| because you have to build them and keep the bindings in
| sync after updates.
| tikkun wrote:
| What utilities related to Whisper do you wish existed? What have
| you had to build yourself?
|
| On the end user application side, I wish there was something that
| let me pick a podcast of my choosing, get it fully transcribed,
| and get an embeddings search plus answer q&a on top of that
| podcast or set of chosen podcasts. I've seen ones for specific
| podcasts, but I'd like one where I can choose the podcast.
| (Probably won't build it)
|
| Also on the end user side, I wish there was an Otter alternative
| (still paid $30/mo, but unlimited minutes per month) that had
| longer transcription limits. (Started building this, not much
| interest from users though)
|
| Things I've seen on the dev tool side:
|
| Gladia (API call version of Whisper)
|
| Whisper.cpp
|
| Whisper webservice
| (https://github.com/ahmetoner/whisper-asr-webservice) - via this
| thread
|
| Live microphone demo (not real time, it still does it in chunks)
| https://github.com/mallorbc/whisper_mic
|
| Streamlit UI https://github.com/hayabhay/whisper-ui
|
| Whisper playground https://github.com/saharmor/whisper-playground
|
| Real time whisper https://github.com/shirayu/whispering
|
| Whisper as a service https://github.com/schibsted/WAAS
|
| Improved timestamps and speaker identification
| https://github.com/m-bain/whisperX
|
| MacWhisper https://goodsnooze.gumroad.com/l/macwhisper
|
| Crossplatform desktop Whisper that supports semi-realtime
| https://github.com/chidiwilliams/buzz
| gtirloni wrote:
| I think AssemblyAI has tools for that.
| thundergolfer wrote:
| This demo lets you choose the podcast, and is open-source:
| https://modal-labs--whisper-pod-transcriber-fastapi-app.moda...
|
| https://github.com/modal-labs/modal-examples/tree/main/06_gp...
|
| Transcribes 1hr of audio in roughly 1min, using parallelisation
| across CPUs.
| [deleted]
| causi wrote:
| Add sponsor-skipping into that. Give me a transcript, let me
| select a series of words, then audio containing that series of
| words gets skipped on all remaining episodes.
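|
| A minimal sketch of that idea, assuming word-level timestamps are
| available for each episode. The (word, start, end) tuple layout and
| the function names are hypothetical, and exact matching like this
| would miss ad reads that are reworded between episodes.

```python
import string


def normalize(word: str) -> str:
    """Lowercase and strip surrounding punctuation so 'Acme.' matches 'acme'."""
    return word.lower().strip(string.punctuation + string.whitespace)


def find_skip_ranges(words, phrase):
    """Return (start, end) times for every occurrence of `phrase`.

    `words` is a list of (word, start_seconds, end_seconds) tuples in
    spoken order; `phrase` is the series of words the user selected.
    """
    target = [normalize(w) for w in phrase.split()]
    target = [w for w in target if w]
    if not target:
        return []
    tokens = [normalize(w) for w, _, _ in words]
    ranges = []
    i = 0
    while i + len(target) <= len(tokens):
        if tokens[i:i + len(target)] == target:
            # Start of the first matched word to end of the last one.
            ranges.append((words[i][1], words[i + len(target) - 1][2]))
            i += len(target)  # do not report overlapping matches
        else:
            i += 1
    return ranges
```

| A player could then seek past each returned range during playback.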
| aaron695 wrote:
| [dead]
| nmfisher wrote:
| Whisper is great, but I also wouldn't overlook other next-
| generation models (RNN-T/Zipformer/etc) trained on 50k+ hour
| datasets. These also perform very well.
|
| That being said, Whisper is clearly a far cry from
| "intelligence". This should be clear when you feed it 5 seconds
| of silence and get hallucinated garbage in return. It's much more
| akin to compressing those huge datasets into something that can
| feasibly be run on recent hardware. That's not to downplay how
| impressive that is, just to draw a clear line between
| "compression" and "intelligence".
| wyldfire wrote:
| Don't state of the art commercial systems do something similar?
| I assume there must be some automatic gain boosting the noise
| at the frontend of most pipelines, I know I've gotten
| transcribed voicemails that really just are silence but the
| transcript shows lots and lots of hallucinated words.
|
| Regardless of "intelligence" it's got real utility.
| uncanneyvalley wrote:
| It's not that your audio is being amplified, it's that the
| VAD classifier is poorly tuned. The noise should never even
| reach the recognition stage. Whisper's hallucinations are
| pretty severe, but are improved by adding VAD to its
| pipeline.
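|
| A very rough sketch of gating audio before it reaches the
| recognizer. This is a plain energy threshold, not a trained VAD
| classifier, and the frame length and threshold are arbitrary
| assumptions chosen for illustration.

```python
import math


def frame_rms(frame):
    """Root-mean-square energy of one frame of float samples in [-1, 1]."""
    if not frame:
        return 0.0
    return math.sqrt(sum(s * s for s in frame) / len(frame))


def speech_frames(samples, frame_len=480, threshold=0.02):
    """Yield (frame_index, frame) only for frames whose RMS exceeds the threshold.

    Frames at or below the threshold are treated as silence and are
    never passed on. 480 samples is 30 ms at a 16 kHz sample rate.
    """
    for index, start in enumerate(range(0, len(samples), frame_len)):
        frame = samples[start:start + frame_len]
        if frame_rms(frame) > threshold:
            yield index, frame
```

| Loud non-speech noise can still pass an energy gate like this one,
| which is one reason a properly tuned classifier matters.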
| qwertox wrote:
| I have an app on my phone which creates a 1 minute audio file
| when I press a button. I have a lavalier microphone connected
| to the phone and use it to record notes while riding my bike.
| It's always 1 minute because that is usually enough, and if I
| see that I need more time, I record an overlapping second
| file.
|
| Last week I set up a Whisper instance on my server and have
| been feeding it with these files. The result is pretty good.
| I usually can remember what I was saying when I read the
| transcription, which usually contains a couple of errors.
| Then there are those added hallucinations which are entire
| sentences, like:
|
| ----
|
| 00:00.000 --> 00:05.000 Also temperaturmäßig ist es recht
| gut. [So temperature wise, it's pretty good.]
|
| 00:05.000 --> 00:09.000 Der eine hat 12 Grad, der andere 10.
| [One has 12 degrees, the other 10. (I have two temperature
| sensors mounted on the bike, ESP32 streaming the data to the
| phone via BLE)]
|
| 00:09.000 --> 00:12.000 Also sagen wir mal, 10 Grad. [So
| let's say 10 degrees.]
|
| 00:14.000 --> 00:19.000 Es ist bewölkt und windig. [It's
| cloudy and windy.]
|
| 00:20.000 --> 00:24.000 Aber irgendwie vom Wetter her gut.
| [But somehow from the weather it's good.]
|
| 00:24.000 --> 00:31.000 Ich habe heute überhaupt nichts
| gegessen und sehr wenig getrunken. [I ate nothing at all
| today and drank very little.]
|
| 00:54.000 --> 00:59.000 Vielen Dank für's Zuschauen! [Thanks
| for watching!]
|
| Transcribed in 77.2 seconds
|
| ----
|
| The last sentence, "Thanks for watching!" is a complete
| hallucination. There were 30 seconds remaining which were me
| breathing and the wind blowing into the microphone and it
| came up with that comment.
|
| I usually comment on the weather because I take note of what
| I am wearing, and it allows me to better prepare for future
| rides.
|
| 77 seconds for the 60 second file because my server has no
| GPU, so I'm running the large model on the CPU (in a VM which
| has 8 cores assigned to it from a Ryzen 9 5950X). I've been
| considering buying a small PC with an RTX 3060 only for
| inferencing, but it may be too expensive. I tried Google
| Speech-To-Text and it is nowhere near as good as Whisper under
| these conditions (having the wind noise and the heavy
| breathing).
|
| This is Google's result:
|
| ----
|
| "Also temperaturmäßig es ist recht gut, der eine hat 12°
| andere 10. Es ist angemalte 10 Grad. Es ist bewölkt und
| windig, aber er hat sie vom Wetter her gut, ich wollte
| überhaupt nichts gegessen und sehr wenig getrunken."
|
| ["So temperature-wise it's pretty good, one has 12° other
| 10. It's painted 10 degrees. It's cloudy and windy, but he
| has it good from the weather, I did not want to eat anything
| at all and drank very little."]
|
| ----
|
| Also, whisper.cpp doesn't seem to generate the same results,
| and they appear to be not so good (in this case it was almost
| just as good). I just tested the same file on whisper.cpp
| with the large model and it's even funnier:
|
| ----
|
| [00:00:00.000 --> 00:00:05.000] also temperaturmäßig ist es
| recht gut [...]
|
| [00:00:05.000 --> 00:00:09.000] der eine hat 12 Grad, der
| andere 10 [...]
|
| [00:00:09.000 --> 00:00:12.000] also sagen wir mal so 10 Grad
| [...]
|
| [00:00:12.000 --> 00:00:19.000] es ist bewölkt und windig
| [...]
|
| [00:00:19.000 --> 00:00:24.000] aber irgendwie vom Wetter her
| gut [...]
|
| [00:00:27.000 --> 00:00:31.000] ich habe heute überhaupt
| nichts gegessen und sehr wenig getrunken [...]
|
| [00:00:31.000 --> 00:00:35.000] das ist der Grund, warum ich
| so viel auf dem Knie gehe [this is the reason why I go so
| much on the knee]
|
| [00:00:35.000 --> 00:00:39.000] das war's, bis zum nächsten
| Mal! [that's it, until next time!]
|
| [00:00:39.000 --> 00:00:59.000] Danke fürs Zuschauen! [Thanks
| for watching!]
|
| `time` yields 567.63s user 1.99s system 755% cpu 1:15.36
| total
|
| ----
|
| The first 30 seconds, where the text is clearly understood,
| is inferenced within ~10-15 seconds. It's the "silence" which
| makes the AI go crazy on the workload.
|
| The idea behind this is to set up a system which then sends
| me an email with a map and trail of the ride as well as the
| transcriptions of the notes.
| IanCal wrote:
| Instead of setting up a machine for inference, try modal
| labs (no affiliation):
| https://modal.com/docs/guide/whisper-transcriber
|
| Pay per second GPU processing, with an example of running
| whisper over 10 GPUs in parallel.
| qwertox wrote:
| Interesting. I thought that with these offerings I had to
| rent a VM with GPU and pay the hourly rate for as long as
| a VM is running.
|
| So this is really 0 USD when not in use? I'm also
| intending to use this for transcribing my phone answering
| machine recordings, so the transcription requests come in
| at random times which means that the transcription
| service should be constantly available.
| IanCal wrote:
| Most are; Modal is a very different offering where it's
| $0 when not in use. They have some other very interesting
| ideas like charging you for CPU time rather than wall
| time.
|
| It's a newer business so I guess you should factor that
| risk in though.
| nmfisher wrote:
| I occasionally get a single hallucinated word (more like a
| mis-transcription) where the audio contains a
| clunk/bang/cough/etc, but I've never had full hallucinated
| phrases from clean silence.
|
| There are a couple of GitHub discussions on the Whisper
| repository with various fixes/hacks to deal with it:
| https://github.com/openai/whisper/discussions/679
| https://github.com/openai/whisper/discussions/813
|
| If you get a chance, I encourage you to try out the other
| newer models I mentioned, I think you'd be very impressed.
| pixl97 wrote:
| I don't see this as much different from what commonly happens
| with humans when we think we hear our name called but it was
| just some environmental noise.
|
| As for the silence, I wonder why the model even
| receives it. I would think a lot of that would be
| compressed out of existence to save bandwidth.
| jasode wrote:
| _> , just to draw a clear line between "compression" and
| "intelligence"_
|
| Not disagreeing with you, but your sentence reminded me of research
| looking at the link between "compression" and "intelligence":
|
| https://www.google.com/search?q=compression+is+a+form+of+int...
| panarky wrote:
| It's a tool that's free as in freedom, and it is incredibly
| useful.
|
| How does discovering that it doesn't handle some weird use case
| diminish its utility as a tool?
|
| Hammers are awesome at driving nails into wood. But if you
| strike the wood directly without a nail, the hammer puts a dent
| in the wood. Is that a defect in the hammer? Does it somehow
| make the hammer any less useful as a tool?
| pessimizer wrote:
| 5 seconds of silence isn't a weird use case.
| nmfisher wrote:
| Not sure what makes you think I'm diminishing its utility as
| a tool; like I said, it's an incredible tool and I lean on it
| very heavily for various speech-processing pipelines.
|
| I'm just pointing out that Whisper definitely hasn't "solved"
| speech recognition, and there's still a lot more fertile
| ground to cover from a research perspective.
| pixl97 wrote:
| In my personal belief, speech recognition, at least based on
| any human model, isn't solvable; at best we can get 'as good
| as an average set of humans that speak the language'.
|
| People mumble crap all the time to other humans and need to
| repeat what they say with proper enunciation.
|
| People have hearing problems, which would correlate to
| microphone quality/placement issues when dealing with
| computer systems.
|
| Then there are issues where people say one thing
| incorrectly, but the person following/listening to the
| directions knows the procedure and does the correct thing.
| If you asked the speaker what they said again, they'd say
| they said the 'correct' thing in the first place.
|
| And this is something that I've done before as an example.
|
| Me: "Click the X button to start the process"
|
| Person writing the notes: "Click the Y button to start the
| process"
|
| Person writing the notes: You meant click the Y button,
| right?
|
| Me: Yea, that's what I said.
|
| Oops. It's when we get into things like this that we run into
| the unsolvable speech recognition issues, because we don't
| generally understand our own error bars on what we say. The
| speech quality between a public speaker and the average Joe
| I'm sure has a very wide range.
| panarky wrote:
| _> at best we can get as good as an average set of
| humans_
|
| Whisper is already superhuman, more accurate than
| experienced human transcribers.
| krmbzds wrote:
| https://archive.is/FEGf1
| mark_l_watson wrote:
| I really like the Sutton quote. Manually written AI systems show
| promising early results and then fail when compared to machine
| learning approaches.
| EZ-Cheeze wrote:
| "There could be even larger changes--we talk a lot, and almost
| all of it goes into the ether. What if people recorded
| conversations as a matter of course, made transcripts, and
| referred back to them the way we now look back to old texts or
| e-mails?"
|
| Minority Report-level awesome futurism
|
| One thing that changes everything
| cloudking wrote:
| This exists already https://www.rewind.ai/
| anonyfox wrote:
| I had no idea. instantly subscribed. not joking.
___________________________________________________________________
(page generated 2023-02-21 23:03 UTC)