[HN Gopher] Whisper - open source speech recognition by OpenAI
___________________________________________________________________
Whisper - open source speech recognition by OpenAI
Author : _just7_
Score : 850 points
Date : 2022-09-21 16:16 UTC (6 hours ago)
(HTM) web link (openai.com)
(TXT) w3m dump (openai.com)
| wongarsu wrote:
| > About a third of Whisper's audio dataset is non-English, and it
| is alternately given the task of transcribing in the original
| language or translating to English. We find this approach is
| particularly effective at learning speech to text translation and
| outperforms the supervised SOTA on CoVoST2 to English translation
| zero-shot.
|
| That's intriguing. You can just set the model to transcribe
| everything into English, no matter which language the speaker is
| using, and it just works. Given that many people are much better
| at understanding English than at speaking it, this might make
| voice interfaces much more accessible without much work.
| FloatArtifact wrote:
| This would be a cool thing to integrate into Dragonfly
| https://github.com/dictation-toolbox/dragonfly
| rexreed wrote:
| I'd love to find a way to test this with longer audio, but I don't
| have GPU resources and I'm not exactly sure how to load that into
| the Colab. Is anyone planning on hosting or sharing a model that
| others can use to test longer-form audio (for podcast
| transcription)?
| londons_explore wrote:
| I've never seen transcription and translation combined into a
| single step like this before...
|
| Have I been living under a rock, or is this new?
|
| I assume it should help performance, because it means emphasis,
| timing and tone can be used to inform the translation. Helps make
| better guesses about information missing from the source
| language.
| jerpint wrote:
| I recorded myself speaking French and it was able to translate it
| decently well on my laptop. Very impressive!
| jfoster wrote:
| It seems like OpenAI are finally living up to their name for once
| with this release? Anything I'm missing?
|
| From what I can gather:
|
| 1. Includes model weights. I can't find the URL, but they
| reference them enough and have a CLI tool, so I presume I just
| haven't found them yet.
|
| 2. Includes code: https://github.com/openai/whisper
|
| 3. Released under MIT License:
| https://github.com/openai/whisper/blob/main/LICENSE
| thesausageking wrote:
| It's one model and in a non-strategic area where there are
| existing open source projects (Kaldi, DeepSpeech, ...).
|
| For a company that raised $1B, that's not exactly living up to
| their name and original mission.
| whimsicalism wrote:
| > It's one model and in a non-strategic area where there are
| existing open source projects (Kaldi, DeepSpeech, ...).
|
| I can already tell this is much better than any of the
| existing open source projects with the exception of the wav2*
| sequence of projects and potentially nvidia's nemo.
| StevenWaterman wrote:
| (Model weights from
| https://github.com/openai/whisper/blob/main/whisper/__init__...)
|
| "tiny.en": "https://openaipublic.azureedge.net/main/whisper/models/d3dd5..."
| "tiny": "https://openaipublic.azureedge.net/main/whisper/models/65147..."
| "base.en": "https://openaipublic.azureedge.net/main/whisper/models/25a85..."
| "base": "https://openaipublic.azureedge.net/main/whisper/models/ed3a0..."
| "small.en": "https://openaipublic.azureedge.net/main/whisper/models/f953a..."
| "small": "https://openaipublic.azureedge.net/main/whisper/models/9ecf7..."
| "medium.en": "https://openaipublic.azureedge.net/main/whisper/models/d7440..."
| "medium": "https://openaipublic.azureedge.net/main/whisper/models/345ae..."
| "large": "https://openaipublic.azureedge.net/main/whisper/models/e4b87..."
| mmastrac wrote:
| Large is 3GB to save everyone a click. Tiny is 72MB.
| anigbrowl wrote:
| That's unexpectedly lightweight - enough to run on some phones.
| solarmist wrote:
| This kind of model is harder to abuse, so I guess it passed
| their internal checks much more easily.
|
| I can understand not releasing GPT-3, even if I disagree with
| the decision.
| ignoramous wrote:
| > _This kind of model is harder to abuse, so I guess it
| passed their internal checks much more easily._
|
| The version I choose to believe: _stability.ai_ ate DALL-E
| for lunch, and that woke them up.
| solarmist wrote:
| This is probably also true.
| jfoster wrote:
| True. The potential of GPT-3 to cause internet mayhem was/is
| significant. I would argue that the mere act of announcing it
| was still a catalyst for an eventual GPT-3-like model being
| released. In revealing it, they established a target for what
| open source models could aim to achieve, and simultaneously
| got bad actors thinking about ways to abuse it.
| dwohnitmok wrote:
| > I can understand not releasing GPT-3, even if I disagree
| with the decision.
|
| Why do you disagree?
| bigyikes wrote:
| I don't see how GPT-3 is any more dangerous than Stable
| Diffusion, Photoshop, that fake news website the crazy
| person you're friends with on Facebook really likes, or any
| of the number of other tools and services that can be used
| to generate or spread fake information.
| jfoster wrote:
| All of your examples are limited in some way, but GPT-3
| wouldn't have any meaningful limits.
|
| Stable Diffusion: Marks images as AI-generated.
| (invisible watermark, but still, it's there)
|
| Photoshop: Requires time & effort from a human.
|
| Fake news website: Requires time & effort from a human.
| xkapastel wrote:
| I wouldn't really say Stable Diffusion marks images as
| AI-generated. There's a script in the Stable Diffusion
| repository that will do that, but it's not connected to
| the model itself in a meaningful way. I use Stable
| Diffusion a lot and I've never touched this script.
|
| https://github.com/CompVis/stable-
| diffusion/blob/69ae4b35e0a...
| capableweb wrote:
| What "script" are you using for doing txt2img? The
| watermark function is automatically called when you use
| the CLI in two places, https://github.com/CompVis/stable-
| diffusion/blob/69ae4b35e0a... and
| https://github.com/CompVis/stable-
| diffusion/blob/69ae4b35e0a...
|
| Trivial to remove, I give you that. But AFAIK, the
| original repository + most forks put the watermark
| automatically unless you've removed it on your own.
| spullara wrote:
| SD only does that if you don't delete the line of code
| that does it...
| mmh0000 wrote:
| Because why should the wealthy and connected be the only
| ones -allowed- to have access to such life-improving
| technology?
| solarmist wrote:
| Two reasons. First, someone else will release something
| similar. Second, I didn't see a related push from them to
| work with others in the industry to do something productive
| towards safety with the time they bought by delaying
| availability of these kinds of models. So it felt
| disingenuous.
| bredren wrote:
| This is dropping right in the middle of Interspeech 2022.
|
| I don't believe OpenAI has anyone presenting at the conference,
| so presumably this was timed to coincide with that and get buzz
| at the conference.
|
| Curious how this model compares with foss STT from the startup
| Coqui.
| revskill wrote:
| It's actually better than Google Meet's subtitle system.
| blueberrychpstx wrote:
| This is absolute garbage python as I am neither a python
| developer, nor a good developer. I was trying to play around with
| real time transcriptions. However, it does work!
|
|     * recording
|     * done recording
|     Recording saved to file.wav
|     Press enter to transcribe
|     /Users/laptop/Development/Personal/Public/pythonProject1/venv/lib/python3.9/site-packages/whisper/transcribe.py:70: UserWarning: FP16 is not supported on CPU; using FP32 instead
|       warnings.warn("FP16 is not supported on CPU; using FP32 instead")
|     Detected language: english
|     Goodbye, I need to go pick up my wife.
|     Press enter to start recording
|
| Any improvements welcome here.
|
| ```
| # This is a sample Python script.
| # Press ^R to execute it or replace it with your code.
| # Press Double | to search everywhere for classes, files, tool windows,
| # actions, and settings.
|
| def print_hi(name):
|     # Use a breakpoint in the code line below to debug your script.
|     print(f'Hi, {name}')  # Press [?]F8 to toggle the breakpoint.
|
| def record_microphone(seconds):
|     import pyaudio
|     import wave
|     CHUNK = 1024
|     FORMAT = pyaudio.paInt16
|     CHANNELS = 1
|     RATE = 44100
|     RECORD_SECONDS = seconds
|     WAVE_OUTPUT_FILENAME = "file.wav"
|     p = pyaudio.PyAudio()
|     stream = p.open(format=FORMAT, channels=CHANNELS, rate=RATE,
|                     input=True, frames_per_buffer=CHUNK)
|     print("* recording")
|     frames = []
|     for i in range(0, int(RATE / CHUNK * RECORD_SECONDS)):
|         data = stream.read(CHUNK)
|         frames.append(data)
|     print("* done recording")
|     stream.stop_stream()
|     stream.close()
|     p.terminate()
|     wf = wave.open(WAVE_OUTPUT_FILENAME, 'wb')
|     wf.setnchannels(CHANNELS)
|     wf.setsampwidth(p.get_sample_size(FORMAT))
|     wf.setframerate(RATE)
|     wf.writeframes(b''.join(frames))
|     wf.close()
|     return WAVE_OUTPUT_FILENAME
|
| if __name__ == '__main__':
|     seconds = 5
|     while True:
|         print("Press enter to start recording")
|         input()
|         filename = record_microphone(seconds)
|         print("Recording saved to " + filename)
|         print("Press enter to transcribe")
|         input()
|         import whisper
|         model = whisper.load_model("base")
|         result = model.transcribe(filename)
|         print(result["text"])
| ```
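| One easy improvement (a sketch, untested against the exact script
| above): load the model once before the loop instead of importing and
| loading it on every pass, so each round only pays for inference:
|     import whisper
|
|     model = whisper.load_model("base")  # load once, reuse for every clip
|
|     while True:
|         input("Press enter to start recording")
|         filename = record_microphone(5)
|         input("Press enter to transcribe")
|         result = model.transcribe(filename)
|         print(result["text"])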
| yawnxyz wrote:
| Oh man I remember LOVING Micro Machines as a kid.
|
| But also, this tool seems much better than Otter.ai, which gets
| every third word wrong when transcribing microbiology recordings
| alexb_ wrote:
| Combine the translation + transcription with voice synthesis, and
| once compute power allows for this to be miniaturized we will be
| able to have babel-fish technology in real life.
| no1youknowz wrote:
| This is awesome. But I really want the other way.
|
| To be able to give it text and hear the speech. A TTS (text to
| speech).
|
| As a language learner, the ability to create my own sentences
| (based on existing ones I have, changing a word here or there)
| would be amazing.
|
| How long till we have this, I wonder. I know I could use a service
| to do this currently, but I'd prefer something running locally.
|
| Hopefully someone in the OpenAI team reads this. :)
| TaylorAlexander wrote:
| I suspect this is coming. We do have decent text-to-speech systems
| already, but in this vein of "we used neural networks and now it's
| very, very good" you can imagine extending something like GPT-3:
| this speech-to-text system would let you speak to it for input,
| and the natural progression is for it to use text-to-speech to
| return the output, so you just have a voice-oriented
| conversational system.
|
| So I think TTS is a logical part of the system. I also think
| that there are peculiarities of voice interaction that aren't
| captured in text training datasets, so they would need to do
| some fine tuning on actual voice conversation to make it feel
| natural.
|
| All in due time I suppose.
| noreally_ wrote:
| A notebook is available to try with your microphone on Colab
| here: https://colab.research.google.com/drive/1nBZ-
| pDIaIi3N1DIIXvJ...
|
| I'm surprised by the quality on non-English languages, given that
| 80+% of the training data is English, and the rest is split
| between tens of languages.
| bambax wrote:
| Thanks! I played with this in French and posted the results as
| replies to this comment:
| https://news.ycombinator.com/item?id=32928643
|
| It's sometimes close to perfect, and sometimes goes off the
| rails; I think the model tries to establish some sort of
| consistency for each sentence; if it starts wrong on the first
| few words of a sentence, it can't build the rest properly.
|
| But it's super fun.
| goffi wrote:
| Really interesting, I can see a ton of potential uses.
|
| 2 questions:
|
| 1) how does it compare to state-of-the-art FOSS solutions? I'm
| thinking of DeepSpeech or Vosk
|
| 2) would it somehow be possible to associate timestamps with the
| recognized words? That would be amazing for things such as audio
| editing or skipping to a particular location in a video
| nshm wrote:
| You rightly mentioned timestamps. There are many other important
| properties of a good ASR system, like vocabulary adaptability
| (whether you can introduce new words), streaming, confidences, and
| latency of the output. Compared to Vosk models, this model cannot
| work in a streaming manner, so it's not very suitable for
| real-time applications.
|
| But in general the model is robust and accurate, and trained on an
| amount of speech we never dreamed about in Vosk. We will certainly
| benefit from this model as a teacher (together with others like
| the gigaspeech models). I recently wrote about it:
| https://alphacephei.com/nsh/2022/06/14/voting.html
| goffi wrote:
| > goffi
|
| for 2), it's actually written in the description: "phrase-level
| timestamps", so it should be possible (phrase level is neat for
| skipping to a specific location in a video, but maybe not for
| audio editing).
| IceWreck wrote:
| Is there a list of system requirements somewhere? Can it run on
| cheaper low-memory GPUs? Maybe CPUs?
| StevenWaterman wrote:
| Their models range from 70 MB to 3 GB. The largest model is
| smaller than optimised Stable Diffusion. Not sure what the
| inference speed is like; I haven't tried it myself yet.
| IceWreck wrote:
| I just tested it myself. It's fast enough on Colab, a couple of
| seconds, but not sure if it's fast enough to transcribe realtime
| audio yet.
| [deleted]
| mewse-hn wrote:
| I know this isn't a tech support forum but maybe someone here
| knows. I'm attempting the sample python code from the github and
| _almost_ get a transcription running on my work laptop without a
| GPU, but I run into this error message:
|
| >>> result = whisper.decode(model, mel, options)
|
| Traceback (most recent call last):
|
| [snip]
|
| RuntimeError: "slow_conv2d_cpu" not implemented for 'Half'
|
| It looks like a Torch error, is there some twiddling with
| "options" I can do to get it to run?
| mewse-hn wrote:
| I seem to have worked around it by tweaking the "options" line
| from the sample code to this:
|
| >>> options = whisper.DecodingOptions(fp16=False)
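| In case it helps anyone else on CPU, here is the full snippet I
| ended up with (the sample code from the repo with fp16 forced off;
| "audio.mp3" is a placeholder):
|     import whisper
|
|     model = whisper.load_model("base")
|
|     # load audio and pad/trim it to fit 30 seconds
|     audio = whisper.load_audio("audio.mp3")
|     audio = whisper.pad_or_trim(audio)
|
|     # make a log-Mel spectrogram and move it to the same device as the model
|     mel = whisper.log_mel_spectrogram(audio).to(model.device)
|
|     # detect the spoken language
|     _, probs = model.detect_language(mel)
|     print(f"Detected language: {max(probs, key=probs.get)}")
|
|     # decode, forcing FP32 so it works on CPU
|     options = whisper.DecodingOptions(fp16=False)
|     result = whisper.decode(model, mel, options)
|     print(result.text)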
| O__________O wrote:
| Anyone know if it is possible to output IPA using this?
|
| International Phonetic Alphabet (IPA)
|
| - https://wikipedia.org/wiki/International_Phonetic_Alphabet
|
| _________
|
| EDIT: Based on list of languages in the tokenizer code here,
| doesn't appear IPA is supported:
|
| https://github.com/openai/whisper/blob/5f8d4bcc254d4f3e833d3...
| jcims wrote:
| Did respectably with some mumble rap:
| https://controlc.com/d353dafb
|
| (some NSFW words in the lyrics obv)
| derangedHorse wrote:
| Whisper performed a lot better than I would've expected it to!
| mmh0000 wrote:
| Okay this is super impressive. I just downloaded Whisper and fed
| it a random flac file I had handy and it did a really good job.
| Also impressive that it works on my weak CPU:
|
| A 3m07s flac took 5m to transcribe:
|     $ whisper --device cpu 'BLACKPINK - BORN PINK/01 Pink Venom.flac'
|     Detecting language using up to the first 30 seconds. Use `--language` to specify the language
|     Detected language: korean
|     [00:00.000 --> 00:10.000] Blackpink
|     [00:11.000 --> 00:14.000] Kick in the door, wave in the coco
|     [00:14.000 --> 00:16.000] pabkonineun cinge ggyeodeul saenggag malgo
|     [00:16.000 --> 00:19.000] I talk to talk, run ways I walk walk
|     [00:19.000 --> 00:21.000] him gamgo pab pab an bwado ceog
|     [00:21.000 --> 00:24.000] By one and two by two
|     [00:24.000 --> 00:26.000] nae songgeut du hanae tamyeon ajieun jung
|     [00:26.000 --> 00:30.000] gas jasyo jigeum hwaryeohae T makes no sense
|     [00:30.000 --> 00:32.000] You couldn't get a dollar out of me
|     [00:33.000 --> 00:38.000] ja oneul bamiya nuntobeul pumgo
|     [00:38.000 --> 00:41.000] mihoneul bbaeseum down
|     [00:41.000 --> 00:43.000] Look what you made us do
|     [00:43.000 --> 00:47.000] ceonceonhi neol jamjaeul paieo
|     [00:48.000 --> 00:52.000] jami nal mankeum areumdaweo
|     [00:52.000 --> 00:53.000] I bring the pain like
|     [00:53.000 --> 00:57.000] diseutab, paengpaeng, diseutab, paengpaeng, diseutab, paengpaeng, paengpaeng
|     [00:57.000 --> 00:58.000] Get em, get em, get em
|     [00:58.000 --> 01:00.000] Straight till you don't like
|     [01:00.000 --> 01:01.000] Whoa, whoa, whoa
|     [01:01.000 --> 01:03.000] Straight till you don't like
|     [01:03.000 --> 01:04.000] Ah, ah, ah
|     [01:04.000 --> 01:05.000] Taste that, pink venom
|     [01:05.000 --> 01:06.000] Taste that, pink venom
|     [01:06.000 --> 01:08.000] Taste that, pink venom
|     [01:08.000 --> 01:09.000] Get em, get em, get em
|     [01:09.000 --> 01:11.000] Straight till you don't like
|     [01:11.000 --> 01:12.000] Whoa, whoa, whoa
|     [01:12.000 --> 01:13.000] Straight till you don't like
|     [01:13.000 --> 01:14.000] Ah, ah, ah
|     [01:14.000 --> 01:15.000] Blackpink and Amo
|     [01:15.000 --> 01:17.000] Got it by the smack ram
|     [01:17.000 --> 01:18.000] But rest in peace
|     [01:18.000 --> 01:19.000] Please light up a candle
|     [01:19.000 --> 01:20.000] This the knife of a vando
|     [01:20.000 --> 01:22.000] Messed up and I'm still in saline
|     ...SNIP...
| lunixbochs wrote:
| Looks like it defaults to the model called "small".
|
| I just ran some benchmarks - M1 Max, pytorch, with a 1.29
| second flac (looks like the matrix math was running on a single
| thread):
|     tiny     146.522ms detect_lang     549.131ms decode_one   0.057ms tokenizer
|     base     354.885ms detect_lang    1046.679ms decode_one   0.011ms tokenizer
|     small    803.892ms detect_lang    3194.503ms decode_one   0.017ms tokenizer
|     medium  2279.689ms detect_lang   10128.255ms decode_one   0.023ms tokenizer
|     large   3656.478ms detect_lang   17249.024ms decode_one   0.016ms tokenizer
| lazylion2 wrote:
| I ran it on this clip
|
| https://clips.twitch.tv/ReliablePopularWerewolfOSkomodo-pcuw...
|
| because... hard accent.
|
| On the first run Whisper thought it was Welsh, so I had to run
| with --language en, and it did pretty well
|
| https://i.imgur.com/TQiYU9X.png
|
| took 36 seconds in Google colab
| manishsharan wrote:
| Oh, it's a relief to have something open source in this field. I
| had been using Mozilla DeepSpeech for transcribing my voice notes,
| often with results ranging from hilarious to incomprehensible.
| DeepSpeech is dead, so I will be sure to check this out.
| w10-1 wrote:
| Naively, training the same model on multiple languages has
| interesting implications.
|
| On one hand, it may capture something "deeper" about language.
|
| On the other hand, it's likely to do great in general, but miss
| particularities of some language.
|
| Understanding the coverage of the training model seems a
| perennial problem. Is there any (shorthand) way to compare
| language model training corpora?
|
| Clearly if they use common subsets we have a literal comparison.
| I'm more interested in whether there's progress in characterizing
| corpora by speech styles, fluency, vocabulary sets, (noise)
| environment, emotionality, proposition types, etc.
|
| (btw: 25 minutes for a 9-minute segment on a 12-thread x86. Lots
| of jargon spelled as it sounds. Sentences capitalized but no
| punctuation. Overall good.)
| dindindin wrote:
| I'm not in the Speech Recognition circles and am looking for open
| source speech recognition I can play around with - would this be
| the new state of the art?
| mercurywells wrote:
| For me as a deaf person the current state of art (in terms of
| speed & usability) is the Recorder app on a Google Pixel phone
| (4a/6 Pro is what I've used)
| StevenWaterman wrote:
| Yes
| visarga wrote:
| Most probably
| The5thElephant wrote:
| How is it that Apple, Google, or Microsoft are not further ahead
| of the game on speech recognition like this? They have the
| resources to hire the best ML researchers and throw tons of
| computing hours at it, yet Siri, Google, and Cortana continue to
| struggle to get anywhere near this level of comprehension.
| wongarsu wrote:
| Siri and Cortana have to run at least in real time, with
| reasonable compute resources. Probably faster than real time
| when the audio gets shipped off to the cloud and transcribed
| there. This model can't do that (in the "large" version, which
| the examples use).
|
| Also, you are comparing Whisper's highlight reel with everyday
| performance of other models. Nobody shows their weaknesses in
| their highlight reel.
| alex_marchant wrote:
| Siri until iOS 15 was done in the cloud, IIRC.
| coder543 wrote:
| Someone else in this thread[0] said Whisper was running at
| 17x real time for them. So, even a weak machine might be able
| to do an acceptable approximation of real time with Whisper.
|
| Also, I feel like shipping to the cloud and back has been
| shown to be just as fast as on device transcription in a lot
| of scenarios. Doing it on device is primarily a benefit for
| privacy and offline, not necessarily latency. (Although,
| increasingly powerful smartphone hardware is starting to give
| the latency edge to local processing.)
|
| Siri's dictation has had such terrible accuracy for me (an
| American English speaker without a particularly strong
| regional accent) and everyone else I know for so many years
| that it is just a joke in my family. Google and Microsoft
| have much higher accuracy in their models. The bar is so low
| for Siri that I automatically wonder how much Whisper is
| beating Siri in accuracy... because I assume it has to be
| better than that.
|
| I really wish there was an easy demo for Whisper that I could
| try out.
|
| [0]: https://news.ycombinator.com/item?id=32928207
| lunixbochs wrote:
| 17x realtime _on a 3090_
|
| I did some basic tests on CPU, the "small" Whisper model is
| in the ballpark of 0.5x realtime, which is probably not
| great for interactive use.
|
| My models in Talon run closer to 100x realtime on CPU.
| coder543 wrote:
| "CPU" isn't necessarily the benchmark, though. Most
| smartphones going back years have ML inference
| accelerators built in, and both Intel and AMD are
| starting to build in instructions to accelerate
| inference. Apple's M1 and M2 have the same inference
| accelerator hardware as their phones and tablets. The
| question is whether this model is a good fit for those
| inference accelerators, and how well it works there, or
| how well it works running on the integrated GPUs these
| devices all have.
|
| Brute forcing the model with just traditional CPU
| instructions is fine, but... obviously going to be pretty
| slow.
|
| I have no experience on the accuracy of Talon, but I've
| heard that most open source models are basically overfit
| to the test datasets... so their posted accuracy is often
| misleading. If Whisper is substantially better in the
| real world, that's the important thing, but I have no
| idea if that's the case.
| lunixbochs wrote:
| See https://news.ycombinator.com/item?id=32929029 re
| accuracy, I'm working on a wider comparison. My models
| are generally more robust than open-source models such as
| Vosk and Silero, but I'm definitely interested in how my
| stuff compares to Whisper on difficult held-out data.
|
| > Brute forcing the model with just traditional CPU
| instructions is fine, but... obviously going to be pretty
| slow.
|
| It's not that simple. Many of the mobile ML accelerators
| are more targeted for conv net image workloads, and
| current-gen Intel and Apple CPUs have dedicated hardware
| to accelerate matrix math (which helps quite a bit here,
| and these instructions were in use in my tests).
|
| Also, not sure which model they were using at 17x
| realtime on the 3090. (If it's one of the smaller models,
| that bodes even worse for non-3090 performance.) The 3090
| is one of the fastest ML inference chips in the world, so
| it doesn't necessarily set realistic expectations.
|
| There are also plenty of optimizations that aren't
| applied to the code we're testing, but I think it's
| fairly safe to say the Large model is likely to be slow
| on anything but a desktop-gpu-class accelerator just due
| to the sheer parameter size.
| lunixbochs wrote:
| Ok, my test harness is ready. My A40 box will be busy until later
| tonight, but on an NVIDIA A2 [1], this is the batchsize=1
| throughput I'm seeing. Common Voice, default Whisper settings,
| card is staying at 97-100% utilization:
|     tiny.en:   ~18 sec/sec
|     base.en:   ~14 sec/sec
|     small.en:  ~6 sec/sec
|     medium.en: ~2.2 sec/sec
|     large:     ~1.0 sec/sec (fairly wide variance when ramping up,
|                as this is slow to process individual clips)
|
| [1] https://www.nvidia.com/en-us/data-center/products/a2/
| coder543 wrote:
| Isn't the A2 much weaker than a 3090? So those results
| are promising.
|
| EDIT: for what it's worth, Nvidia rated the A2 at 18
| TFLOPS of FP16, and Apple rates the current A16 Neural
| Engine at 17 TFLOPS of FP16. I'm sure it's not an "apples
| to apples" comparison.
| The5thElephant wrote:
| Good point about realtime or not, however with ML I have
| found the weaknesses get addressed pretty fast by someone.
| There is a big step between proof of concept and practical
| application though, so we shall see.
| Kuinox wrote:
| OpenAI is owned by Microsoft FYI.
| neongreen wrote:
| Is it? Googling suggests that Microsoft invested in OpenAI
| but doesn't actually own it.
| Kuinox wrote:
| Oh, my bad, looks like they only bought an exclusive license to
| GPT-3.
| fxtentacle wrote:
| This AI has a 30 second delay on the audio processing because
| it needs to be able to "look into the future" to get these good
| results. That 30s delay would be unacceptable for
| Siri/Google/Cortana.
| coder543 wrote:
| A lot of models we currently use seem to do the same thing.
| The model will transcribe a "best effort" interpretation in
| real time, then as you continue speaking, you'll see it
| go back and make corrections. I'm sure you can feed the first
| X seconds you have into the model, followed by (30-X) seconds
| of silence, and it will do real time transcription just
| fine... it would be weird if this broke anything. Then, as
| you get more speech, you continue getting better
| transcription of the first 30 seconds, then you switch to a
| 30 second sliding window.
|
| Maybe I'm missing something, but I don't see the problem
| here.
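| A minimal sketch of that padding idea using the public helpers (my
| assumptions: 16 kHz mono float32 samples from whatever capture code
| you already have; pad_or_trim zero-fills the partial audio out to
| the model's 30-second window):
|     import numpy as np
|     import whisper
|
|     model = whisper.load_model("base")
|
|     def transcribe_partial(audio_so_far: np.ndarray) -> str:
|         # pad the partial audio with silence out to the 30s window the model expects
|         chunk = whisper.pad_or_trim(audio_so_far)
|         mel = whisper.log_mel_spectrogram(chunk).to(model.device)
|         options = whisper.DecodingOptions(language="en", fp16=False)
|         return whisper.decode(model, mel, options).text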
| fxtentacle wrote:
| Yes, that's because Whisper - like pretty much all of them
| - uses a Transformer encoder with Attention layers. And the
| Attention layers learn to look into the future.
|
| And yes, what you describe could be done. But no, it won't
| reduce latency that much, because the model itself learns
| to delay the prediction w.r.t. the audio stream. That's why
| ASR-generated subtitles usually need to be re-aligned after
| the speech recognition step. And that's why there is
| research such as the FastEmit paper to prevent that, but
| then it is a trade-off between latency and quality again.
|
| Also, running your "low-latency" model with 1s chunks means
| you now need to evaluate the AI 30x as often as if you'd be
| using 30s chunks.
| coder543 wrote:
| You just said the models pretty much all work the same
| way, then you said doing what I described won't help. I'm
| confused. Apple and Google both offer real time, on
| device transcription these days, so _something_ clearly
| works. And if you say the models already all do this,
| then running it 30x as often isn't a problem anyways,
| since again... people are used to that.
|
| I doubt people run online transcription for long periods
| of time on their phone very often, so the battery impact
| is irrelevant, and the model is ideally running (mostly)
| on a low power, high performance inference accelerator
| anyways, which is common to many SoCs these days.
| fxtentacle wrote:
| I meant that most research that has been released in
| papers or code recently uses the same architecture. But
| all of those research papers use something different than
| Apple and Google.
|
| As for running the AI 30x, on current hardware that'll
| make it slower than realtime. Plus all of those 1GB+
| models won't fit into a phone anyway.
| beastman82 wrote:
| In my unmeasured empirical observation Google has amazing
| speech recognition
| jeffbee wrote:
| I tried feeding the four examples from this announcement into
| Google as dictation inputs and it just sits there blankly. On
| the JFK speech test file in the repo, Google understands
| perfectly. The samples in the announcement are clearly
| outside the capabilities of anything Google has launched
| publicly, but I don't know how that translates to overall
| utility in every day applications.
| The5thElephant wrote:
| I agree they have the best compared to Apple, Amazon,
| Microsoft. However I don't think it is as good as what is
| being shown here by OpenAI.
| Vetch wrote:
| My experience with the APIs is Google is excellent and
| Microsoft is slightly better. And the offline model I've
| been using that's nearly as good as both is facebook's
| wav2vec2-large-960h-lv60-self.
|
| Don't believe what's on marketing pages, they rarely
| transfer to the real world. Will have to make time to try
| it and see. In theory, given task diversity and sheer
| number of hours, it should be a lot more robust but will
| wait on evidence before believing any claims on SoTA.
| andy_xor_andrew wrote:
| Hold on, it does not only speech recognition, but also language
| translation, in the same model?
|
| What an interesting approach. What benefits does this have over
| having two dedicated models, one for speech-to-text, and another
| for translation?
|
| It just seems so odd, given the problems of speech-to-text and
| Spanish-to-English seems so different from one another (in terms
| of the problem domain). Seems so unusual to have both handled by
| one model!
|
| Does knowledge of speech-to-text carry over into knowledge of
| translation? Does knowledge of translation carry over into
| knowledge of speech-to-text? So weird.
| newhaus1994 wrote:
| My understanding is that multi-modal models are the primary
| focus of OpenAI right now, due to their stated goal of
| achieving AGI. This product is probably better thought of as an
| offshoot of their work to create a fully generalizable model,
| rather than a specific attempt to provide
| translation/transcription services.
| TaylorAlexander wrote:
| It seems these days that language-oriented models are commonly
| becoming multilingual by default. There are a lot of common
| threads when understanding sentence construction between
| different languages. French and English have different rules
| but they will still have things like nouns, adjectives,
| subjects, prepositions, etc. It seems that by training models
| on many languages you get both a more robust understanding of
| language, and it saves you the trouble of having to make many
| more localized models for every language. I also believe that
| the other languages help the models construct sentences in
| languages which have very small training sets. If it has a few
| examples in a rare language as well as good translations to a
| better-known language, then it can provide good support for the
| rare language.
|
| We also see in image generation models that multi-modal
| networks are more powerful than single purpose networks. As we
| move towards more advanced AI systems I suspect we will see
| more and more generalizable networks with distinct advantages
| over separate networks that get plugged together.
| magicalhippo wrote:
| Would a multilingual model perhaps also be better at
| understanding non-native speech?
| thuttinger wrote:
| I tried running it in realtime with live audio input (kind of).
|
| If you want to give it a shot, you can find the python script in
| this repo: https://github.com/tobiashuttinger/openai-whisper-
| realtime
|
| A bit more context on how it works: the system's default audio
| input is captured with Python, split into small chunks, and then
| fed to OpenAI's original transcription function. It tries
| (currently rather poorly) to detect word breaks and doesn't split
| the audio buffer in those cases. Given how the model is designed,
| this isn't the most natural fit, but I found it worth trying, and
| it works acceptably well.
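| For anyone who wants the gist without reading the repo, here is a
| minimal sketch of that loop (my assumptions, not necessarily the
| repo's exact approach: the sounddevice package for capture, 16 kHz
| mono, fixed 5-second chunks, no word-break detection):
|     import sounddevice as sd
|     import whisper
|
|     model = whisper.load_model("base")
|     SAMPLE_RATE = 16000   # Whisper expects 16 kHz mono input
|     CHUNK_SECONDS = 5     # naive fixed-size chunks; words can be cut at the edges
|
|     while True:
|         # record one chunk from the default input device
|         audio = sd.rec(int(CHUNK_SECONDS * SAMPLE_RATE),
|                        samplerate=SAMPLE_RATE, channels=1, dtype="float32")
|         sd.wait()
|         # transcribe() accepts a float32 numpy array directly
|         result = model.transcribe(audio.flatten(), fp16=False)
|         print(result["text"])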
| minimaxir wrote:
| The model output can be tweaked to produce audio embeddings (akin
| to BERT for text embeddings and CLIP for image embeddings), which
| can lead to some _interesting_ applications as the previous two
| examples have demonstrated.
| FerociousTimes wrote:
| What do you mean exactly by audio embeddings?
| minimaxir wrote:
| Represent a given set of audio inputs as a numeric vector,
| which can then for example be finetuned for other ML/AI
| problems or placed in an embeddings database for easy ANN
| search with similar audio clips. In the extreme case it could
| facilitate better AI audio generation similar to how CLIP can
| guide a VQGAN.
|
| Although the 30 second minimum input is a bit of a bummer
| since it may not allow much granularity in the resulting
| embeddings.
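| A rough sketch of one way to pull such a vector out (this is my
| assumption about using the encoder module directly and mean-pooling
| over time, not a documented embedding API; "clip.wav" is a
| placeholder):
|     import torch
|     import whisper
|
|     model = whisper.load_model("base")
|
|     audio = whisper.load_audio("clip.wav")
|     audio = whisper.pad_or_trim(audio)           # the encoder consumes 30s windows
|     mel = whisper.log_mel_spectrogram(audio).to(model.device)
|
|     # run only the audio encoder, then average over time for one vector per clip
|     with torch.no_grad():
|         features = model.encoder(mel.unsqueeze(0))   # (1, frames, dims)
|     embedding = features.mean(dim=1).squeeze(0)      # (dims,)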
| lynguist wrote:
| How can I use this (or something similar) for live translation? I
| don't mind if there's a 30s delay.
|
| As in I don't want to input a file, I want to input the
| microphone sound.
| agnos wrote:
| Would also like to know this. It looks like they're processing
| the audio file in 30 second chunks, so a naive approach of
| keeping a buffer of 30-second input stream chunks and just
| continually writing to an output .mp3 could work...
| blueberrychpstx wrote:
| Was wondering the same.
|
| I really wish I would have been paying attention in Unix
| class...
|
| Something like `microphone | chunk 3s | whisper | stdout` would
| be SO COOL!!! I think that's possible but too lazy to look
| more.
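| Something in that spirit can be faked with a shell loop (a rough
| sketch, assuming ALSA's arecord and the whisper CLI; it isn't a
| true pipe, and audio that falls between chunks is lost):
|     while true; do
|       # grab a short chunk from the default microphone as 16 kHz mono WAV
|       arecord -q -f S16_LE -r 16000 -c 1 -d 5 chunk.wav
|       # transcribe the chunk; the text is printed to stdout
|       whisper chunk.wav --model tiny.en
|     done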
| spywaregorilla wrote:
| Hmm are there any noteworthy open sourced speech to speech
| models? Like transform a spoken line to another voice, copying
| both the words spoken and the inflections?
| cercatrova wrote:
| Their Scottish accent example is pretty good, I'd like to see it
| work on some very strong English accents like this one:
| https://www.youtube.com/watch?v=nJ7QB3om-QY
| homarp wrote:
| Detected language: english
|
| [00:00.000 --> 00:05.400] Gordy and County Kerry are
| investigating the theft of up to 60 sheep on Mount Brandon.
|
| [00:05.400 --> 00:10.400] One of the farmers is offering a
| reward for information leading to the return of the use,
|
| [00:10.400 --> 00:12.200] which are worth thousands of euro.
|
| [00:12.200 --> 00:14.200] Well, I'm fine with that.
|
| [00:14.200 --> 00:15.200] That's right.
|
| [00:15.200 --> 00:16.200] Do you own them?
|
| [00:16.200 --> 00:17.200] Anyone can say it.
|
| [00:17.200 --> 00:18.200] Fine with that.
|
| [00:18.200 --> 00:22.720] Last Saturday, Mikey Joe O'Shea
| brought his flock of Scotch sheep down from the mountain
|
| [00:22.720 --> 00:25.320] commonage ahead of lambing.
|
| [00:25.320 --> 00:29.840] He discovered over 50 were missing,
| allowing for a number of deaths and
|
| [00:29.840 --> 00:30.840] strays.
|
| [00:30.840 --> 00:34.600] Mikey is convinced over 45 sheep have
| been stolen.
|
| [00:34.600 --> 00:35.600] It was a good night.
|
| [00:35.600 --> 00:36.600] It would be a full moon there.
|
| [00:36.600 --> 00:37.600] It would be a good night.
|
| [00:37.600 --> 00:38.600] It would be bright out.
|
| [00:38.600 --> 00:40.600] There could be anyone going up in the
| mountains.
|
| [00:40.600 --> 00:41.600] It would be a good night.
|
| [00:41.600 --> 00:43.600] Well, that was 45 sheep missing.
|
| [00:43.600 --> 00:49.600] Mikey and the lambs and everything in
| the sheep, they counted out a nice bit of money.
|
| [00:49.600 --> 00:52.200] They've been doing the boat in
| Nassan.
|
| [00:52.200 --> 00:53.200] It's a big one.
|
| [00:53.200 --> 00:54.200] It's a big one.
|
| [00:54.200 --> 00:55.200] It's a big one.
|
| [00:55.200 --> 00:59.000] Mikey's next door neighbor says some
| of his sheep have also been stolen.
|
| [00:59.000 --> 01:00.000] Come back.
|
| [01:00.000 --> 01:01.000] Come back.
|
| [01:01.000 --> 01:02.000] Come back.
|
| [01:02.000 --> 01:03.000] I've been missing about 10 years.
|
| [01:03.000 --> 01:04.000] It's not all that difficult.
|
| [01:04.000 --> 01:06.320] All they've got to do is have a good
| dog.
|
| [01:06.320 --> 01:10.560] Have a good dog and go at night, some
| moonshine night.
|
| [01:10.560 --> 01:11.560] Just put the dog around him.
|
| [01:11.560 --> 01:14.120] Put him on a trailer and walk him.
|
| [01:14.120 --> 01:18.360] And then probably somebody else to
| pick him up.
|
| [01:18.360 --> 01:29.960] Everybody's doing it north, but he's
| doing it.
| cercatrova wrote:
| Wow that is incredibly impressive. At 0:53 is it translating
| as well? Didn't sound like English to me.
| mod wrote:
| Those are Irish.
| biggerChris wrote:
| We have reached sentient mode.
| dom96 wrote:
| This really makes me want to build an Amazon Echo/Google Nest/etc.
| replacement that's open hardware, open source, and most
| importantly recognises voice completely offline. I find that I
| don't use these smart devices for much more than setting timers
| anyway, so this seems like an easy project.
|
| I just wonder what system requirements Whisper has and whether
| there are open source voice recognition models that are
| specifically built for embedded devices.
| solarkraft wrote:
| Are you thinking about reimplementing Mycroft?
|
| The Mycroft project has done a lot of cool and important work in
| the field to ship an actual personal assistant product (stuff
| like wake word detection).
| dom96 wrote:
| hah, of course someone had the idea already and executed on
| it. But yeah, basically that but without the screen (probably
| would go a long way to decrease the cost, $299 is pretty
| steep for such a device)
| suyash wrote:
| This is only one side of the coin; you still need really good
| models for speech synthesis, and then it all needs to work in
| almost real time, ideally locally on device.
| ricopags wrote:
| As far as TTS goes, Mycroft.ai[0] has released a decent
| offline one.
|
| [0]https://mycroft.ai/
| MacsHeadroom wrote:
| I really want all this too. The smallest model is ~80mb and the
| largest is 3gb. Not sure about system requirements yet; but
| models that small suggest this may be doable locally on a
| single board computer.
|
| Edit: According to this comment[0] the base model runs in real
| time on an M1 CPU. The tiny model apparently decodes an audio
| file twice as fast. These are promising results.
|
| [0] https://news.ycombinator.com/item?id=32927360#32929739
| dom96 wrote:
| I'd be interested to see how well it performs on something
| like an RPi. M1 is pretty beefy.
| TOMDM wrote:
| Given how robust it seems to be with fast speech, I wonder if you
| could save cycles by speeding up the audio before feeding it in.
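| That's easy to try with ffmpeg's atempo filter before handing the
| file to Whisper (a sketch; 1.5x is an arbitrary factor, and
| accuracy presumably degrades at some point):
|     ffmpeg -i input.wav -filter:a "atempo=1.5" faster.wav
|     whisper faster.wav --model base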
| eatsyourtacos wrote:
| Can this be used as a real-time transcription or is it too slow
| for that?
|
| Curious what anyone is using these days for a real-time
| transcription. It doesn't have to be perfect, but just good
| enough.
|
| My kids watch some YouTube videos where people make a mod that
| converts their speech to text, then looks for keywords and spawns
| a boss in Terraria if you say the wrong keyword, etc.
|
| I made a clone of that with the .NET System.Speech.Recognition
| library. It... works... but my biggest problem is that #1 it waits
| until you are done speaking to convert to text in the callback, so
| there was too much of a delay for it to be fun (the point is that
| it will be checking a stream of chatter), and #2 the recognition
| is pretty crap. I mean it's nearly good enough for my silly
| purpose, but it's still pretty bad.
| blueberrychpstx wrote:
| If your family uses Apple devices, Apple offers free on-device
| speech recognition. Only caveat is that it needs to be
| restarted every minute due to whatever stupid limitation (or
| bug) they've introduced.
|
| https://developer.apple.com/documentation/speech/recognizing...
|
| Also, see `requiresOnDeviceRecognition`
| [deleted]
| [deleted]
| nshm wrote:
| Try https://github.com/alphacep/vosk-
| api/blob/master/csharp/demo...
| whimsicalism wrote:
| It might require too much work for what you are looking for,
| but the wav2letter library is the best real-time transcription
| OSS I have found by a considerable margin.
| davidzweig wrote:
| Out of interest, did you try Nemo?
| https://github.com/NVIDIA/NeMo
| whimsicalism wrote:
| No. I don't think it had streaming capabilities when I was doing
| this test two years ago, although I see it does now.
| TaylorAlexander wrote:
| The base model seems to run faster than real time on my
| machine. The "medium" model is larger and runs more slowly -
| roughly real time or maybe slightly slower.
| suyash wrote:
| Depends on whether you're trying to run it offline or in the cloud.
| tgtweak wrote:
| Good to see them releasing model weights - hopefully now that
| Stable Diffusion is out they will release Dall-E 2 source and
| weights as well.
| knaik94 wrote:
| I got some super weird results with the 'medium' model and
| language Japanese (with --task translate). The song is False
| Sympathy by Mondo Grosso.
|
| "[01:17.000 --> 01:32.000] Translated by Releska" appears when
| using the translate-to-English task. That entire part of the song
| is instrumental. This line does not appear at all in the original
| transcription, only in the opus-format rip.
|
| It shows up in the yt rip in format 251 (opus), but not in format
| 140 (aac from youtube), nor the flac rip. All three are giving
| different results.
|
| The translation quality is tied to bitrate. Same song converted
| to different words, the only difference being bitrates and
| formats. Converting my own rip with the same parameters as yt
| (opus @140 and then @130) didn't allow me to reproduce this
| error.
|
| The model hung for a solid extra minute at the end when
| translating to English; the last 90-ish seconds of the song took
| 60 real-time seconds, while the entire rest took about 90. The
| same behavior was not observed when transcribing.
|
| Some of the English words are incorrect, but that was expected.
| The first Japanese "mistake" I found was 「全ては二人の」 instead of
| 「すべてはふたりの」, with the left being what Whisper wrote. A single
| random word "hey" was transcribed/translated to English even
| though it's the singer elongating the 園 while singing 楽園:
| 「落ちてゆく二人で繋がれた二人のらく HEY」 instead of
| 「落ちていく鎖でつながれた二人の楽園」.
|
| I am using the official subtitles released on the youtube video.
|
| It's a complex song with both Japanese and English, and the
| original transcription took about 20 real-time seconds to produce
| the first line and 130 seconds for the whole song. It seems to
| show results in 20-second window increments, but this seems to
| depend on what it considers audio and what it throws away.
|
| On my computer I wasn't able to use the large model because I ran
| out of VRAM (I have 8 GB; not sure how much more it'd require), so
| I ran it with medium.
|
| The song is False Sympathy by Mondo Grosso. The mv is suggestive,
| in case that matters. I grabbed a fresh audio rip from Youtube
| because I didn't want to take it out of my cd case.
|
| https://www.youtube.com/watch?v=B6Y-WsgpzlQ
|
| It is translating this version differently from the director's
| cut version. I ripped both as opus.
|
| There is something weird about how it is handling the opus
| encoded version, as I find the same "Translated by Releska" in a
| wav version transcoded from the opus.
| amrrs wrote:
| Here's a live demo on Hugging Face Spaces if you want to try -
| https://huggingface.co/spaces/Amrrs/openai-whisper-live-tran...
| clemnt wrote:
| this is amazing! got it working in French too
| TaylorAlexander wrote:
| Hey this looks great! I like to record audio notes while driving
| in my car after work, to kind of decompress my thoughts from the
| day. But I never go back and listen as they can be long and
| meandering. Sometimes in the audio log I will sum up my thoughts,
| but this might be 20 minutes in and hard to find. I really wish I
| had transcriptions so I could easily scan the full contents. I
| have tried Mozilla Deepspeech (I don't want a cloud solution) and
| I was surprised to find that I could not get Deepspeech to
| reliably transcribe them. There is a bit of road noise, though I
| think for a human listener they are easy to understand. It looks
| like this one might actually do the trick!
|
| EDIT: Tried it and it worked great! It is very easy to use. I
| just did the pip install line in the readme and was ready to go.
| You literally just run the one pip install line, and then you run
| the program in the format "whisper my_audio.wav" and it goes.
| Really nice job OpenAI!
| zhynn wrote:
| I do this too! I have been doing it for about a year now, and
| haven't ever run into someone else that does this kind of
| audio-journaling. Would you be up for comparing notes sometime
| about how it is working out for you? I am finding that it is an
| extremely effective form of self-care, but with lots of
| personal caveats. I would be so interested to hear your
| experience.
| blueberrychpstx wrote:
| Count me in!! Working on tools actually to turn these
| transcriptions into something more social
| tekacs wrote:
| I do this too, and I've built some software for it just for
| myself.
|
| I'd love to chat and hear about how you use this! My email is
| in my profile, or I'm @tekacs on Twitter (and everywhere). :)
| TaylorAlexander wrote:
| Oh cool! Yeah I have stopped doing it lately as I was not
| really using them (I would like to use them for making rough
| notes for future youtube video scripts), though in general it
| does seem like good self care too even if I don't review
| them. That said I just tried the base model on one of my
| voice logs and it was pretty good! Trying the medium model
| now and it seems basically perfect. So I will have to start
| doing these logs more!
|
| Anyway I am pretty terrible with email but short exchanges
| can work for me, or maybe we can connect over signal. Send me
| a message to my email in my profile and I would be happy to
| sync up!
| Snitch-Thursday wrote:
| Google's recorder app for android will let you record audio
| files and make some transcriptions, right on the device.
| Tenoke wrote:
| I just tested it and it was pretty mediocre at least with my
| accent. I can definitely benefit from a decent app for quick
| note recording with a button press->transcribe->upload to
| gdrive/good UI app for later grepping.
| TaylorAlexander wrote:
| Was this with the default base model, or the medium or
| large model? This can be specified with the --model flag.
| Tenoke wrote:
| I meant the 'Google's recorder app' from the parent
| comment and not Whisper.
| capableweb wrote:
| Is that application actually doing on-device transcription?
| Under "Data safety" on the Google Play page it says "This app
| may share these data types with third parties: Audio" which
| doesn't exactly instill confidence that my audio will 100%
| always stay on my device. It also says "Data is encrypted in
| transit" but if data stays on the device, why it has to be
| "encrypted in transit"? There should be no transit at all.
| petercooper wrote:
| I'll probably explore using this, but I've used an app called
| Just Press Record to do what you say. Runs on Apple Watch too,
| so you can tap a complication at any time in the day, speak,
| and you get a transcript on your phone, etc.
| anigbrowl wrote:
| Oh nice - I have an immediate use case for this. This looks
| accessible enough that the sci-fi dream of instantaneous audio
| translation is suddenly within reach.
| petercooper wrote:
| Just tested this on some developer podcasts which usually fail
| hard given they're full of technical jargon, brand names, etc.
| Whisper is a revolution! It's picking up terms like Heroku,
| DigitalOcean, GitHub, ECS, AWS, etc. and capitalizing properly -
| something nothing else did unless you provided a whole pile of
| guiding vocabulary.
| ma2rten wrote:
| Did these podcasts have transcripts? You might be inadvertently
| evaluating it on data that it was trained on, which is
| basically cheating. Even if not, it might be trained on similar
| podcasts. Judging how good these kinds of models are is really
| hard.
| WiSaGaN wrote:
| True. The test should only be done on the material released
| _after_ the model.
| Jnr wrote:
| Cool!
|
| I am one of the top contributors to the tiny Mozilla Common Voice
| data-set for my language. The data-set is very small compared to
| those for other popular languages, and none of the other data-sets
| mentioned contribute anything in that language to Whisper's
| training.
|
| And even with so little data to train on, it still works
| surprisingly well.
| jdmoreira wrote:
| Looking forward to seeing if this works well with foreign accents
| mminer237 wrote:
| They have an example in the post with a very thick Scottish
| accent. You should listen to it. It's pretty impressive.
| localy wrote:
| Are there any published benchmarks available outlining how this
| compares to other open source ASR software, such as Coqui.ai?
| bickett wrote:
| Hard to keep up with all the great things. The AI community is
| really moving quickly right now.
| aidenn0 wrote:
| For those on NixOS, here's a quick and dirty flake.nix that will
| let you make a venv in which to "pip install".
|
| Just put it in a flake.nix, and "nix develop" followed by
| "virtualenv ./venv; . ./venv/bin/activate; pip install
| git+https://github.com/openai/whisper.git":
|     {
|       description = "Python 3.9 development environment";
|       outputs = { self, nixpkgs }:
|         let
|           system = "x86_64-linux";
|           pkgs = import nixpkgs { inherit system; };
|         in {
|           devShells.${system}.default = pkgs.mkShell {
|             buildInputs = [
|               pkgs.ffmpeg
|               pkgs.python39
|               pkgs.python39Packages.pip
|               pkgs.python39Packages.numpy
|               pkgs.python39Packages.pytorch
|               pkgs.python39Packages.virtualenv
|             ];
|           };
|         };
|     }
| aidenn0 wrote:
| This should, in theory, work with CUDA; my GPU doesn't have
| enough RAM to do it (it runs out at 2.9GiB allocated, I have
| 4GiB, but am running a compositing desktop, which chews up
| about 600MiB; not sure where the other ~400MiB went)
|
| [edit]
|
| I confirmed CUDA worked with the "small" model, which used
| 3.3GB of GPU ram, and resulted in _much_ poorer recognition
| than the "medium" model on my CPU (but it ran at least two
| orders of magnitude faster).
|     {
|       description = "Python 3.9 development environment";
|       outputs = { self, nixpkgs }:
|         let
|           system = "x86_64-linux";
|           pkgs = import nixpkgs {
|             inherit system;
|             config.allowUnfree = true;
|             config.cudaSupport = true;
|           };
|         in {
|           devShells.${system}.default = pkgs.mkShell {
|             buildInputs = with pkgs; [
|               cudatoolkit linuxPackages.nvidia_x11 cudaPackages.cudnn
|               libGLU libGL xorg.libXi xorg.libXmu freeglut
|               xorg.libXext xorg.libX11 xorg.libXv xorg.libXrandr zlib
|               ncurses5 stdenv.cc binutils ffmpeg
|               python39 python39Packages.pip
|               python39Packages.numpy
|               python39Packages.pytorch-bin
|               python39Packages.virtualenv
|             ];
|             shellHook = ''
|               export LD_LIBRARY_PATH="${pkgs.linuxPackages.nvidia_x11}/lib"
|             '';
|           };
|         };
|     }
| magicalhippo wrote:
| CUDA worked fine with large on my 2080Ti FWIW. The speedup is
| ridiculous, as expected. My Ryzen 3800X took almost an hour to
| transcribe a minute's worth of speech, while the 2080Ti does it
| in like 10-20 seconds.
| BasilPH wrote:
| Any opinions on what this means for speech-to-text companies like
| rev.ai and assembly.ai?
|
| We've tested open source solutions for S2T, like Kaldi, but the
| quality was not good enough. However, one of the main advantages
| of a service like assembly.ai to me was that they offer sentence
| splitting in the form of punctuation, plus speaker detection,
| which Kaldi does not.
|
| So I guess I answered my own question to some degree: an S2T
| service is more than just S2T. We already see assembly.ai add
| more and more features (like summarisation, PII redaction, etc.)
| that are a value-add to plain S2T.
|
| Still, curious to hear what your take on this is.
| nshm wrote:
| You can apply the public punctuation model from Vosk on top of
| Kaldi output, and you can also get speaker labels with existing
| open source software.
|
| On a quick video transcription test this model is more accurate
| than AssemblyAI and Rev AI. It will be harder for them to sell
| pure ASR now. Some more business-oriented applications will still
| be important though, for example ASR as part of a call-center
| analytics solution or as part of a medical ERP system.
|
| The value of automatic summarization is small; without AI it is
| very hard to get right, and you need to be an expert in the field
| to understand what is important.
| adeptima wrote:
| Japanese results look pretty impressive!
|
| Took this video, "14 sperm whales washed up on a beach -
| Australia (September 21, 2022)":
| https://www.youtube.com/watch?v=bZkNIzeRBk4
|
| Extracted audio with youtube-dl -f bestaudio
| https://www.youtube.com/watch\?v\=bZkNIzeRBk4
|
| Converted into [00:00.000 --> 00:13.000] osutorariaNan Bu noDao
| de, Zhen tsuXiang kuzira14Dong gaHai An niDa chiShang gerareteSi
| ndeirunogaJian tsukari, Zhuan Men Jia gaDiao Cha notameYuan Di Ru
| rishimashita. [00:13.000 --> 00:25.000] Yuan Di
| medeianiyorimasuto, osutorariaNan Bu nokinguDong de, 19Ri , Shao
| nakutomo14Dong noZhen tsuXiang kuziragaHai An niDa chiShang
| gerareteSi ndeirunogaJian tsukarimashita. [00:25.000 -->
| 00:31.000] hotondogaRuo iosutowoJian rare, Zhuan Men Jia gaXian
| Chang niZhong mukiDiao Cha niDang tatsuteimasu. [00:31.000 -->
| 00:41.000] kuziranoSi Hai haDa kikuYun ndariMai
| metarisurukotogaNan shiitame, Zi Ran niFen Jie sarerunowoDai
| tsuFang Zhen gaJian Tao sareteimasu. [00:41.000 --> 00:52.000]
| mata, Si Hai woJu i, samegaHai niJi maruKe Neng Xing
| gaarutoshite, Yuan Di Dong Ju hasahuanadoniZhou Wei niJin
| dukanaiyouniHu bikaketeimasu. [00:52.000 --> 01:02.000] Yi Fang
| , 21Ri nihatasumaniaDong deoyoso230Dong nokuziragaBang Bian niDa
| chiShang geraretaZhuang Tai deJian tsukarimashita. [01:02.000
| --> 01:07.000] oyosoBan Shu gamadaSheng kiteiruMo Yang deJi Zhu
| Huo Dong gaJin merareteimasu. [01:07.000 --> 01:23.000] Jian
| tsukatsutanoha, gondokuziranoZhong Jian toJian rareteimasu.
| knaik94 wrote:
| Did you try translating them to english? I want to see if you
| get a similar error as me with a random phrase "Translated by
| Releska" showing up.
| gzer0 wrote:
| Shocked at how good the results are, and how easy of an
| installation it is.
|
| Here are the exact steps to follow to get it running on Ubuntu
| 22.04 via WSL and yt-dlp:
|     1. pip install git+https://github.com/openai/whisper.git
|     2. yt-dlp -f 'ba' -x --audio-format mp3 https://www.youtube.com/watch/?v\=bZkNIzeRBk4
|     3. renamed the file to test.mp3
|     4. whisper test.mp3 --language Japanese --task translate --model large
|
| Note: the large model will download a ~3Gb file
| tullie wrote:
| Great to see OpenAI finally being open :)
| nicholasjarnold wrote:
| This is so cool! I was just speaking to a non-technical family
| member about privacy concerns around using "OK Google" and the
| like. They responded inquiring about "private" alternatives, to
| which my answer was "I'm not aware of good ones that give you
| that level of accuracy and convenience."
|
| Perhaps this development along with continued optimization and
| device compute power increases will lead us into a near-future
| where things like Mycroft devices and cellphones could have
| local-only speech-to-text and translation capabilities which are
| accurate even with environmental background noise variations
| encountered IRL.
|
| Great work OpenAI team!
| mwlp wrote:
| Super impressive. I tested it on a Japanese streamer whose
| enunciation isn't exactly perfect and it did a decent job:
| https://www.youtube.com/watch?v=ROiOU1scaNA
| [00:00.000 --> 00:06.500] Since the last one started, the number of times I've eaten has decreased.
| [00:06.500 --> 00:11.000] If I get too carried away with the last one, I'll get hungry and do it.
| [00:11.000 --> 00:14.500] I don't have time to eat.
| [00:15.500 --> 00:18.000] I'm going to eat now.
| [00:20.000 --> 00:23.000] It's going to take about 10 minutes from here.
| [00:23.000 --> 00:31.000] It's been a while since I've had my last meal.
| [00:31.000 --> 00:36.000] I feel like I'm losing my 女子力.
| [00:36.000 --> 00:39.000] I have to go back to my original self.
| [00:39.000 --> 00:44.000] I have to get ready and go to bed.
| [00:44.000 --> 00:46.000] It's not good.
| [00:46.000 --> 00:51.000] I've been drinking a lot lately, so I'm going home.
| [00:51.000 --> 00:53.000] I have to get my nails done this fall.
| [00:53.000 --> 00:54.000] Halloween nails.
| [00:54.000 --> 00:57.000] Halloween, Halloween, Halloween.
| [00:57.000 --> 00:59.000] I'm going to the beauty salon today.
| [00:59.000 --> 01:02.000] I'm going to get my nails done the day after tomorrow.
| [01:02.000 --> 01:10.000] I used to look at a lot of clothes, but I stopped looking at them.
| [01:10.000 --> 01:12.000] I'm going crazy.
| [01:12.000 --> 01:22.000] My stomach's stopped in the middle of summer.
| adeptima wrote:
| Translation is not the strongest part; transcription looks very
| good.
| magicalhippo wrote:
| It's struggling with Norwegian. Which I guess isn't shocking.
| The large model performs a fair bit better than the small,
| though neither is "good".
|
| Though I assume the amount of Norwegian it has been exposed to
| is fairly limited, so in that light I'm actually impressed as
| well.
|
| I tried it on a news segment from the radio[1], this is the
| large model output: [00:14.000 --> 00:17.200]
| En skamlos krenking av FN pakten. [00:17.200 -->
| 00:24.000] USAs president og verdensledere svarer pa den
| russiske presidentens atomtrusler og krigsmobilisering.
| [00:25.500 --> 00:29.400] Arbeidsklaer som er ment til a vaere
| til begge kjonn, har det med a vaere tilpasset.
| [00:29.400 --> 00:33.400] Men hvordan ville det gatt, om det
| var motsatt? [00:34.100 --> 00:38.900]
| Dyrevernsorganisasjon vil ha digital merking av regnstyr,
| [00:38.900 --> 00:44.900] men naeringen selv insisterer pa den
| gamle tradisjonsrike maten med rissing av kniv.
| [00:45.600 --> 00:51.400] Mange stromselskaper er positive til
| a tilby kundene fastpris pa strom, og det arevis.
| [00:51.400 --> 00:59.900] Da risikerer de a matte betale mye i
| nettopp aretsvis, sier aktorer som aldri tilbyr fastpris.
| [00:59.900 --> 01:21.900] Dette er onsdagens Dagsnytten. Jeg
| heter Espen As.
|
| For reference, here's what he actually said, from the source[1]
| itself: * En skamlos krenking av FN-pakten.
| USAs president og verdensledere svarer pa den russiske
| presidentens atomtrusler og krigsmobilisering. *
| Arbeidsklaer som er ment a vaere til begge kjonn, er som regel
| tilpasset ... menn. Hvordan hadde det gatt om det var motsatt?
| * Dyrevernsoganisasjon vil ha digital merking av reinsdyr, men
| naeringen selv insisterer pa den gamle tradisjonsrike maten med
| rissing av kniv. * Mange stromselskaper er positive til
| a tilby kundene fastpris pa strom - og det i arevis. -
| Da risikerer de a matte betale mye i nettopp; arevis, sier
| aktor som aldri tilbyr fastpris Dette er onsdagens
| Dagsnytt 18 - jeg heter Espen Aas.
|
| The translation didn't fare that well though:
| [00:14.000 --> 00:17.000] A shameless violation of the UN
| treaty. [00:17.000 --> 00:24.000] The US president and
| world leaders respond to the Russian president's nuclear
| threats and war mobilization. [00:24.000 --> 00:33.000]
| Work clothes that are meant to be for both genders have to be
| suitable, but how would it be if it was the other way around?
| [00:34.000 --> 00:44.000] The animal welfare organization will
| have a digital marking of reindeer, but the industry itself
| insists on the old traditional way of tearing a knife.
| [00:45.000 --> 00:51.000] Many electricity companies are
| positive in offering customers fixed electricity prices, and
| that is annual. [00:51.000 --> 00:58.000] Then they
| risk having to pay a lot in just a year, says an actor who has
| never offered fixed prices. [00:58.000 --> 01:20.000]
| This is Wednesday's Dagsnytt 18. My name is Espen As.
|
| For reference, here's Google Translate's attempt, which is
| pretty good: * A shameless violation of the
| UN Charter. The US president and world leaders respond to the
| Russian president's nuclear threats and war mobilization.
| * Work clothes intended for both sexes are usually adapted to
| ... men. How would it have gone if it had been the other way
| around? * Animal welfare organizations want digital
| marking of reindeer, but the industry itself insists on the
| old, traditional way of marking with a knife. * Many
| electricity companies are positive about offering customers a
| fixed price for electricity - and for years. - Then
| they risk having to pay a lot in precisely; for years, says a
| player who never offers a fixed price This is
| Wednesday's Dagsnytt 18 - my name is Espen Aas.
|
| [1]:
| https://radio.nrk.no/podkast/dagsnytt_atten/l_5ce3e323-97a3-...
| (not sure if it's available outside of Norway)
| kiwih wrote:
| Given this, are there good (and available/open source) models for
| text to speech? Last time I tried everything still sounded
| extremely robotic, and/or were a pain to set up and run. It would
| be fun to set up a pipeline where the two processes
| 'communicate'.
| obscur wrote:
| Measuring performance in rounds of successful Chinese whisper
|
| (irony)
| pen2l wrote:
| Neat, https://github.com/openai/whisper - they have open-sourced
| it, even the model weights, so they are living up to their name
| in this instance.
|
| The 4 examples are stunningly good (the examples have speakers
| with heavy accents, speaking in foreign language, speaking with
| dynamic background noise, etc.), this is far and away better than
| anything else I've seen. Will be super curious to see other folks
| trying it out and seeing if it's as robust as it seems, including
| when confronted with audio speech with natural tics and uhhh's
| and uhmm's and everything in-between.
|
| I think it's fair to say that AI-transcription accuracy is now
| decidedly superior to the average human's; what the implications
| of this are, I'm not sure.
| anigbrowl wrote:
| It was already better. I edit a podcast and have > a decade of
| pro audio editing experience in the film industry, and I was
| already using a commercial AI transcription service to render
| the content to text and sometimes edit it as such (outputting
| edited audio).
|
| Existing (and affordable) offerings are so good that they can
| cope with shitty recordings off a phone speaker and maintain
| ~97% accuracy over hour-long conversations. I'm sure it's been
| an absolute godsend for law enforcement and other people who need
| to gather poor-quality audio at scale, though much less great
| for the targets of repressive authority.
|
| Having this fully open is a big deal though - now that level of
| transcription ability can be wrapped as an audio plugin and
| just used wherever. Given the parallel advances in resynthesis
| and understanding idiomatic speech, in a year or two I probably
| won't need to cut out all those _uuh like um y 'know_ by hand
| ever again, and every recording can be given a noise reduction
| bath and come out sounding like it was recorded in a room full
| of soft furniture.
| adamgordonbell wrote:
| I've not found that to be the case.
|
| For technical content, I use Rev.com and provide a glossary
| and real humans do the transcript. Other AI transcription
| services get lots wrong because the context often matters.
| Words like "TCP/IP" or "FAT disk format" or "Big Endian" I've
| never found AI so far to handle well.
|
| I'm interested to test out whisper on this one.
|
| https://corecursive.com/063-apple-2001/
| deegles wrote:
| There's already software that can imitate a person's voice,
| so we have all the pieces already to do speech-to-text, clean
| up with GPT-3, and back to text-to-speech in the original
| person's voice. Maybe with a style transfer to keep the
| person's inflections etc the same?
| Karuma wrote:
| I think something similar already exists. See this, for
| example: https://koe.ai/recast/
|
| Although I don't know if they're using anything similar to
| what you suggest. Very cool idea, anyway!
| biomcgary wrote:
| Since you work on podcasts, do any open source transcription
| tools currently identify the speaker in the output? This
| would be particularly helpful for interviews.
| solarmist wrote:
| Any recommendations for particular services?
| anigbrowl wrote:
| I use a service called sonix.ai. It's paid but I think they
| have a free tier or trial period, and it's not very
| expensive. I'm excited about this new OpenAI thing because
| I'd rather do it on my own hardware than send it to the
| cloud, but this company has earned its commercial success.
| solarmist wrote:
| That is an exciting possibility. Being able to fix bad setups
| and missed takes automagically. It's always been possible,
| just expensive and time consuming for moderate improvements.
| thfuran wrote:
| >~97% accuracy over hour-long conversations. I'm sure it's
| been an absolute godsend for law enforcement
|
| 97% accuracy means roughly three or four errors per minute of
| speech. That seems potentially extremely problematic for
| something like law enforcement use where decisions with
| significant impact on people's day and/or life might be made
| on the basis of "evidence".
| gs17 wrote:
| Yeah, I tried to use automated transcription for a research
| project and we had to do it all manually because the few
| errors (I would say it did pretty well given our recording
| quality) were often dropping words like "not", which
| changed the whole meaning of a sentence! It was a useful
| assistance during transcription, but I really hope they
| would verify it was correct before arresting anyone based
| on it.
| anigbrowl wrote:
| No it isn't. That just means 2-3% of your content needs to
| be double-checked by a person at the audio level, saving
| huge amounts of time - equally true of human transcription,
| in which individual words are often [UNINTELLIGIBLE].
|
| Would you want to review this fully before going into
| court, absolutely - because you'd want to play the
| recording to a jury for emotional impact. Can you rely on
| it when you want to quickly read through hours of
| conversation and make decisions about whether to invest
| further resources (which might just mean another hour of
| listening back to the original audio)? Also absolutely.
| Bear in mind that a lot of these errors have little to no
| semantic impact, being on the same level as typos or
| misspellings in a written communication.
|
| Bear in mind too that if law enforcement (honest or not) is
| so interested in you that they're willing to record your
| conversations, your day is already ruined, you just don't
| know it yet. The change here is one of scale rather than
| quality.
| wging wrote:
| Doesn't it mean 100% of your content needs to be double-
| checked? You can't easily identify which 2-3% of your
| content has errors. I'm aware that errors are more likely
| when the model is less confident of its predictions, but
| that shouldn't be enough.
|
| (edit for clarification: errors are not always something
| like "[UNINTELLIGIBLE]", where the system knows it
| doesn't know; they can also be misrecognitions that the
| system believes in with high confidence.)
| woah wrote:
| You double check things that you think are important, in
| this case, passages that will be used as evidence in
| court.
| guelo wrote:
| Maybe you could run the text through a grammar checker to
| identify the errors.
| anigbrowl wrote:
| By the time you're prosecuting someone in court, yes of
| course you double, triple, quadruple check everything.
| That's why lawyers get paid the big bucks (for now...).
| But yes you can identify which content probably has
| errors and flag it as such.
|
| Look, I have decades of experience dealing with human
| speech, and not just as an editor - I can trace the human
| voice from neural impulses in Broca's region through the
| physiology of vocal production, mechanical transduction
| into electrical signals, discrete fourier transforms of
| the resultant waveforms into spectral information and
| back again, the reproduction of altered signals from
| time-aligned speakers to create a sense of
| spatialization, how those are processed in the human ear,
| and how the cilia are connected by nerves back to your
| brain. I'm a good enough editor that I can recognize many
| short words by sight of a waveform, or make 10 edits in a
| row by sight and know it will sound good on playback.
|
| So when I say that machine transcription is as good as
| human realtime transcription now, I say so with the clear
| expectation that those decades of craft are very close to
| being rendered obsolete. I absolutely expect to hand off
| the mechanical part of editing to a machine within 2
| years or so. It's already at the stage where I edit some
| interviews as text, like in a word processor, and then
| export the edited document as audio and it's Good Enough
| - not for every speaker, but more than half the time.
|
| NPR and a lot of commercial broadcasters cut their
| material this way already, because you can get the same
| result from 30 minutes of reading and text editing that
| would require 3 hours of pure audio editing with no
| transcription.
| etienne618 wrote:
| Presumably you can use the 97% that is correctly
| transcribed to rapidly filter out the relevant content.
| This is likely to be only a small portion of the total
| content. Then you check 100% of that.
| datalopers wrote:
| If you know which 2-3% are the false positives, you have
| a very lucrative business model.
| MonkeyMalarky wrote:
| When doing validation, I find it will often be the same
| errors repeated again and again in a transcription. Like
| it will fail on someone's or something's name (that is
| rare / unique) and map it onto a known similar sounding
| word.
| gnramires wrote:
| I think an [UNINTELLIGIBLE] indication would be a great
| addition to automatic transcription systems.
| inanutshellus wrote:
| It'd [UNINTELLIGIBLE score="92%" alternatives="pro-
| rabble; pourable"]probably[/UNINTELLIGIBLE] be useful to
| make a markup-based output... though you'd probably find
| it gave you more info than you wanted.
| anigbrowl wrote:
| It already exists. The commercial product I use most is
| called sonix.ai and I think they have a free tier or
| trial period. It has shortcomings but it's shockingly
| good, despite having some limitations.
| thfuran wrote:
| >equally true of human transcription, in which individual
| words are often [UNINTELLIGIBLE].
|
| ML systems somewhat notoriously do not necessarily make
| the same sorts of errors that a human would. And I'd
| expect a large portion of the errors to be transcribing
| the wrong words rather than indicating that a word
| couldn't be transcribed. That sort of error means that
| you can't really get away with manually reviewing just 3%
| of the audio.
| golem14 wrote:
| One would think that the few crucial bits of information
| gleaned are listened to manually, and the machine
| transcription is not the only thing the judge or a jury sees.
| thfuran wrote:
| You have absolutely ruined someone's day way before
| they're sitting in front of a jury.
| formerly_proven wrote:
| Stuff like that is a very good tell that someone has zero
| experience with law enforcement.
| j-krieger wrote:
| I've worked with similar technology in the law enforcement
| space and the software is never used to make decisions. You
| can make out critical timestamps in conversations and a law
| enforcement officer will always manually confirm the
| software's assessments.
| JohnFen wrote:
| Given that law enforcement has made similar claims about
| technology use in the past that turned out to be false, I
| have no faith in this claim.
| hadlock wrote:
| Microsoft announced their voice transcription technology a
| couple years ago and were also touting ~97-98% accuracy
| which was actually _better_ than human transcription error
| rates. The errors are usually in part people garbling their
| own speech, or they move their head while talking and the
| microphone misses a syllable. Anything in that error bar
| would probably fall under "reasonable doubt"
| kyriakos wrote:
| If it's anything like Microsoft Teams transcription, I
| doubt the 97%+ accuracy.
| soheil wrote:
| Their name reminds me of the company McDonald's uses to supply
| their beef called 100% Pure Beef Inc. so they can say 100% Pure
| Beef on their menu.
| space_fountain wrote:
| This seems not to be true for McDonald's:
| https://www.snopes.com/fact-check/mcdonalds-100-beef/
| soheil wrote:
| This article seems very suspect to me. This is the main
| reason they assert why the claim is false:
|
| "While this is a fascinating premise, there's nothing to
| it: McDonald's hamburger patties in the U.S. are made with
| 100% USDA-inspected beef. They are cooked and prepared with
| salt, pepper and nothing else; no preservatives, no
| fillers.
|
| McDonald's of Australia's "Make Up Your Own Mind" web site
| said the following of the rumor in its Top FAQs section:
| Is it true that McDonald's created a company called "100%
| Australian Beef" just so they can say that in their
| advertising? No."
|
| So if I'm McDonald's and want to squash a negative story
| why not throw a few bucks at the pinnacle of journalism
| that is Snopes? (formerly Urban Legends Reference Pages)
| space_fountain wrote:
| This isn't exactly a hard story to fact-check. There is zero
| evidence for this in either the reddit thread or really
| anywhere. If they were willing to lie about the company
| name, why not just lie about the beef in their burgers? It
| would be equally scandalous.
| soheil wrote:
| The company name could be 100% legit; there is nothing
| stopping you from forming a company with that name and
| not even selling beef.
| sam_goody wrote:
| It definitely happens.
|
| There are at least two companies that have branded [..]
| Kosher Gelatin(tm). One of them makes gelatin that is
| considered non-kosher by all of the major kashrus
| agencies.
|
| "Kosher Gelatin(r)", when in the ingredients, just means
| the product contains pork.
| jsight wrote:
| You are right, it could be. The problem is that it's the
| kind of thing that would be almost impossible to disprove
| if it were false. So you can always raise doubts about a
| supposed disproof.
|
| But it'd be really easy to prove if it were true and
| no one has offered proof. And there've been plenty of
| people who've looked for such proof, afaict.
|
| My default assumption in such cases is that it is likely
| false.
| jefftk wrote:
| If this was more than an urban legend someone would be
| able to dig up a company with this name and some
| indication that McD was working with them.
| pessimizer wrote:
| Something being possible to do isn't enough evidence for
| rational people to believe that it happened. From my
| perspective, it's possible that you're Iron Mike Tyson,
| or that you died after your last comment and this one was
| posted by the assassin who killed you.
| soheil wrote:
| What? I never said it's evidence that it did happen,
| please don't make things up. I just pointed out the
| evidence provided to refute the claim is possibly
| invalid.
| pessimizer wrote:
| You haven't offered any evidence is the point.
| [deleted]
| whichfawkes wrote:
| In the US, for a while I remember we had billboards
| advertising McDonald's burgers as being "1 <hamburger>
| <hamburger>% beef". Because the hamburgers were of course
| circular, it looked kind of like "100%".
|
| I remember thinking that surely an image of a hamburger
| does not legally constitute a zero.
| leobg wrote:
| Seems like this is an urban legend.
|
| https://www.reddit.com/r/IsItBullshit/comments/2rztov/isitbu.
| ..
| soheil wrote:
| This seems to be primarily based on the referenced Snopes
| article https://news.ycombinator.com/item?id=32929237
| [deleted]
| bambax wrote:
| The French version is a little contrived. The speaker is a
| native speaker, but the text is obviously the result of a
| translation from English to French, not idiomatic French.
|
| I will try to put the code to the test, see how it goes.
| octref wrote:
| I'm interested in building something with this to aid my own
| French learning. Would love to read your findings if you end
| up posting it somewhere like twitter/blog!
| bambax wrote:
| Tried again with Blaise Pascal -- the famous fragment of a
| letter where he says he's sorry he didn't have enough time
| to make it shorter.
|
| Original:
|
| > _Mes reverends peres, mes lettres n'avaient pas accoutume
| de se suivre de si pres, ni d'etre si etendues. Le peu de
| temps que j'ai eu a ete cause de l'un et de l'autre. Je
| n'ai fait celle-ci plus longue que parce que je n'ai pas eu
| le loisir de la faire plus courte. La raison qui m'a oblige
| de me hater vous est mieux connue qu'a moi. Vos reponses
| vous reussissaient mal. Vous avez bien fait de changer de
| methode ; mais je ne sais si vous avez bien choisi, et si
| le monde ne dira pas que vous avez eu peur des
| benedictins._
|
| Transcription:
|
| > Mes reves errent peres, mais l'detre navais pas accoutume
| de se suivre de si pres ni d'detre si etendu. Le peu de
| temps que j'sais eu a ete cause de l'de l'de l'de autre.
| J'sais n'detre plus longue que parce que j'sais pas eu le
| loisir de la faire plus courte. La raison qui m'sa obligee
| de me hater vous est mieux connue qu'moi. Vos reponses vous
| reussissaient mal. Vous avez bien fait de changer de
| methode, mais je ne sais pas si vous avez bien choisi et si
| le monde ne dira pas que vous avez eu peur des benedictes.
|
| Here there are many more mistakes, so many that the
| beginning of the text is unintelligible. The language from
| the 17th century is probably too different. Still on the
| "medium" model, as the large one crashes the Colab (not
| sure how to select a beefier machine.)
|
| Still fascinating and exciting though.
| bambax wrote:
| I'm playing with a Colab posted in this thread
| (https://news.ycombinator.com/item?id=32931349), and it's
| incredibly fun and accurate!
|
| I tried the beginning of L'etranger (because you seem to be
| a fan of Camus ;-)
|
| Here's the original:
|
| > _Aujourd'hui, maman est morte. Ou peut-etre hier, je ne
| sais pas. J'ai recu un telegramme de l'asile : << Mere
| decedee. Enterrement demain. Sentiments distingues. >> Cela
| ne veut rien dire. C'etait peut-etre hier._
|
| > _L'asile de vieillards est a Marengo, a quatre-vingts
| kilometres d'Alger. Je prendrai l'autobus a deux heures et
| j'arriverai dans l'apres-midi. Ainsi, je pourrai veiller et
| je rentrerai demain soir. J'ai demande deux jours de conge
| a mon patron et il ne pouvait pas me les refuser avec une
| excuse pareille. Mais il n'avait pas l'air content. Je lui
| ai meme dit : << Ce n'est pas de ma faute. >> Il n'a pas
| repondu. J'ai pense alors que je n'aurais pas du lui dire
| cela. En somme, je n'avais pas a m'excuser. C'etait plutot
| a lui de me presenter ses condoleances._
|
| Here's the transcription:
|
| > Aujourdhui, maman est morte, peut etre hier, je ne sais
| pas. J''ai recu un telegramme de l''asile. Mere decedee,
| enterrement demain, sentiment distingue. Cela ne veut rien
| dire. C''etait peut etre hier.
|
| > L''asile de Vieillard est a Maringot, a 80 km d''Alger.
| Je prendrai l''autobus a deux heures et j''arriverai dans
| l''apres midi. Ainsi, je pourrai veiller et je rentrerai
| demain soir. J''ai demande deux jours de conge a mon patron
| et il ne pouvait pas me les refuser avec une excuse
| pareille. Mais il n''avait pas l''air content. Je lui ai
| meme dit, ce n''est pas de ma faute. Il n''a pas repondu.
| J''ai alors pense que je n''aurais pas du lui dire cela. En
| somme, je n''avais pas a m''excuser. C''etait plutot a lui
| de me presenter ses condoleances.
|
| Except for the weird double quotes instead of the single
| apostrophe ('), it's close to perfect, and it only uses the
| "medium" model.
|
| This is extremely exciting and fun! Happy to try other
| texts if you have something specific in mind!
| bambax wrote:
| Last try for tonight with Baudelaire.
|
| Original:
|
|     Trois mille six cents fois par heure, la Seconde
|     Chuchote Souviens-toi ! - Rapide, avec sa voix
|     D'insecte, Maintenant dit Je suis Autrefois,
|     Et j'ai pompe ta vie avec ma trompe immonde !
|     Remember ! Souviens-toi ! prodigue ! Esto memor !
|     (Mon gosier de metal parle toutes les langues )
|     Les minutes, mortel folatre, sont des gangues
|     Qu'il ne faut pas lacher sans en extraire l'or !
|
| Transcription:
|
| > Trois mille six cents fois par heure, la seconde chuchote
| << Souviens toi >>, rapide, avec sa voix d''insecte,
| maintenant dit << Je suis autrefois >>, et j''ai pompe ta
| vie avec ma trompe immonde. << Remember, souviens toi,
| prodigue, est au memoire, mon gosier de metal, parle toutes
| les langues, les minutes, mortelles folatres, sont des
| gangs qu''il ne faut pas lacher sans en extraire l''or. >>
|
| Not bad! Far from perfect but it's a difficult text.
| Interesting that it works better with Baudelaire than
| Pascal.
| pen2l wrote:
| Interesting, I'm a non-native French speaker, the original
| French piece struck me as being entirely normal (but maybe it
| was just the perfect French accent that swayed me). Can you
| please point out what he said which wasn't idiomatic or
| naturally-worded French?
| bambax wrote:
| Little details. The second sentence is really bizarre:
|
| > _Nous etablissons que l'utilisation de donnees d'un tel
| nombre et d'une telle diversite est la raison pour laquelle
| le systeme est a meme de comprendre de nombreux accents..._
|
| It doesn't sound natural at all. An idiomatic formulation
| would be more along the lines of:
|
| _Le recours a un corpus [de donnees] si riche et varie est
| ce qui permet au systeme de comprendre de nombreux accents_
| (With 'corpus', 'donnees' is implied.)
|
| Of course this is just an example, and I'm sure other
| French speakers could come up with a different wording, but
| "donnees d'un tel nombre et d'une telle diversite" sounds
| really wrong.
|
| This is also weird and convoluted:
|
| > _Nous distribuons en tant que logiciel libre le code
| source pour nos modeles et pour l'inference, afin que
| ceux-ci puissent servir comme un point de depart pour
| construire des applications utiles_
|
| It should at least be "le code source DE nos modeles" and
| "servir DE point de depart", and "en tant que logiciel
| libre" should placed at the end of the proposition (after
| 'inference').
|
| Also, "construire" isn't used for code but for buildings,
| and "applications utiles" is unusual, because "utiles"
| (useful) is assumed. "...pour le developpement de nouvelles
| applications" would sound more French.
| [deleted]
| _plg_ wrote:
| At the start, the "Nous etablissons" part, for example. You
| wouldn't write that if you were starting scratch from
| French.
| not_math wrote:
| You can see from the transcript where the model made some
| errors, for example:
|
| > We distribute as a free software the source code for our
| models and for the inference [...]
|
| Should be
|
| > We are open-sourcing models and inference code [...]
|
| Another example
|
| > We establish that the use of such a number of data is
| such a diversity and the reason why our system is able
| [...]
|
| Should be
|
| > We show that the use of such a large and diverse dataset
| leads to improved robustness [...]
| Workaccount2 wrote:
| Can't wait to see twelve new $49.99/mo speech parser services
| pop up in the next few weeks.
| suyash wrote:
| More of this is welcome; they should live up to their name and
| original purpose and share other models (code, weights,
| dataset) in the open source community as well.
| Simorgh wrote:
| I've been experimenting with voice-interfaces where typing is
| replaced by talking, but I find it hard to transition users to
| voice - we 'seem' to prefer typing to talking.
|
| I wonder if this will change.
| ironlake wrote:
| Personally, I would rather type than talk when interacting with
| a computer. The only time I use voice interfaces are when the
| physical interface is so poor it's just easier to use voice.
| Apple TV devices are an example of this.
| shpx wrote:
| We shouldn't call this open source. The model definition + the
| data is the source code. The model weights are a compilation
| artifact.
|
| > The source code must be the preferred form in which a
| programmer would modify the program. [...] Intermediate forms
| such as the output of a preprocessor or translator are not
| allowed.
|
| > https://opensource.org/osd
|
| If I asked a programmer from OpenAI to modify the model to better
| support Japanese speakers from Hokkaido, their "preferred form"
| of the model's source code would include the 680,000 hours of
| audio used to train the model.
|
| Yes that means that there are almost no open source models and
| yes it's awesome that they released this and made the weights
| available. Just don't call it open source.
| lfmunoz4 wrote:
| sergiotapia wrote:
| Does this work with multiple speakers?
|
| I want to build a tool that takes a video and generates subtitles
| for it, then I want to index the subtitles and let people search
| for a specific quote to scrub to that part of the video using
| automatically generated urls.
|
| This is for a specific fandom with a ton of content; lots of dirty
| audio mostly recorded in a gym setting with multiple people
| speaking.
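|
| Subtitles themselves are straightforward to get out of the result
| dict. A minimal sketch (assuming the segments list with
| start/end/text fields is accurate enough for your recordings; file
| names are placeholders):
|
|     import whisper
|
|     def srt_time(t: float) -> str:
|         # SRT timestamps look like 00:01:02,345
|         h, rem = divmod(t, 3600)
|         m, s = divmod(rem, 60)
|         return f"{int(h):02d}:{int(m):02d}:{int(s):02d},{int((s % 1) * 1000):03d}"
|
|     model = whisper.load_model("medium")
|     result = model.transcribe("episode.mp4")
|     with open("episode.srt", "w", encoding="utf-8") as f:
|         for i, seg in enumerate(result["segments"], start=1):
|             f.write(f"{i}\n{srt_time(seg['start'])} --> {srt_time(seg['end'])}\n"
|                     f"{seg['text'].strip()}\n\n")
|
| Note that Whisper doesn't do speaker diarization, so multiple
| speakers come out as one undifferentiated stream of text; you'd need
| a separate tool to label who said what.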
| 867-5309 wrote:
| pretty sure such a tool made HN front page a few months ago
| isoprophlex wrote:
| Really incredible to see that their multilingual audio-to-English
| approach is viable. I'm super excited about this, and it's great to
| see OpenAI actually open something up, for once.
|
| Skimming the codebase I can't immediately see code to do
| additional training.
|
| Being able to fine-tune the model to a specific language or case
| (eg. teach it specifically about some technical topic that might
| not be so prevalent in the current train set) would be majorly
| disruptive to current SOTA in "callcenter analytics" tech.
| Especially when combining Whisper with GPT3.
| samstave wrote:
| AI speech recognition FN scares the heck out of me...
|
| for so many reasons.
|
| But one that really pisses me off is not being able to turn it
| off on the iphone, and the fact that aside from "hidden cameras
| in my airBnB" -- soon we will have to worry about secret
| listening machines EVERYWHERE
| jfoster wrote:
| Also, based on their demo, this model seems like it might have
| comprehension well above the level of a typical human.
|
| Anyway, it's out there now. No way to turn back.
| ma2rten wrote:
| We will see an explosion of AI capabilities in the next couple
| of years. This will have a huge impact on our lives, much of it
| good but some of it also bad.
| samstave wrote:
| "Good" for ensuring you're a compliant consumer - bad if
| you're an individual person
| wongarsu wrote:
| "Secret listening machines everywhere" was a pretty big thing
| in East Germany. It's also the central theme of the movie The
| Lives of Others.
|
| Of course, the ability to scale this more cheaply (throwing
| more compute at it, instead of more people) is somewhat scary,
| but it's not really introducing a new capability. Especially
| since you still have to do something with the transcript. An
| AirBnB landlord who reads the transcript of what you said could
| as well have listened to the recording.
| ALittleLight wrote:
| I think it's a new capability to add good speech to text,
| search, and models that can understand and process text. You
| have microphones recording speech everywhere, models turning
| that speech into easily searchable text, and something like
| GPT-3 reading all the speech and raising red flags for any
| transgressive idea you please.
| samstave wrote:
| Yes, and with AI searching for "dissenters" we shall soon
| have "speech police", or tickets, or some other form of
| authoritarian punitive action powered by this.
| zappy42 wrote:
| "John Spartan, you have been fined one credit for
| violation of the Verbal Morality Statute."
| jffry wrote:
| I'd argue that cheap, pervasive, always-on surveillance with
| a backlog of searchable transcriptions is a qualitatively
| different capability.
| samstave wrote:
| Exactly.
|
| We are entering the next era...
|
| The Kurzweil podcast appearance on Lex Fridman is nuts and
| while I love Kurzweil, holy crap even with my dystopian
| outlook he makes it even worse when you listen to even half
| of it...
| gareth_untether wrote:
| I'm thinking of releasing a plugin for Unity that can be
| used to match a phrase to an action. Seeing Whisper is making me
| think I should include a way to use voice and not just text.
| aidenn0 wrote:
| I just threw a random rock MP3 at it, and a first readthrough
| shows no transcription errors; this is quite good.
|
| Now I just want OCR that's even 50% as good as this...
| aidenn0 wrote:
| Ran a few other songs through it and found one obvious
| mistranscription:
|
| "He's the bedroom cosmic rocker" (should be "He's the veteran
| cosmic rocker" in _Veteran Cosmic Rocker_ by The Moody Blues)
|
| I also noticed that it's a little on the conservative side for
| detecting speech; all songs were missing at least part of one
| line.
| funhighway wrote:
| Would be nice to give more details about the provenance and
| construction of the training data.
| [deleted]
| StevenWaterman wrote:
| That example at the top of the page (speed talking) blew me away.
| He started talking, I was stunned for a minute, then realised
| yes, it really was English, and I just burst out laughing.
|
| That's so, so far beyond the previous state-of-the-art, it's
| absurd.
| londons_explore wrote:
| @dang Can we change the link to the github here[1]?
|
| It seems to describe the project better for a technical audience.
|
| [1]: https://github.com/openai/whisper
| toss1 wrote:
| Like every model I've seen there is something like this:
|
| >>A decoder is trained to predict the corresponding text...
|
| Prediction of expected text in the context of the previous text.
|
| While this is valuable in casual transcription, it can be
| extremely dangerous in serious contexts.
|
| From personal experience, having given a deposition with an "AI"
| transcription, it will literally reverse the meanings of
| sentences.
|
| This is because it produces the _EXPECTED_ output in a context,
| and _NOT THE ACTUAL OUTPUT_.
|
| Like a speaker that clips the output, these types of systems
| 'clip' the really valuable information out of a transcription.
| Worse yet, this is a completely silent failure, as the transcript
| _LOOKS_ really good.
|
| Basic info theory shows that there is more information contained
| in 'surprising' chunks of data than in expected ones. These
| systems actively work to substitute 'expected' speech to
| overwrite 'surprising' speech.
|
| The transcript I got was utter trash, multiple pages of errata I
| had to submit when the norm is a couple of lines. And as I
| said, some literally reversed the meaning in a consequential way,
| and yet completely silently.
|
| This kind of silent active failure mode is terrifying. Unless it
| is solved, and I see no way to solve it without removing ALL
| predictive algos from the system, these types of systems must not
| be used in any situation of serious consequence, at least not
| without real redundancy and backup.
| sowbug wrote:
| I knew there was a reason why I kept my MP3 library even after
| subscribing to Spotify. Now piping everything through whisper. So
| far the generated lyrics are reasonable, though it thinks the REM
| song says "Linnie Bruce is not afraid."
|
| No surprise that it appears to have successfully transcribed all
| the recordings of Harvard Sentences I could find.
| https://en.wikipedia.org/wiki/Harvard_sentences
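|
| A batch pass over a library is only a few lines. A minimal sketch
| (model size and paths are whatever suits your collection):
|
|     from pathlib import Path
|     import whisper
|
|     model = whisper.load_model("small.en")
|     for mp3 in Path("~/Music").expanduser().rglob("*.mp3"):
|         txt = mp3.with_suffix(".txt")
|         if txt.exists():
|             continue  # skip tracks that already have a transcript
|         result = model.transcribe(str(mp3))
|         txt.write_text(result["text"], encoding="utf-8")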
| hijp wrote:
| Anyone get it running on m1 mac?
|
| I keep getting `ModuleNotFoundError: No module named
| 'setuptools.command.build'`
| kif wrote:
| I got requirements installed, but then when running the Python
| example, I get:
|
| RuntimeError: "slow_conv2d_cpu" not implemented for 'Half'
| kif wrote:
| Probably need to pass some kind of options when initializing.
| The command itself works fine, just shows a warning:
| warnings.warn("FP16 is not supported on CPU; using FP32
| instead")
| mewse-hn wrote:
| using this in the sample code worked for me:
|
| >>> options = whisper.DecodingOptions(fp16=False)
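|
| For reference, a CPU-friendly version of the README-style snippet
| might look like this (a sketch along the lines of the documented
| example, with half precision disabled):
|
|     import whisper
|
|     model = whisper.load_model("base")
|
|     # load the audio, pad/trim it to 30 seconds, and build the mel spectrogram
|     audio = whisper.load_audio("audio.mp3")
|     audio = whisper.pad_or_trim(audio)
|     mel = whisper.log_mel_spectrogram(audio).to(model.device)
|
|     # decode in FP32 since half precision isn't implemented on CPU
|     options = whisper.DecodingOptions(fp16=False)
|     result = whisper.decode(model, mel, options)
|     print(result.text)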
| dceddia wrote:
| Yep, I had this too. `pip3 install -U pip setuptools` took care
| of it. (If you get an error about pip3, try `pip` instead)
| hijp wrote:
| I'm really new to pip, but does this look ok?
|
| (after running the command for setuptools)
|
|     Defaulting to user installation because normal site-packages is not writeable
|     Requirement already satisfied: pip in /Users/xxx/Library/Python/3.9/lib/python/site-packages (22.2.2)
|     Requirement already satisfied: setuptools in /Users/xxx/Library/Python/3.9/lib/python/site-packages (65.3.0)
|
| ---- after trying whisper installation:
|
|     x Getting requirements to build wheel did not run successfully.
|     | exit code: 1
|     +-> [20 lines of output]
|         Traceback (most recent call last):
|           File "/Users/xxx/Library/Python/3.9/lib/python/site-packages/pip/_vendor/pep517/in_process/_in_process.py", line 363, in <module>
|             main()
|           File "/Users/xxx/Library/Python/3.9/lib/python/site-packages/pip/_vendor/pep517/in_process/_in_process.py", line 345, in main
|             json_out['return_val'] = hook(*hook_input['kwargs'])
|           File "/Users/xxx/Library/Python/3.9/lib/python/site-packages/pip/_vendor/pep517/in_process/_in_process.py", line 130, in get_requires_for_build_wheel
|             return hook(config_settings)
|           File "/Library/Developer/CommandLineTools/Library/Frameworks/Python3.framework/Versions/3.9/lib/python3.9/site-packages/setuptools/build_meta.py", line 154, in get_requires_for_build_wheel
|             return self._get_build_requires(
|           File "/Library/Developer/CommandLineTools/Library/Frameworks/Python3.framework/Versions/3.9/lib/python3.9/site-packages/setuptools/build_meta.py", line 135, in _get_build_requires
|             self.run_setup()
|           File "/Library/Developer/CommandLineTools/Library/Frameworks/Python3.framework/Versions/3.9/lib/python3.9/site-packages/setuptools/build_meta.py", line 150, in run_setup
|             exec(compile(code, __file__, 'exec'), locals())
|           File "setup.py", line 2, in <module>
|             from setuptools_rust import Binding, RustExtension
|           File "/private/var/folders/lj/7x6d3dxd3cbdtt484k6xsmyh0000gn/T/pip-build-env-ieaydl8r/overlay/lib/python3.9/site-packages/setuptools_rust/__init__.py", line 1, in <module>
|             from .build import build_rust
|           File "/private/var/folders/lj/7x6d3dxd3cbdtt484k6xsmyh0000gn/T/pip-build-env-ieaydl8r/overlay/lib/python3.9/site-packages/setuptools_rust/build.py", line 23, in <module>
|             from setuptools.command.build import build as CommandBuild  # type: ignore[import]
|         ModuleNotFoundError: No module named 'setuptools.command.build'
|         [end of output]
|
|     note: This error originates from a subprocess, and is likely not a problem with pip.
|
|     error: subprocess-exited-with-error
| dceddia wrote:
| Nope, that doesn't look good! I honestly just googled the
| error and installing setuptools fixed it for me, but I
| barely know anything about the Python ecosystem so I'm
| really just fumbling around here.
| hijp wrote:
| haha same, thanks
| Smaug123 wrote:
| I'm still not successfully using the GPU, but it's working
| decently quickly (with the base model - it's incredibly slow to
| use the Large model) using just the CPU. I'm going to have to
| check what magic stable-diffusion is doing to enable the GPU :(
| dceddia wrote:
| There's a --device flag you can pass. I've been trying to get
| `--device cuda` to work on my Windows machine and it's saying
| that torch wasn't compiled with CUDA. Trying to figure out
| what's going on there.
|
| And on the M1, supposedly PyTorch has support for hardware
| acceleration using MPS (Metal Performance Shaders, announced
| here https://pytorch.org/blog/introducing-accelerated-
| pytorch-tra...) but when I tried `--device mps` it blew up
| with an error "input types 'tensor<1x1280x3000xf16>' and
| 'tensor<1xf32>' are not broadcast compatible".
| Smaug123 wrote:
| Yep, same for me, on M1 after enabling MPS (with
| `model.to("mps")`) it just either SIGSEGV or SIGABRTs every
| time with that line. The extremely unclean nature of the
| abort is making it hard to debug :(
| dceddia wrote:
| I noticed the size seems to correspond to the model. With
| a large model, the error is tensor<1x1280x3000xf16>. With
| tiny, it's tensor<1x384x3000xf16>, and with medium it's
| tensor<1x1024x3000xf16>. It also seems like a bad thing
| that those are f16's but the "expected" data is f32.
| Smaug123 wrote:
| I'm giving up for the night, but
| https://github.com/Smaug123/whisper/pull/1/files at least
| contains the setup instructions that may help others get
| to this point. Got it working on the GPU, but it's...
| much much slower than the CPU? Presumably due to the
| 'aten::repeat_interleave.self_int' CPU fallback.
|
| Also hitting a nice little PyTorch bug:
|
| > File "/Users/patrick/Documents/GitHub/whisper/whisper/d
| ecoding.py", line 388, in apply logits[:,
| self.tokenizer.encode(" ") + [self.tokenizer.eot]] =
| -np.inf
|
| > RuntimeError: dst_.nbytes() >= dst_byte_offset INTERNAL
| ASSERT FAILED at "/Users/runner/work/pytorch/pytorch/pyto
| rch/aten/src/ATen/native/mps/operations/Copy.mm":200,
| please report a bug to PyTorch.
| nik_s wrote:
| I just tested the model [1] using an RTX3090, trying to translate
| a French text I found here [2].
|
| Some observations:
|
| - The full translation of the 6:22 minute video takes about 22
| seconds (17x real time)
|
| - It recognizes the language by default (and did a good job of
| recognizing it was French audio)
|
| - MIT License [3]!
|
| - The quality of the transcription is good, but not perfect.
|
| - The quality of the translation (if you don't consider
| transcription errors as a translation error) is generally very
| good.
|
| ---
|
| The transcription:
|
| > Bonjour a tous, <error>j'suis</error> espere que vous allez
| bien, c''est ENTI. Et aujourd', <error>aujourd',</error> on se
| retrouve <error>un peu physique</error> pour parler de la termo
| dynamique. Vous ne vous inquietez pas, ca va bien se passer. On
| va y aller ensemble, <error>etre a par exemple,</error> je vous
| accompagne a travers une serie de videos pour vous expliquer les
| principes de base en termo dynamique. Et bah, c''est parti, on va
| y aller tranquillement. Lidee, c''est vous puissiez comprendre la
| termo dynamique dans son ensemble. Donc, je vais vraiment prendre
| mon temps pour <error>couplisser</error> bien comprendre les
| notions,
|
| The translation:
|
| > Hello everyone, I hope you're doing well, it's NT and today we
| find ourselves a little physical to talk about the thermo
| dynamic. Don't worry, it's going well, we're going to go together
| and be the same. I'm going to accompany you through a series of
| videos to explain the basic principles in thermo dynamic. Well,
| let's go, <error>we're going to go quietly</error>. The idea is
| that you can understand the thermo dynamic <error>in sound
| together</error>. So I'm really going to take my time to
| understand the notions,
|
| ---
|
| All in all very happy that OpenAI is publishing their models. If
| Stable Diffusion is any guide, people will hack some crazy things
| with this.
|
| [1] https://github.com/openai/whisper [2]
| https://www.youtube.com/watch?v=OFLt-KL0K7Y [3]
| https://github.com/openai/whisper/blob/main/LICENSE
| seszett wrote:
| > _dans son ensemble_
|
| > _in sound together_
|
| That's hilarious and honestly, incredibly bad. "Dans son
| ensemble" is a very common idiom (meaning "as a whole") while
| "in sound together" has to be pretty rare. "Son" means
| "his/hers/its" as well as "sound", and the former meaning is
| probably more common in general so I have no idea how this
| result could arise.
|
| "Termo" also doesn't exist in French, it's "thermo", so the
| transcript even makes orthographic errors.
|
| And I forgot about "couplisser" which is also a hilarious made-
| up word that sounds like it could mean something, but doesn't!
| _Edit_ Google finds exactly one reference of this, in a patent
| with a typo on the word "coulisser".
|
| I'm still impressed by the transcript quality since it covers
| many languages, but the translation part is quite poor.
| StevenWaterman wrote:
| Was this with the `base` model? `large` is running ok on a P100
| in colab, but is about 4% the speed of `base.en`. Certainly
| seems like some of these models will be fast enough for real-
| time.
| joshcryer wrote:
| It also runs well on a CPU and seems to have proper memory
| management. Wonderful timing because I was using DeepSpeech for
| some audio recordings and it required me to script up a
| splitter to make the files into .wav and then do snippets of 10
| seconds each. Everything about this just works out of the box.
| On a core i5 I'm getting about 30 seconds every minute.
| Transcriptionist jobs just turned into editor jobs. I love how
| it drops the inflections in the audio as well, because it was
| trained on transcription work, and that is one of the first
| things you learn to do (drop the uhs and ums and huhs etc,
| unless it is a strict verbatim transcription).
| solarmist wrote:
| Is it translation or transcription? Or both?
|
| Both, wow. This is really interesting.
| StevenWaterman wrote:
| Both, the blog covers it in detail. Pass in audio in any
| language, and get an English transcription out.
| nik_s wrote:
| It can do both - I've edited my original post to show the
| translation task.
| gok wrote:
| Comparing this model's word error rates to the state of the art
| [1] on a few common test sets:
|                              Whisper    SoTA
|     LibriSpeech test-clean      2.7%    1.8%
|     LibriSpeech test-other      5.6%    2.9%
|     Switchboard                13.1%    4.9%
|     CallHome                   15.8%    9.5%
|
| The authors do explicitly state that they're trying to do a lot
| of fancy new stuff here, like be multilingual, rather than
| pursuing just accuracy.
|
| [1] https://github.com/syhw/wer_are_we
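|
| For anyone wanting to sanity-check numbers like these on their own
| audio, a rough sketch using the jiwer package (the file names are
| placeholders, and this crude normalization is not the papers' exact
| methodology, so treat the result as a ballpark figure):
|
|     import jiwer
|     import whisper
|
|     model = whisper.load_model("base.en")
|     hypothesis = model.transcribe("sample.wav")["text"]
|     reference = open("sample_reference.txt", encoding="utf-8").read()
|
|     # lowercase and strip punctuation so formatting differences don't count as errors
|     def norm(s: str) -> str:
|         return " ".join("".join(c for c in s.lower() if c.isalnum() or c.isspace()).split())
|
|     print("WER:", jiwer.wer(norm(reference), norm(hypothesis)))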
| lunixbochs wrote:
| I suspect Whisper is more robust than other "SOTA" models, but
| this release is likely leaving a fair bit of accuracy on the
| table considering the amount of resources OpenAI is capable of
| throwing at training it.
|
| Comparing the readily available test sets from the paper to
| some of my personal robust models (for the Talon models, this
| is greedy decoding, no language model):
|                          Talon    Talon    Talon    Whisper    wav2vec 2.0
|                          28M      300M     1B       Large      960h
|     librispeech clean    3.21     2.52     2.40     2.7        2.7
|     librispeech other    8.21     6.56     5.63     5.6        6.2
|     common voice         13.88    11.65    8.86     9.5        29.9
|     tedlium              7.51     6.55     5.47     4.0        10.5
|
| I have a battery of more difficult tests on hand (including
| adversarial tests, and diverse accent-specific metrics). I'll
| look at running these tests on each of the Whisper model sizes
| and following up with a larger comparison.
| allanrbo wrote:
| Talon was the first thing that came to my mind when I saw
| this news. Would be nice if it could benefit from Whisper.
| (Big fan of your work on Talon!)
| ma2rten wrote:
| I'm looking forward to your comparison. It's really hard to
| make sense of how good this model actually is without being
| an expert in the area.
| nshm wrote:
| It is interesting how they compare with wav2vec2 instead of
| nemo conformer (which is more accurate) in Table 2.
| StevenWaterman wrote:
| One of the things they point out is that the SoTA on e.g.
| LibriSpeech is _only_ good at LibriSpeech, and doesn't
| generalise as well.
|
| > Because Whisper was trained on a large and diverse dataset
| and was not fine-tuned to any specific one, it does not beat
| models that specialize in LibriSpeech performance, a famously
| competitive benchmark in speech recognition. However, when we
| measure Whisper's zero-shot performance across many diverse
| datasets we find it is much more robust and makes 50% fewer
| errors than those models.
| lunixbochs wrote:
| My own experience agrees: the generally available "SOTA"
| models are not especially robust, and can be _extremely_ bad
| (>50% absolute error rate) at some tasks. I'll post some
| preliminary numbers in a sibling comment and look into
| running my full set of tests on Whisper.
|
| It looks like Whisper is probably leaving a lot of accuracy
| on the table, but initially it does seem to be a lot more
| robust than general "SOTA" models.
|
| For a quick comparison, Silero's accuracy charts are kind of
| nice because they post results for a large variety of
| datasets. Scroll down to the EN V6 xlarge EE model (not the
| xlarge CE) [1]
|
| [1] https://github.com/snakers4/silero-models/wiki/Quality-
| Bench...
| jawadch93 wrote:
| LanternLight83 wrote:
| Hoping to see this put to use in open source voice assistants,
| e.g. Mycroft
| liminalsunset wrote:
| I really wish I had this about half a year ago when I was
| building a tool to automatically turn online school lectures into
| searchable, clickable transcripts (kind of like YouTube or EdX
| transcripts).
|
| I was originally using Adobe Premiere Pro's speech to text to do
| it, and wrote Python to convert its output to the Hyperaudio
| format on GitHub. With this, I can totally skip all of that step
| and this is fully open source, too.
|
| App idea:
|
| Build an app that takes a video and uses Hyperaudio or a similar
| project to add a clickable and searchable transcript (clicking in
| transcript seeks video)
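|
| If it helps anyone heading down the same path, the clickable-
| transcript part can be sketched in a few lines from the segment
| timestamps (file names are placeholders; Hyperaudio or a real player
| would replace the inline onclick hack):
|
|     import html
|     import whisper
|
|     model = whisper.load_model("small")
|     result = model.transcribe("lecture.mp4")
|
|     # one <p> per segment; clicking it seeks the <video> element to that time
|     rows = "".join(
|         f'<p data-t="{seg["start"]:.2f}" onclick="v.currentTime=this.dataset.t;v.play()">'
|         f'[{int(seg["start"]) // 60:02d}:{int(seg["start"]) % 60:02d}] {html.escape(seg["text"].strip())}</p>'
|         for seg in result["segments"]
|     )
|     page = f'<video id="v" src="lecture.mp4" controls></video>\n{rows}'
|     open("lecture.html", "w", encoding="utf-8").write(page)
|
| Browser text search then doubles as transcript search, and each line
| jumps the video to the right spot.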
| resoluteteeth wrote:
| You could already do the speech recognition in a fully open
| source way with vosk easily, although Whisper may be more
| accurate
| throwamon wrote:
| Is it feasible to use this for Talon-like voice-driven computer
| usage?
| FloatArtifact wrote:
| Maybe, a number of speech recognition engines have been
| integrated into https://github.com/dictation-toolbox/dragonfly
| dubeye wrote:
| I know a manual transcription company, which is still seeing
| modest growth from existing clients who also use ASR, so it's not
| quite there yet
| londons_explore wrote:
| I wonder how much the 30 second window is impacting performance?
|
| Anecdotally, I feel like there are plenty of times that I need
| context from more than 30 seconds ago to understand some
| technical jargon that's being discussed.
| chrisstanchak wrote:
| Hold on to your papers
| smusamashah wrote:
| How well does it do for technical and domain oriented speech? For
| example I have audio recordings of a senior explaining some very
| technical aspects of our software. Will it understand the
| technical terms in that speech?
|
| I guess I will need to download it and run it to see how correct
| it is.
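|
| One cheap sanity check is to transcribe a known recording and grep
| the output for the terms you care about. A sketch (the glossary and
| file name here are made up):
|
|     import whisper
|
|     GLOSSARY = ["TCP/IP", "big endian", "FAT filesystem"]  # hypothetical domain terms
|
|     model = whisper.load_model("medium.en")
|     text = model.transcribe("internal_talk.mp3")["text"].lower()
|
|     for term in GLOSSARY:
|         status = "found" if term.lower() in text else "MISSING - check the transcript by hand"
|         print(f"{term}: {status}")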
| emcq wrote:
| Be wary of using this model - the licensing of this model seems
| sketchy. Several of the datasets used for training like WSJ and
| TED-LIUM have clear non-commercial clauses. I'm not a lawyer but
| releasing a model as "MIT" seems dubious, and hopefully OpenAI
| has paid for the appropriate licenses during training as they are
| no longer a research-only non profit.
| nshm wrote:
| I think they didn't use WSJ for training, only for evaluation.
| Paper includes WSJ under "Evaluation datasets"
| jefftk wrote:
| This is a big dispute right now: OpenAI and other AI companies
| generally take the position that models learning from data does
| not make the output of the models a derivative work of that
| data. For example, GitHub Copilot uses all publicly available
| GitHub code regardless of license, and
| DALLE-2/StableDiffusion/etc use lots of non-free images. I
| don't think this has been challenged in court yet, and I'm very
| curious to see what happens when it is.
| petercooper wrote:
| I think it might be even less problematic with something like
| Whisper than with DALLE/SD? Merely consuming data to train a
| system or create an index is not usually contrary to the law
| (otherwise Google wouldn't exist) - it's the _publication_ of
| copyrighted content that's thorny (and is something you can
| begin to achieve with results from visual models that include
| Getty Photos logo, etc.)
|
| I think it'd be a lot harder to make a case for an accurate
| audio to text transcription being seen to violate the
| copyright of any of the training material in the way a visual
| could.
| emcq wrote:
| This is even slightly more direct: access to WSJ data
| requires paying LDC for the download, and the pricing varies
| depending on what institution / license you're from. The cost
| may be a drop in the bucket compared to compute, but I don't
| know that these licenses are transferable to the end product.
| We might be a couple court cases away from finding out but I
| wouldn't want to be inviting one of those cases :)
| zeagle wrote:
| It would be exceptional to get a healthy competitor to
| Microsoft/Nuance's Dragon monopoly on voice recognition in
| healthcare. At a couple thousand bucks a license and the more
| recent SaaS subscription trend there is a lot of money to be made
| in that space.
| darkpicnic wrote:
| I just wrote a script with Hazel to automatically transcribe my
| voice notes to txt. It handles punctuation extremely well. What a
| wonderful contribution!
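|
| A sketch of what such a per-file script might look like (script
| name, model size and the .txt-next-to-the-audio convention are
| placeholders), so a Hazel rule only has to pass it the file path:
|
|     #!/usr/bin/env python3
|     # transcribe_note.py <audio file> -- writes a .txt transcript next to the note
|     import sys
|     from pathlib import Path
|
|     import whisper
|
|     audio_path = Path(sys.argv[1])
|     model = whisper.load_model("base.en")
|     result = model.transcribe(str(audio_path))
|     audio_path.with_suffix(".txt").write_text(result["text"], encoding="utf-8")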
| abidlabs wrote:
| Here [1] is a video tutorial on building a web UI that accepts
| microphone input and runs it through Whisper for speech
| transcription
|
| [1]
| https://www.youtube.com/watch?v=ywIyc8l1K1Q&ab_channel=1litt...
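|
| The gist of such a UI is only a few lines with Gradio (a sketch, not
| the tutorial's exact code; it assumes Gradio's microphone Audio
| component handing the recording over as a file path):
|
|     import gradio as gr
|     import whisper
|
|     model = whisper.load_model("base")
|
|     def transcribe(audio_path):
|         # Gradio passes the path of the recorded clip
|         return model.transcribe(audio_path)["text"]
|
|     gr.Interface(
|         fn=transcribe,
|         inputs=gr.Audio(source="microphone", type="filepath"),
|         outputs="text",
|     ).launch()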
| amrrs wrote:
| Thank you for sharing!
___________________________________________________________________
(page generated 2022-09-21 23:00 UTC)