[HN Gopher] Whisper - open source speech recognition by OpenAI
       ___________________________________________________________________
        
       Whisper - open source speech recognition by OpenAI
        
       Author : _just7_
       Score  : 1577 points
       Date   : 2022-09-21 16:16 UTC (1 days ago)
        
 (HTM) web link (openai.com)
 (TXT) w3m dump (openai.com)
        
       | wongarsu wrote:
       | > About a third of Whisper's audio dataset is non-English, and it
       | is alternately given the task of transcribing in the original
       | language or translating to English. We find this approach is
       | particularly effective at learning speech to text translation and
       | outperforms the supervised SOTA on CoVoST2 to English translation
       | zero-shot.
       | 
       | That's intriguing. You can just set the model to transcribe
       | everything into English, no matter which language the speaker is
       | using, and it just works. Given that many people are much better
       | at understanding English than at speaking it, this might make
       | voice interfaces much more accessible without much work.
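        | 
        | Judging by the README, that's a single flag on the CLI -
        | something like this (a sketch; the filename is a placeholder):
        | 
        |       whisper speech.mp3 --task translate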
        
       | FloatArtifact wrote:
       | This would be a cool thing to integrate into Dragonfly
       | https://github.com/dictation-toolbox/dragonfly
        
         | synkarius wrote:
         | It would. I wonder how this compares with Kaldi, one of the two
         | open source speech recognition engines that Dragonfly currently
         | supports.
        
       | rexreed wrote:
        | I'd love to find a way to test this with longer audio, but I
        | don't have GPU resources and I'm not exactly sure how to load
        | that into the Colab. Is anyone planning on hosting or sharing a
        | model that others can use to test longer-form audio (for podcast
        | transcription)?
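        | 
        | The closest I've gotten is this sketch in a Colab cell (not
        | tried on long files yet; the podcast filename is a placeholder):
        | 
        |       !pip install git+https://github.com/openai/whisper.git
        |       import whisper
        |       model = whisper.load_model("small")
        |       result = model.transcribe("podcast_episode.mp3")
        |       print(result["text"])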
        
       | londons_explore wrote:
       | I've never seen transcription and translation combined into a
       | single step like this before...
       | 
       | Have I been living under a rock, or is this new?
       | 
       | I assume it should help performance, because it means emphasis,
       | timing and tone can be used to inform the translation. Helps make
       | better guesses about information missing from the source
       | language.
        
       | jerpint wrote:
       | I recorded myself speaking French and was able to translate
       | decently well on my laptop. Very impressive!
        
       | jfoster wrote:
       | It seems like OpenAI are finally living up to their name for once
       | with this release? Anything I'm missing?
       | 
       | From what I can gather:
       | 
       | 1. Includes model weights. I can't find the URL, but they
       | reference them enough and have a CLI tool, so I presume I just
       | haven't found them yet.
       | 
       | 2. Includes code: https://github.com/openai/whisper
       | 
       | 3. Released under MIT License:
       | https://github.com/openai/whisper/blob/main/LICENSE
        
         | thesausageking wrote:
         | It's one model and in a non-strategic area where there are
         | existing open source projects (Kaldi, DeepSpeech, ...).
         | 
         | For a company that raised $1B, that's not exactly living up to
         | their name and original mission.
        
           | blagie wrote:
           | Yes. The same is true of many products from many companies.
           | 
           | I feel bad about GPT-3 and DALL-E being released under the
           | terms they were, but I don't feel bad about this. I'm not
           | going to condemn OpenAI for the good things they did, but I
           | will hold them accountable for bad things or good ones they
           | didn't do.
           | 
           | I'd given up on OpenAI being open or ethical, but this is a
           | start. It took them down from "evil super-villain" status to
           | mere villain.
        
           | whimsicalism wrote:
           | > It's one model and in a non-strategic area where there are
           | existing open source projects (Kaldi, DeepSpeech, ...).
           | 
           | I can already tell this is much better than any of the
           | existing open source projects with the exception of the wav2*
           | sequence of projects and potentially nvidia's nemo.
        
             | thesausageking wrote:
             | Kaldi is an open, pluggable framework and is a ton more
             | flexible and powerful than this. It's used by hundreds of
             | teams, including a number of consumer tech companies you've
             | heard of. They're not going to move to this over it.
             | 
              | Especially because ASR is a living organism. You have to
              | constantly update your language model as new people, ideas,
              | and words enter the normal lexicon. As people start talking
              | about "COVID", "metaverse", "King Charles", or whatever new
              | thing comes along, these need to be added to your language
              | model. You need these updates monthly at a minimum, and
              | OpenAI didn't release the raw data, which means you can't
              | retrain it even if you wanted to spend the time/resources.
             | 
             | So, this is an interesting research project and helpful for
             | small teams and side projects, but it's unlikely it makes
             | any real impact on the industry.
        
               | whimsicalism wrote:
               | Kaldi just is not fast or high quality enough compared to
               | other modern alternatives like wav2letter. I appreciate
               | that it is more flexible than this, it certainly is - but
               | I am not so sure about "powerful."
        
               | [deleted]
        
         | StevenWaterman wrote:
         | (Model weights from
         | https://github.com/openai/whisper/blob/main/whisper/__init__...
         | )
         | 
         | "tiny.en": "https://openaipublic.azureedge.net/main/whisper/mod
         | els/d3dd5..."
         | 
         | "tiny": "https://openaipublic.azureedge.net/main/whisper/models
         | /65147..."
         | 
         | "base.en": "https://openaipublic.azureedge.net/main/whisper/mod
         | els/25a85..."
         | 
         | "base": "https://openaipublic.azureedge.net/main/whisper/models
         | /ed3a0..."
         | 
         | "small.en": "https://openaipublic.azureedge.net/main/whisper/mo
         | dels/f953a..."
         | 
         | "small": "https://openaipublic.azureedge.net/main/whisper/model
         | s/9ecf7..."
         | 
         | "medium.en": "https://openaipublic.azureedge.net/main/whisper/m
         | odels/d7440..."
         | 
         | "medium": "https://openaipublic.azureedge.net/main/whisper/mode
         | ls/345ae..."
         | 
         | "large": "https://openaipublic.azureedge.net/main/whisper/model
         | s/e4b87..."
        
           | mmastrac wrote:
           | Large is 3GB to save everyone a click. Tiny is 72MB.
        
             | anigbrowl wrote:
              | That's unexpectedly lightweight - enough to run on some
              | phones.
        
               | yencabulator wrote:
                | However, https://github.com/openai/whisper#available-
                | models-and-langu... says it requires ~1 GB of VRAM.
        
         | solarmist wrote:
         | This kind of model is harder to abuse, so I guess it passed
         | their internal checks much more easily.
         | 
         | I can understand not releasing GPT-3, even if I disagree with
         | the decision.
        
           | ignoramous wrote:
           | > _This kind of model is harder to abuse, so I guess it
           | passed their internal checks much more easily._
           | 
           | The version I choose to believe: _stability.ai_ ate DALL-E
           | for lunch, and that woke them up.
        
             | solarmist wrote:
             | This is probably also true.
        
           | jfoster wrote:
           | True. The potential of GPT-3 to cause internet mayhem was/is
           | significant. I would argue that the mere act of announcing it
           | was still a catalyst for an eventual GPT-3-like model being
           | released. In revealing it, they established a target for what
           | open source models could aim to achieve, and simultaneously
           | got bad actors thinking about ways to abuse it.
        
             | zarzavat wrote:
             | It was a credible argument when GPT-3 was released. But now
             | there are open models that are as capable as GPT-3 and that
             | mayhem has not materialized, with the possible exception of
             | GPT-4chan. They could release it now under a non-commercial
             | license, if they cared to.
        
               | jfoster wrote:
               | Can you provide an example of an open model as capable as
               | GPT-3?
               | 
               | I know there's some "mini-GPT" type models around, but
               | they don't seem nearly as capable.
        
           | dwohnitmok wrote:
           | > I can understand not releasing GPT-3, even if I disagree
           | with the decision.
           | 
           | Why do you disagree?
        
             | bigyikes wrote:
             | I don't see how GPT-3 is any more dangerous than Stable
             | Diffusion, Photoshop, that fake news website the crazy
             | person you're friends with on Facebook really likes, or any
             | of the number of other tools and services that can be used
             | to generate or spread fake information.
        
               | jfoster wrote:
               | All of your examples are limited in some way, but GPT-3
               | wouldn't have any meaningful limits.
               | 
               | Stable Diffusion: Marks images as AI-generated.
               | (invisible watermark, but still, it's there)
               | 
               | Photoshop: Requires time & effort from a human.
               | 
               | Fake news website: Requires time & effort from a human.
        
               | xkapastel wrote:
               | I wouldn't really say Stable Diffusion marks images as
               | AI-generated. There's a script in the Stable Diffusion
               | repository that will do that, but it's not connected to
               | the model itself in a meaningful way. I use Stable
               | Diffusion a lot and I've never touched this script.
               | 
               | https://github.com/CompVis/stable-
               | diffusion/blob/69ae4b35e0a...
        
               | capableweb wrote:
               | What "script" are you using for doing txt2img? The
               | watermark function is automatically called when you use
               | the CLI in two places, https://github.com/CompVis/stable-
               | diffusion/blob/69ae4b35e0a... and
               | https://github.com/CompVis/stable-
               | diffusion/blob/69ae4b35e0a...
               | 
               | Trivial to remove, I give you that. But AFAIK, the
               | original repository + most forks put the watermark
               | automatically unless you've removed it on your own.
        
               | serf wrote:
               | >Trivial to remove, I give you that. But AFAIK, the
               | original repository + most forks put the watermark
               | automatically unless you've removed it on your own.
               | 
                | Almost all of the 'low-vram' variant forks either have an
                | argument to turn off the watermark (it saves a bit of
                | memory) or come with it disabled altogether.
        
               | nullc wrote:
               | It would be pretty trivial to have an invisible watermark
               | in GPT3 output-- though you don't really need one: just
               | score text with gpt3 to find out if it was likely gpt3
               | generated or not.
        
               | spullara wrote:
               | SD only does that if you don't delete the line of code
               | that does it...
        
               | [deleted]
        
             | mmh0000 wrote:
              | Because why should the wealthy and connected be the only
              | ones -allowed- to have access to such life-improving
              | technology?
        
             | solarmist wrote:
              | Two reasons. First, someone else will release something
              | similar. Second, I didn't see a related push from them to
              | work with others in the industry to do something productive
              | towards safety with the time they gained by delaying the
              | availability of these kinds of models. So it felt
              | disingenuous.
        
               | moyix wrote:
               | Several groups already have. Facebook's OPT-175B is
               | available to basically anyone with a .edu address (models
               | up to 66B are freely available) and Bloom-176B is 100%
               | open:
               | 
               | https://github.com/facebookresearch/metaseq
               | 
               | https://huggingface.co/bigscience/bloom
        
               | solarmist wrote:
               | Yup. I meant when it had just come out.
        
       | bredren wrote:
        | This is dropping right in the middle of Interspeech 2022.
        | 
        | I don't believe OpenAI has anyone presenting there, so presumably
        | this was timed to coincide with it and get buzz at the
        | conference.
        | 
        | Curious how this model compares with FOSS STT from the startup
        | Coqui.
        
       | Tistron wrote:
       | It understands my Swedish attempts at English really well with
       | the medium.en model. (Although, it gives me a funny warning:
       | `UserWarning: medium.en is an English-only model but receipted
       | 'English'; using English instead.`. I guess it doesn't want to be
       | told to use English when that's all it can do.)
       | 
        | However, it runs very slowly. It uses the CPU on my macbook,
        | presumably because it hasn't got an NVidia card.
        | 
        | Googling about that I found
        | [plaidML](https://github.com/plaidml/plaidml), a project
        | promising to run ML on many different GPU architectures. Does
        | anyone know whether it is possible to plug them together somehow?
        | I am not an ML researcher and don't really understand the
        | technical details of the domain, but I can understand and write
        | Python code in domains that I do understand, so I could do some
        | glue work if required.
        
       | revskill wrote:
        | It's actually better than Google Meet's subtitle system.
        
       | blueberrychpstx wrote:
       | This is absolute garbage python as I am neither a python
       | developer, nor a good developer. I was trying to play around with
       | real time transcriptions. However, it does work!
       | 
        | > * recording
        | > * done recording
        | > Recording saved to file.wav
        | > Press enter to transcribe
        | > /Users/laptop/Development/Personal/Public/pythonProject1/venv/lib/python3.9/site-packages/whisper/transcribe.py:70: UserWarning: FP16 is not supported on CPU; using FP32 instead
        | >   warnings.warn("FP16 is not supported on CPU; using FP32 instead")
        | > Detected language: english
        | > Goodbye, I need to go pick up my wife.
        | > Press enter to start recording
       | 
       | Any improvements welcome here.
       | 
        | ```
        | # Record a few seconds from the microphone, save it to a wav
        | # file, then transcribe it with Whisper. Requires pyaudio.
        | import wave
        | 
        | import pyaudio
        | import whisper
        | 
        | 
        | def record_microphone(seconds):
        |     CHUNK = 1024
        |     FORMAT = pyaudio.paInt16
        |     CHANNELS = 1
        |     RATE = 44100
        |     WAVE_OUTPUT_FILENAME = "file.wav"
        | 
        |     p = pyaudio.PyAudio()
        |     stream = p.open(format=FORMAT, channels=CHANNELS, rate=RATE,
        |                     input=True, frames_per_buffer=CHUNK)
        | 
        |     print("* recording")
        |     frames = []
        |     for _ in range(int(RATE / CHUNK * seconds)):
        |         frames.append(stream.read(CHUNK))
        |     print("* done recording")
        | 
        |     stream.stop_stream()
        |     stream.close()
        |     p.terminate()
        | 
        |     # Write the captured frames as a 16-bit mono wav file.
        |     wf = wave.open(WAVE_OUTPUT_FILENAME, 'wb')
        |     wf.setnchannels(CHANNELS)
        |     wf.setsampwidth(p.get_sample_size(FORMAT))
        |     wf.setframerate(RATE)
        |     wf.writeframes(b''.join(frames))
        |     wf.close()
        | 
        |     return WAVE_OUTPUT_FILENAME
        | 
        | 
        | if __name__ == '__main__':
        |     # Load the model once, then loop: record, then transcribe.
        |     model = whisper.load_model("base")
        |     while True:
        |         print("Press enter to start recording")
        |         input()
        |         filename = record_microphone(5)
        |         print("Recording saved to " + filename)
        |         print("Press enter to transcribe")
        |         input()
        |         result = model.transcribe(filename)
        |         print(result["text"])
        | ```
        
       | yawnxyz wrote:
       | Oh man I remember LOVING Micro Machines as a kid.
       | 
       | But also, this tool seems much better than Otter.ai, which gets
       | every third word wrong when transcribing microbiology recordings
        
       | alexb_ wrote:
       | Combine the translation + transcription with voice synthesis, and
       | once compute power allows for this to be miniaturized we will be
       | able to have babel-fish technology in real life.
        
       | no1youknowz wrote:
       | This is awesome. But I really want the other way.
       | 
       | To be able to give it text and hear the speech. A TTS (text to
       | speech).
       | 
        | As a language learner, the ability to create my own sentences
        | (based on existing ones I have, changing a word here or there)
        | would be amazing.
        | 
        | I wonder how long it will be until we have this. I know I could
        | use a service to do this currently, but I'd prefer having
        | something running locally.
       | 
       | Hopefully someone in the OpenAI team reads this. :)
        
         | freedomben wrote:
         | Likewise, TTS is what I really want. My goal is to be able to
         | create audio books from text. I've been using Amazon Polly and
         | it's acceptable quality, but I would be ecstatic to be able to
         | do it locally on my own hardware.
        
           | visarga wrote:
           | Check out NaturalReader. It has hundreds of amazing voices, a
           | system for highlighting text as it is being read, works on
           | books (pdf) and webpages, and is available on phones and in
           | browsers on all platforms. So I could have the same voice on
           | Mac, Linux and iPhone.
        
         | TaylorAlexander wrote:
          | I suspect this is coming. We do have decent text-to-speech
          | systems already, but in this vein of "we used neural networks
          | and now it's very, very good" you can imagine extending
          | something like GPT-3: they could use this speech-to-text system
          | so you could speak to it for input, and the natural progression
          | is that it uses text-to-speech to return the output, so you
          | have a voice-oriented conversational system.
         | 
         | So I think TTS is a logical part of the system. I also think
         | that there are peculiarities of voice interaction that aren't
         | captured in text training datasets, so they would need to do
         | some fine tuning on actual voice conversation to make it feel
         | natural.
         | 
         | All in due time I suppose.
        
           | visarga wrote:
           | A full NLP system would include speech recognition, TTS, a
           | large language model, and a vector search engine. The LM
           | should be multi modal, multi language and multi task, "multi-
           | multi-model" for short haha. I'm wondering when we'll have
           | this stack as default on all OSes. We want to be able to
           | search, transcribe, generate speech, run NLP tasks on the
           | language model and integrate with external APIs by intent
           | detection.
           | 
           | On the search part there are lots of vector search companies
           | - Weaviate, Deepset Haystack, Milvus, Pinecone, Vespa, Vald,
           | GSI and Qdrant. But it has not become generally deployed on
           | most systems, people are just finding out about the new
           | search system. Large language models are still difficult to
           | run locally. And all these models would require plenty of RAM
           | and GPU. So the entry barrier is still high.
        
             | TaylorAlexander wrote:
              | Ah, very interesting, thank you. I'm not familiar with
              | research into vector search; I'll look that up.
             | 
             | But yeah you make a good point about LLMs being too large
             | to run on a normal PC. I do somewhat suspect that we might
             | see some rapid acceleration in the size of neural network
             | processors as large models begin to offer more utility. I
             | think for now they have limited appeal but we're already
             | seeing things like Tesla's Dojo make large leaps in
             | capability to rapidly process complex networks.
             | 
             | In five to ten years we may see built in accelerators come
             | standard in most computers capable of running very complex
             | models. Already Apple provides ever more powerful
             | accelerators in their phones. You could imagine Adobe
             | offering real time diffusion models as part of Photoshop,
             | among other things.
        
       | noreally_ wrote:
       | A notebook is available to try with your microphone on Colab
       | here: https://colab.research.google.com/drive/1nBZ-
       | pDIaIi3N1DIIXvJ...
       | 
       | I'm surprised by the quality on non-English languages, given that
       | 80+% of the training data is English, and the rest is split
       | between tens of languages.
        
         | bambax wrote:
         | Thanks! I played with this in French and posted the results as
         | replies to this comment:
         | https://news.ycombinator.com/item?id=32928643
         | 
          | It's sometimes close to perfect, and sometimes goes off the
          | rails; I think the model maybe tries to establish some sort of
          | consistency for each sentence; if it starts wrong on the first
          | few words of a sentence, it can't build the rest properly.
          | 
          | But it's super fun.
        
         | berberous wrote:
         | How do you get this to translate instead of just transcribe?
        
           | paraschopra wrote:
           | Just specify language and record an audio in another
           | language.
           | 
           | >result = model.transcribe("audio.wav", language="english")
        
             | berberous wrote:
             | That actually seems to set the language for it to
             | transcribe (as opposed to it guessing), with the following
             | triggering a translation to English:
             | 
             | result = model.transcribe("audio.wav", task="translate")
             | 
             | But your post helped me figure out the above, so thank you!
        
           | tekacs wrote:
           | To be more specific than the above:
           | 
            | 1. Make sure you're using a model that isn't suffixed with
            | `.en` (`base`, not `base.en`).
            | 
            | 2. Use `model.transcribe(your_input_audio,
            | language='Japanese', task='translate')` ... with the
            | appropriate input language.
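            | 
            | Putting it together, a minimal sketch (the audio path and
            | source language are just example values):
            | 
            |       import whisper
            |       model = whisper.load_model("base")
            |       result = model.transcribe("interview_ja.wav",
            |                                 language="Japanese",
            |                                 task="translate")
            |       print(result["text"])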
        
       | goffi wrote:
        | Really interesting, I can see a ton of potential uses.
        | 
        | 2 questions:
        | 
        | 1) How does it compare to state-of-the-art FOSS solutions? I'm
        | thinking of DeepSpeech or Vosk.
        | 
        | 2) Would it be somehow possible to associate timestamps with the
        | words recognized? That would be amazing for things such as audio
        | editing or skipping to a particular location in a video.
        
         | nshm wrote:
          | You rightly mention timestamps. There are many other important
          | properties of a good ASR system, like vocabulary adaptability
          | (whether you can introduce new words), streaming, confidences,
          | or output latency. Compared to Vosk models, this model can not
          | work in a streaming manner, so it is not very suitable for
          | real-time applications.
         | 
         | But in general the model is robust and accurate and trained on
         | the amount of speech we never dreamed about in Vosk. We will
         | certainly benefit from this model as a teacher (together with
         | others like gigaspeech models). I recently wrote about it
         | https://alphacephei.com/nsh/2022/06/14/voting.html
        
         | goffi wrote:
         | > goffi
         | 
          | For 2), it's actually written in the description: "phrase-level
          | timestamps", so it should be possible (phrase level is neat for
          | skipping to a specific location in a video, but maybe not for
          | audio editing).
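          | 
          | From a quick look at the README, the timestamps come back per
          | segment in the result dict; a rough sketch of pulling them out
          | (the file name is a placeholder):
          | 
          |       import whisper
          |       model = whisper.load_model("base")
          |       result = model.transcribe("episode.mp3")
          |       for seg in result["segments"]:
          |           print(seg["start"], "-->", seg["end"], seg["text"])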
        
         | catfan wrote:
        
       | IceWreck wrote:
        | Is there a list of system requirements somewhere? Can it run on
        | cheaper low-memory GPUs? Maybe CPUs?
        
         | yjftsjthsd-h wrote:
         | On my ancient desktop it happily fell back to running on CPU
         | just fine.
        
         | StevenWaterman wrote:
          | Their models range from 70MB to 3GB. The largest model is
          | smaller than the optimised Stable Diffusion. Not sure what the
          | inference speed is like; I haven't tried it myself yet.
        
           | IceWreck wrote:
            | I just tested it myself. It's fast enough on Colab (a couple
            | of seconds), but I'm not sure if it's fast enough to
            | transcribe realtime audio yet.
        
             | lynguist wrote:
             | "small" runs in realtime on Macbook Air M1 CPU.
        
             | MacsHeadroom wrote:
             | Colab is using one of the larger models. Tiny probably runs
             | in realtime on a single core of an RPi.
        
       | [deleted]
        
       | mewse-hn wrote:
       | I know this isn't a tech support forum but maybe someone here
       | knows. I'm attempting the sample python code from the github and
       | _almost_ get a transcription running on my work laptop without a
       | GPU, but I run into this error message:
       | 
       | >>> result = whisper.decode(model, mel, options)
       | 
       | Traceback (most recent call last):
       | 
       | [snip]
       | 
       | RuntimeError: "slow_conv2d_cpu" not implemented for 'Half'
       | 
       | It looks like a Torch error, is there some twiddling with
       | "options" I can do to get it to run?
        
         | mewse-hn wrote:
         | I seem to have worked around it by tweaking the "options" line
         | from the sample code to this:
         | 
         | >>> options = whisper.DecodingOptions(fp16=False)
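          | 
          | For reference, the whole snippet with that tweak looks roughly
          | like this (adapted from the README example, so treat it as a
          | sketch; the audio path is a placeholder):
          | 
          |       import whisper
          |       model = whisper.load_model("base")
          |       audio = whisper.pad_or_trim(whisper.load_audio("audio.mp3"))
          |       mel = whisper.log_mel_spectrogram(audio).to(model.device)
          |       _, probs = model.detect_language(mel)
          |       print(f"Detected language: {max(probs, key=probs.get)}")
          |       # fp16=False avoids the 'Half' conv error on CPU
          |       options = whisper.DecodingOptions(fp16=False)
          |       result = whisper.decode(model, mel, options)
          |       print(result.text)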
        
         | ignite wrote:
          | I am running on a work laptop without using the GPU (I'm
          | running in Docker). I just get
          | 
          |       warnings.warn("FP16 is not supported on CPU; using FP32 instead")
          | 
          | And it works.
        
       | arpankapoor wrote:
       | I tried it out on a Hindi speech
       | (https://www.youtube.com/watch?v=4EpfJxKyosE). The transcription
       | starts off decent, but kind of gets stuck repeating the same
       | thing at the 02:40 mark:                   [00:00.000 -->
       | 00:10.000]  pcaas taal meN hmne prgtii kiye, isse ko iNttaar
       | nhiiN kr sktaa /          [00:10.000 --> 00:20.000]  chunaao ke
       | dauraan vott maaNgte hue, srkaar kii niitiyoN pr ktthor se ktthor
       | prhaar krte hue,         [00:20.000 --> 00:28.000]  aur puraanii
       | srkaar kii niitiyoN nhiiN aalocnaa krne ke lie laik bhut saamgrii
       | thii /          [00:28.000 --> 00:35.000]  hr jge maiNne ye khaa
       | ki maiN un logoN meN se nhiiN huuN, jo pcaas vrc kii uplddyoN pr
       | paanii phir de /          [00:35.000 --> 00:43.000]  aisaa krnaa
       | desh ke purssaarth pr paanii phirnaa hogaa /  aisaa krnaa desh ke
       | kisaan ke saath anyaay krnaa hogaa /          [00:43.000 -->
       | 01:01.000]  mlduur ke saath jaattii krnii hogaa /  aam aadmii ke
       | saath bhii vo acchaa vyohaar nhiiN hogaa /  jo svaal aaj mn meN
       | ucchaa hai aur ucchnaa caahii hai /  aadaavii ko pcaas saath hone
       | aaye, hm jaintii mnaane jaa rhe haiN /          [01:01.000 -->
       | 01:18.000]  aaj desh kii stitii kyaa hai /  hm pichr ke hoge haiN
       | /  prgtii kii dodd' meN, jo desh hmaare saath aajaad hue the, vo
       | hm se aage bddh' ke /  jo desh hmaare baac jn meN the, vo hmeN
       | piice chodd' the /          [01:18.000 --> 01:34.000]  duniyaa ke
       | grii tm deshoN meN hmaarii gdd'n aaye /  viis phiij'ii se jaanaa
       | lo griibii kii rekaa ke niice /  raaktptii mhudaay ke vibhaashn
       | meN gaauuN kaa ullek haiN naa piire kaa paanii nhiiN /
       | [01:34.000 --> 01:50.000]  hm praathmii shikssaa anivaare nhiiN
       | kr skte haiN /  lddkiyoN kii shikssaa kii upekssaa ho rhii haiN /
       | lddki kaa jnm lenaa to is desh meN abhii tk ek abhishaap hai /
       | [01:50.000 --> 02:07.000]  kyaa srkaarii kdm utthaakr smaaj meN
       | jaagdRtii paidaa krkeN /  kyaa sb logoN ko juttaakr ye to aisaa
       | kaam hai jis meN koii dlbNdii ke lie isthaan nhiiN /  hm desh kaa
       | nkssaa nhiiN bdl skte haiN /  desh meN saadhnoN kii kmii nhiiN
       | hai /          [02:07.000 --> 02:07.000]  aur saadhnoN kii agr
       | kmii hai to usko tthiik dnt se praapt kiyaa jaa sktaa hai /
       | saadhn bdd'aae bhii jaa skte hai /  lekin jo saadhn haiN unkaa
       | tthiik upyog nhiiN ho rhaa /  jNtaa ke upr tteks lgaakr jo dnni
       | kptaa kiyaa jaataa hai /  uskaa laag jNtaa tk nhiiN phu
       | [02:37.000 --> 02:37.000]  rkhkm jaatii hai /  videshii baiNko
       | meN dn jaane kaa silsilaa abhii tk kyoN kaaeN hai /  usko lokne
       | ke lie kyaa kdm utthaaege /  hm videshii puujii ke lie
       | praitrshiil haiN videshii puujii aae aur agr videshii puujii
       | aatii hai acche dnt kii ttek         [03:07.000 --> 03:07.000]
       | acche dnt kii puujii aatii hai acche dnt kii puujii aatii hai
       | acche dnt kii puujii aatii hai acche dnt kii puujii aatii hai
       | [03:37.000 --> 03:39.000]  acche dnt kii puujii aatii hai acche
       | dnt kii puujii aatii hai         [04:07.000 --> 04:09.000]  acche
       | dnt kii puujii aatii hai acche dnt kii puujii aatii hai
       | [04:37.000 --> 04:39.000]  acche dnt kii puujii aatii hai acche
       | dnt kii puujii aatii hai
       | 
       | The translation does a much better job however:
       | [00:00.000 --> 00:10.000]  In the last 50 years, we have made
       | progress, no one can deny this.         [00:10.000 --> 00:20.000]
       | During the elections, while asking for votes, while attacking the
       | government's policies harshly,         [00:20.000 --> 00:28.000]
       | and to criticize the policies of the old government, a lot of
       | material was needed.         [00:28.000 --> 00:35.000]
       | Everywhere, I have said that I am not one of those people who
       | pour water on the fruits of 50 years.         [00:35.000 -->
       | 00:39.000]  To do this, we will have to pour water on the efforts
       | of the country.         [00:39.000 --> 00:43.000]  To do this, we
       | will have to do injustice with the farmers of the country.
       | [00:43.000 --> 00:45.000]  We will have to do caste with the
       | laborers.         [00:45.000 --> 00:50.000]  Even with the common
       | man, that will not be a good behavior.         [00:50.000 -->
       | 00:55.000]  The question that arises in the mind today and should
       | arise,         [00:55.000 --> 01:01.000]  Freedom has come to be
       | 50 years, we are going to celebrate.         [01:01.000 -->
       | 01:04.000]  What is the situation of the country today?
       | [01:04.000 --> 01:07.000]  Why did we get separated?
       | [01:07.000 --> 01:14.000]  In the race of progress, the country
       | that got freedom along with us, they went ahead of us.
       | [01:14.000 --> 01:19.000]  The country that was after us, they
       | left us behind.         [01:19.000 --> 01:25.000]  In the poorest
       | countries of the world, they counted us.         [01:25.000 -->
       | 01:29.000]  20% of the population is below the poverty line.
       | [01:29.000 --> 01:35.000]  In the speech of the President, there
       | is no mention of villages or drinking water.         [01:35.000
       | --> 01:39.000]  We cannot enforce primary education.
       | [01:39.000 --> 01:43.000]  The education of girls is being
       | neglected.         [01:43.000 --> 01:50.000]  The birth of a girl
       | is still a curse in this country.         [01:50.000 -->
       | 01:55.000]  Is it by taking government steps, by creating
       | awareness in the society?         [01:55.000 --> 02:01.000]  Is
       | it by uniting all the people that there is no place for party?
       | [02:01.000 --> 02:05.000]  Can't we change the map of the
       | country?         [02:05.000 --> 02:08.000]  There is no shortage
       | of resources in the country.         [02:08.000 --> 02:14.000]
       | And if there is a shortage of resources, it can be obtained in
       | the right way, resources can be increased.         [02:14.000 -->
       | 02:21.000]  But the resources that are there, they are not being
       | used properly.         [02:21.000 --> 02:30.000]  The wealth that
       | is collected by taxing the public, its profit does not reach the
       | public, it does not reach the common man.         [02:30.000 -->
       | 02:32.000]  Where does it go?         [02:32.000 --> 02:35.000]
       | Whose pockets are filled?         [02:35.000 --> 02:39.000]
       | Whose treasury does that money go to?         [02:39.000 -->
       | 02:44.000]  Why is the chain of money going to foreign banks
       | still established?         [02:44.000 --> 02:47.000]  What steps
       | have been taken to stop it?         [02:47.000 --> 02:52.000]  We
       | are motivated for foreign worship, foreign worship has come.
       | [02:52.000 --> 03:01.000]  And if foreign worship comes for good
       | technology, for infrastructure,         [03:01.000 --> 03:06.000]
       | for education, then no one will object.         [03:06.000 -->
       | 03:11.000]  I believe that our communist friends will not object
       | either.         [03:11.000 --> 03:19.000]  But is the maximum use
       | of the resources in the country happening?         [03:19.000 -->
       | 03:26.000]  Is it not true that corruption has become a national
       | disease?         [03:26.000 --> 03:31.000]  I remember that
       | Swargi Rajiv Gandhi had said in a speech that I send one rupee
       | from Delhi,         [03:31.000 --> 03:36.000]  but where I send
       | the rupee, as I reach there, 19 paise are left.
       | [03:36.000 --> 03:41.000]  I asked him how this miracle happens.
       | [03:41.000 --> 03:47.000]  Bhaskar said that when the rupee runs,
       | it shrinks.         [03:47.000 --> 03:54.000]  The rupee shrinks,
       | it gets into the hand, it goes into the pocket, it becomes small.
       | [03:54.000 --> 03:58.000]  It is difficult to recognize the
       | rupee.         [03:58.000 --> 04:02.000]  The rupee can be
       | hidden.         [04:02.000 --> 04:06.000]  The situation of the
       | currency of the country is not good.         [04:06.000 -->
       | 04:10.000]  First, the government expenditure has increased, it
       | is increasing.         [04:10.000 --> 04:17.000]  It needs common
       | consent to reduce without reducing.         [04:17.000 -->
       | 04:24.000]  No one can work in the same way.         [04:24.000
       | --> 04:27.000]  Yes, our old Prime Minister Narasimha Raoji,
       | [04:27.000 --> 04:34.000]  if he would have tried in this
       | direction after stabilizing himself, then he would have
       | succeeded.         [04:34.000 --> 04:47.000]  But he was stuck in
       | some such things that he could not pay attention to these
       | problems.
        
       | O__________O wrote:
       | Anyone know if it is possible to output IPA using this?
       | 
       | International Phonetic Alphabet (IPA)
       | 
       | - https://wikipedia.org/wiki/International_Phonetic_Alphabet
       | 
       | _________
       | 
       | EDIT: Based on list of languages in the tokenizer code here,
       | doesn't appear IPA is supported:
       | 
       | https://github.com/openai/whisper/blob/5f8d4bcc254d4f3e833d3...
        
       | gck1 wrote:
        | Got my hopes high that there's finally an open source solution
        | that can deal with the Georgian language, only to have my hopes
        | brutally destroyed. It successfully detects the language and then
        | produces garbage. Passing the language manually produced similar
        | results.
       | 
       | Result of my own recording:                 Detected language:
       | georgian        yichiyannaaisw  remnants ts founding hockey slee
       | syi eling bhthwaaicularly qfiithoAPPLAUSEPS thDavPin Dao  pDING
       | Mozai pryadk uk aa orchestral uk aa arter uu BrettM
       | hilarious l ryy ywaa vark pk  *   Poll statements lypson. ch`ch`r
       | uesi[?]meislemveerrshueairelmirisasasssesserersiveesrrilmexre
       | reimimep`emsese
       | 
       | Results of clear Georgian audio [1].
       | 
       | On tiny model:                 Detected language: georgian
       | [00:00.000 --> 00:21.560]  en       [00:21.560 --> 00:23.240] Wo
       | Lun Lun ...       [00:23.280 --> 00:43.720] Wo Lun Lun Lun Lun
       | Lun Lun Lun Lun Lun Lun Lun Lun Lun Lun Lun Lun Lun Lun Lun Lun
       | Lun Lun Lun Lun Lun Lun Lun Lun Lun Lun Lun Lun Lun Lun Lun Lun
       | Lun Lun Lun Lun Lun Lun Lun Lun Lun Lun Lun Lun Lun Lun Lun Lun
       | Lun Lun Lun Lun Lun Lun Lun Lun Lun Lun Lun Lun Lun Lun Lun Lun
       | Lun Lun Lun Lun Lun Lun Lun Lun Lun Lun Lun Lun Lun Lun Lun Lun
       | Lun Lun Lun Lun Lun Lun Lun Lun Lun Lun Lun Lun Lun Lun Lun Yin
       | Wei b forestry
       | 
       | On medium model:                 Detected language: georgian
       | sreiresrrrrrrrrrrrrrnsssrrrrree rrirrrrrrrrre
       | rsrngnrrrrsrrrrrrrorrrrrrrrrrr kLbHMHMHMHMHMHMHMHMMHMMMMMMMMMMMMM
       | MMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMM hMLhM        hMM hMM
       | HMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMM
       | KklllM Hklll so lll lllll k kbsbk i K H H k l H k lI H m
       | cizar wait h examined pwaadmHa`eM supervision ng
       | ieeeeeeeeeeeeeeeee maee eaeeeeeeeeeeeeeeeeeeee daeeeeeeeeeeeee
       | ueeeeeeeeeeeee ea [?] mii smeii mmiei Yk` siiei        savie
       | siiit`t`iimemi, raee siime siii g'iiiiceiri saeieii siiei si
       | veep`veiiie k`leesheeroeeeeeeeeeeeee. egez
       | eqaksheieeeeeeeeeeeeeeeeeeeeeeeeeeeeea, nrropiroo mmumin
       | seeknp`ee see[?]igosh szhebimeleleekirpie semeime seeimmm
       | seenemeei se t Famose mindeyo hqe bywall jaini threshold ji jani
       | den poder vlogging bywall Take the text Ba        tou yodamj je
       | te shake ba te shake baou contour but whatever Baou cube baou cup
       | Baou rope Baou people Qeful Qeful imiiimibt`mit`iiit`iiiiiiii
       | raoeoooenpeeeieiiiiiiiiiomiiiiiiiii riiiiiiiiiiimii
       | nseeeeeeeeeeeeeee
       | sareeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeee
       | mjwai[?]v eeeid[?]vv nabdadeb        lmireeeep`eduiveveeeiieeeee
       | rareieeeeveeeeevee sarreeeeeeeeeeeeeeeeeeeeeeeeeee
       | xshiiiiiiiiiiiii liiiiiii liiiiiiiiii liii liiiiiii laiiiii
       | eiiiiiiiiiiiiiii iiii m
       | 
       | I've also tested it on few other audio inputs and it failed to
       | produce meaningful results on all of them with all models.
       | 
        | There was one case, with another audio [2] and the tiny model,
        | where it got at least some words close to their phonetic values,
        | but it printed them in Cyrillic instead of Georgian and tried to
        | interpret some Georgian words as Russian:
        | 
        |       whisper audio.wav --language Georgian --task transcribe --model tiny
       | [00:00.000 --> 00:02.000]  <<Zurab Gercha Dzhaparzis Gants
       | Khatevarom       [00:02.000 --> 00:04.000]  umeren tsupasu
       | Khizgeblotu kashchepasta       [00:04.000 --> 00:06.000]  a
       | opozatsionermii chlen shonakhlari       [00:06.000 --> 00:07.000]
       | s drorodisat Sakartolom       [00:07.000 --> 00:09.000]  s
       | akutariteritoriia biunda dai bronos       [00:09.000 -->
       | 00:10.000]  ta tasovyi torusi sam kadr       [00:10.000 -->
       | 00:12.000]  Sakartolomshii rovno ukraienistu       [00:12.000 -->
       | 00:13.000]  shchoigo eknebo       [00:13.000 --> 00:14.000]
       | amsiasakheb kirchi metitausu       [00:14.000 --> 00:15.000]
       | khlebisliderma       [00:15.000 --> 00:17.000]
       | utsnoktangadatsema shcheisiaa ugraduntsa       ...
       | 
       | [1] https://www.youtube.com/watch?v=rE_zx_6RhL0 [2]
       | https://www.youtube.com/watch?v=elrXgO8hjtI
        
       | jcims wrote:
       | Did respectably with some mumble rap:
       | https://controlc.com/d353dafb
       | 
       | (some NSFW words in the lyrics obv)
        
         | derangedHorse wrote:
         | Whisper performed a lot better than I would've expected it to!
        
       | mmh0000 wrote:
       | Okay this is super impressive. I just downloaded Whisper and fed
       | it a random flac file I had handy and it did a really good job.
       | Also impressive that it works on my weak CPU:
       | 
       | A 3m07s flac took 5m to transcribe:                 $ whisper
       | --device cpu 'BLACKPINK - BORN PINK/01 Pink Venom.flac'
       | Detecting language using up to the first 30 seconds. Use
       | `--language` to specify the language       Detected language:
       | korean       [00:00.000 --> 00:10.000]  Blackpink
       | [00:11.000 --> 00:14.000]  Kick in the door, wave in the coco
       | [00:14.000 --> 00:16.000]  pabkonineun cinge ggyeodeul saenggag
       | malgo       [00:16.000 --> 00:19.000]  I talk to talk, run ways I
       | walk walk       [00:19.000 --> 00:21.000]  him gamgo pab pab an
       | bwado ceog       [00:21.000 --> 00:24.000]  By one and two by two
       | [00:24.000 --> 00:26.000]  nae songgeut du hanae tamyeon ajieun
       | jung       [00:26.000 --> 00:30.000]  gas jasyo jigeum hwaryeohae
       | T makes no sense       [00:30.000 --> 00:32.000]  You couldn't
       | get a dollar out of me       [00:33.000 --> 00:38.000]  ja oneul
       | bamiya nuntobeul pumgo       [00:38.000 --> 00:41.000]  mihoneul
       | bbaeseum down       [00:41.000 --> 00:43.000]  Look what you made
       | us do       [00:43.000 --> 00:47.000]  ceonceonhi neol jamjaeul
       | paieo       [00:48.000 --> 00:52.000]  jami nal mankeum
       | areumdaweo       [00:52.000 --> 00:53.000]  I bring the pain like
       | [00:53.000 --> 00:57.000]  diseutab, paengpaeng, diseutab,
       | paengpaeng, diseutab, paengpaeng, paengpaeng       [00:57.000 -->
       | 00:58.000]  Get em, get em, get em       [00:58.000 -->
       | 01:00.000]  Straight till you don't like       [01:00.000 -->
       | 01:01.000]  Whoa, whoa, whoa       [01:01.000 --> 01:03.000]
       | Straight till you don't like       [01:03.000 --> 01:04.000]  Ah,
       | ah, ah       [01:04.000 --> 01:05.000]  Taste that, pink venom
       | [01:05.000 --> 01:06.000]  Taste that, pink venom
       | [01:06.000 --> 01:08.000]  Taste that, pink venom
       | [01:08.000 --> 01:09.000]  Get em, get em, get em
       | [01:09.000 --> 01:11.000]  Straight till you don't like
       | [01:11.000 --> 01:12.000]  Whoa, whoa, whoa       [01:12.000 -->
       | 01:13.000]  Straight till you don't like       [01:13.000 -->
       | 01:14.000]  Ah, ah, ah       [01:14.000 --> 01:15.000]  Blackpink
       | and Amo       [01:15.000 --> 01:17.000]  Got it by the smack ram
       | [01:17.000 --> 01:18.000]  But rest in peace       [01:18.000 -->
       | 01:19.000]  Please light up a candle       [01:19.000 -->
       | 01:20.000]  This the knife of a vando       [01:20.000 -->
       | 01:22.000]  Messed up and I'm still in saline       ...SNIP...
        
         | lunixbochs wrote:
         | Looks like it defaults to the model called "small".
         | 
          | I just ran some benchmarks - M1 Max, pytorch, with a 1.29
          | second flac (looks like the matrix math was running on a single
          | thread):
          | 
          |       tiny      146.522ms detect_lang     549.131ms decode_one    0.057ms tokenizer
          |       base      354.885ms detect_lang    1046.679ms decode_one    0.011ms tokenizer
          |       small     803.892ms detect_lang    3194.503ms decode_one    0.017ms tokenizer
          |       medium   2279.689ms detect_lang   10128.255ms decode_one    0.023ms tokenizer
          |       large    3656.478ms detect_lang   17249.024ms decode_one    0.016ms tokenizer
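          | 
          | For anyone who wants to reproduce numbers in this shape, a
          | rough sketch (not my exact harness; the flac path is a
          | placeholder):
          | 
          |       import time
          |       import whisper
          |       model = whisper.load_model("tiny")
          |       audio = whisper.pad_or_trim(whisper.load_audio("clip.flac"))
          |       mel = whisper.log_mel_spectrogram(audio).to(model.device)
          |       t0 = time.perf_counter()
          |       _, probs = model.detect_language(mel)
          |       t1 = time.perf_counter()
          |       whisper.decode(model, mel, whisper.DecodingOptions(fp16=False))
          |       t2 = time.perf_counter()
          |       print(f"detect_lang {(t1 - t0) * 1000:.3f}ms  "
          |             f"decode_one {(t2 - t1) * 1000:.3f}ms")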
        
         | adgjlsfhk1 wrote:
         | For more benchmarks on an rtx 2060 (6gb), the "small" model for
         | me is roughly 10x real-time and the tiny model is 30x real-
         | time.
        
       | lazylion2 wrote:
       | I ran it on this clip
       | 
       | https://clips.twitch.tv/ReliablePopularWerewolfOSkomodo-pcuw...
       | 
       | because... hard accent.
       | 
        | On the first run, whisper thought it was Welsh, so I had to run
        | with --language en, and it did pretty well.
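        | 
        | i.e. something along these lines (the filename is a placeholder):
        | 
        |       whisper clip.mp3 --language en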
       | 
       | https://i.imgur.com/TQiYU9X.png
       | 
       | took 36 seconds in Google colab
        
       | manishsharan wrote:
        | Oh, it's a relief to have something open source in this field. I
        | had been using Mozilla DeepSpeech for transcribing my voice
        | notes, often with hilarious to incomprehensible results.
        | DeepSpeech is dead, so I will be sure to check this out.
        
         | pabs3 wrote:
         | DeepSpeech got spun out of Mozilla to coqui.ai and they are
         | continuing the open nature of the project.
        
       | w10-1 wrote:
       | Naively, training the same model on multiple languages has
       | interesting implications.
       | 
       | On one hand, it may capture something "deeper" about language.
       | 
       | On the other hand, it's likely to do great in general, but miss
       | particularities of some language.
       | 
        | Understanding the coverage of the model's training data seems a
        | perennial problem. Is there any (shorthand) way to compare
        | language model training corpora?
       | 
       | Clearly if they use common subsets we have a literal comparison.
       | I'm more interested in whether there's progress in characterizing
       | corpora by speech styles, fluency, vocabulary sets, (noise)
       | environment, emotionality, proposition types, etc.
       | 
       | (btw: 25 minutes for a 9-minute segment on a 12-thread x86. Lots
       | of jargon spelled as it sounds. Sentences capitalized but no
       | punctuation. Overall good.)
        
       | dindindin wrote:
       | I'm not in the Speech Recognition circles and am looking for open
       | source speech recognition I can play around with - would this be
       | the new state of the art?
        
         | mercurywells wrote:
          | For me, as a deaf person, the current state of the art (in
          | terms of speed & usability) is the Recorder app on a Google
          | Pixel phone (4a/6 Pro is what I've used).
        
         | StevenWaterman wrote:
         | Yes
        
         | visarga wrote:
         | Most probably
        
       | graderjs wrote:
        | The big question is: why is Google's speech recognition in
        | Gboard voice typing still so shit?
        | 
        | https://news.ycombinator.com/item?id=32862172
        | 
        | This MIT-licensed model seems way better.
        
       | The5thElephant wrote:
        | How is it that Apple, Google, and Microsoft are not further ahead
        | of the game on speech recognition like this? They have the
        | resources to hire the best ML researchers and throw tons of
        | computing hours at it, yet Siri, Google, and Cortana continue to
        | struggle to get anywhere near this level of comprehension.
        
         | wongarsu wrote:
         | Siri and Cortana have to run at least in real time, with
         | reasonable compute resources. Probably faster than real time
         | when the audio gets shipped off to the cloud and transcribed
         | there. This model can't do that (in the "large" version, which
         | the examples use).
         | 
         | Also, you are comparing Whisper's highlight reel with everyday
         | performance of other models. Nobody shows their weaknesses in
         | their highlight reel.
        
           | alex_marchant wrote:
            | Siri until iOS 15 was done in the cloud, IIRC.
        
           | coder543 wrote:
           | Someone else in this thread[0] said Whisper was running at
           | 17x real time for them. So, even a weak machine might be able
           | to do an acceptable approximation of real time with Whisper.
           | 
           | Also, I feel like shipping to the cloud and back has been
           | shown to be just as fast as on device transcription in a lot
           | of scenarios. Doing it on device is primarily a benefit for
           | privacy and offline, not necessarily latency. (Although,
           | increasingly powerful smartphone hardware is starting to give
           | the latency edge to local processing.)
           | 
           | Siri's dictation has had such terrible accuracy for me (an
           | American English speaker without a particularly strong
           | regional accent) and everyone else I know for so many years
           | that it is just a joke in my family. Google and Microsoft
           | have much higher accuracy in their models. The bar is so low
           | for Siri that I automatically wonder how much Whisper is
           | beating Siri in accuracy... because I assume it has to be
           | better than that.
           | 
           | I really wish there was an easy demo for Whisper that I could
           | try out.
           | 
           | [0]: https://news.ycombinator.com/item?id=32928207
        
             | lunixbochs wrote:
             | 17x realtime _on a 3090_
             | 
             | I did some basic tests on CPU, the "small" Whisper model is
             | in the ballpark of 0.5x realtime, which is probably not
             | great for interactive use.
             | 
             | My models in Talon run closer to 100x realtime on CPU.
        
               | coder543 wrote:
               | "CPU" isn't necessarily the benchmark, though. Most
               | smartphones going back years have ML inference
               | accelerators built in, and both Intel and AMD are
               | starting to build in instructions to accelerate
               | inference. Apple's M1 and M2 have the same inference
               | accelerator hardware as their phones and tablets. The
               | question is whether this model is a good fit for those
               | inference accelerators, and how well it works there, or
               | how well it works running on the integrated GPUs these
               | devices all have.
               | 
               | Brute forcing the model with just traditional CPU
               | instructions is fine, but... obviously going to be pretty
               | slow.
               | 
               | I have no experience on the accuracy of Talon, but I've
               | heard that most open source models are basically overfit
               | to the test datasets... so their posted accuracy is often
               | misleading. If Whisper is substantially better in the
               | real world, that's the important thing, but I have no
               | idea if that's the case.
        
               | lunixbochs wrote:
               | See https://news.ycombinator.com/item?id=32929029 re
               | accuracy, I'm working on a wider comparison. My models
               | are generally more robust than open-source models such as
               | Vosk and Silero, but I'm definitely interested in how my
               | stuff compares to Whisper on difficult held-out data.
               | 
               | > Brute forcing the model with just traditional CPU
               | instructions is fine, but... obviously going to be pretty
               | slow.
               | 
               | It's not that simple. Many of the mobile ML accelerators
               | are more targeted for conv net image workloads, and
               | current-gen Intel and Apple CPUs have dedicated hardware
               | to accelerate matrix math (which helps quite a bit here,
               | and these instructions were in use in my tests).
               | 
               | Also, not sure which model they were using at 17x
               | realtime on the 3090. (If it's one of the smaller models,
               | that bodes even worse for non-3090 performance.) The 3090
               | is one of the fastest ML inference chips in the world, so
               | it doesn't necessarily set realistic expectations.
               | 
               | There are also plenty of optimizations that aren't
               | applied to the code we're testing, but I think it's
               | fairly safe to say the Large model is likely to be slow
               | on anything but a desktop-gpu-class accelerator just due
               | to the sheer parameter size.
        
               | lunixbochs wrote:
                | Ok, my test harness is ready. My A40 box will be busy
                | until later tonight, but on an NVIDIA A2 [1], this is the
                | batchsize=1 throughput I'm seeing. Common Voice, default
                | Whisper settings, card is staying at 97-100% utilization:
                | 
                |       tiny.en:   ~18 sec/sec
                |       base.en:   ~14 sec/sec
                |       small.en:  ~6 sec/sec
                |       medium.en: ~2.2 sec/sec
                |       large:     ~1.0 sec/sec (fairly wide variance when
                |                  ramping up as this is slow to process
                |                  individual clips)
               | 
               | [1] https://www.nvidia.com/en-us/data-center/products/a2/
        
               | coder543 wrote:
               | Isn't the A2 much weaker than a 3090? So those results
               | are promising.
               | 
               | EDIT: for what it's worth, Nvidia rated the A2 at 18
               | TFLOPS of FP16, and Apple rates the current A16 Neural
               | Engine at 17 TFLOPS of FP16. I'm sure it's not an "apples
               | to apples" comparison.
        
               | lunixbochs wrote:
               | If you count the GPU component and memory bandwidth, the
               | Apple M2 is slightly weaker on paper for 16-bit inference
               | than the NVIDIA A2, if you manage to use the whole chip
               | efficiently. The A16 is then slightly weaker than the M2.
               | 
               | Sure, the Whisper Tiny model is probably going to be fast
               | enough, but from my preliminary results I'm not sure it
               | will be any better than other models that are much much
               | faster at this power class.
               | 
               | Whisper Large looks pretty cool, but it seems much harder
               | to run in any meaningful realtime fashion. It's likely
               | pretty useful for batch transcription though.
               | 
               | Even if you hit a realtime factor of 1x, the model can
               | leverage up to 30 seconds of future audio context. So at
               | 1x, if you speak for 10 seconds, you'll potentially need
               | to wait another 10 seconds to use the result. This kind
               | of latency is generally unsatisfying.
        
               | coder543 wrote:
               | EDIT: After writing and posting the original version of
               | this comment, I did an experiment where I dictated it to
               | Siri, and then saved that audio (which was recorded
               | simultaneously), which I then fed to both Whisper's
                | tiny.en and medium.en... Siri did terribly for me.
               | Whisper tiny.en was 100% accurate, as far as I can tell,
               | and the only thing Whisper medium.en did was add a few
               | commas that tiny.en had missed. I actually ended up
               | playing the audio file for Siri as well, and that did not
               | end well either. YMMV, but even the tiny model seems very
               | useful. tiny.en took 17.5 seconds to process the ~1
               | minute audio file, and medium.en took 351 seconds, but I
               | think there is a lot of room for performance optimization
               | on this M2 MBA. The model evaluation was purely using the
               | CPU, not GPU or neural engine, and it wasn't even using
               | all of the CPU cores for whatever reason.
               | 
               | ----
               | 
               | With Siri dictation, I feel like I usually spend at least
               | as much time correcting its mistakes as I do speaking the
               | dictation itself. In some cases, that is still
               | faster/easier than typing, but I would rather have a
               | voice model that can work in about the same _total_
               | amount of time without requiring constant corrections. If
               | I speak for 30 seconds, then I can do other things for 30
               | seconds while my phone processes it... that might
               | actually be preferable if it gets it right. Otherwise,
               | I'll be spending 30 seconds actively editing it anyways.
               | Even an improvement on the number of edits required per
               | dictation would be nice. Admittedly, I feel like Google
               | and Microsoft _already_ do a much better job here.
               | 
               | It could be interesting to use the tiny model to give a
               | preview of the writing while the large model is taking
               | its time, and then allow the user to tap on words that
               | changed to see the predictions from the tiny model and
               | correct back to them if they want. I was doing some
               | experiments a few minutes ago, and on one audio clip, the
               | tiny model wrote down a very literal interpretation of an
               | uncommon sci-fi word, and that was more accurate than
               | either the medium or the large models. The rest of the
               | time, the larger models did better, as expected.
               | 
               | But, I don't know. This is interesting to me, but I agree
                | there could be issues with making it workable for real
               | time transcription.
        
             | MacsHeadroom wrote:
             | > I really wish there was an easy demo for Whisper that I
             | could try out.
             | 
             | Like the colab notebook linked on the official Whisper
             | github project page?
        
               | coder543 wrote:
               | Sure, but I did see one linked in another thread here on
               | HN after posting that comment.
        
           | The5thElephant wrote:
            | Good point about realtime or not. However, with ML I have
            | found the weaknesses get addressed pretty fast by someone.
           | There is a big step between proof of concept and practical
           | application though, so we shall see.
        
         | Kuinox wrote:
         | OpenAI is owned by Microsoft FYI.
        
           | neongreen wrote:
           | Is it? Googling suggests that Microsoft invested in OpenAI
           | but doesn't actually own it.
        
             | Kuinox wrote:
              | Oh, my bad. Looks like they only bought an exclusive license
              | to GPT-3.
        
         | fxtentacle wrote:
         | This AI has a 30 second delay on the audio processing because
         | it needs to be able to "look into the future" to get these good
         | results. That 30s delay would be unacceptable for
         | Siri/Google/Cortana.
        
           | coder543 wrote:
           | A lot of models we currently use seem to do the same thing.
           | The model will transcribe a "best effort" interpretation in
            | real time, then as you continue speaking, you'll see it
           | go back and make corrections. I'm sure you can feed the first
           | X seconds you have into the model, followed by (30-X) seconds
           | of silence, and it will do real time transcription just
           | fine... it would be weird if this broke anything. Then, as
           | you get more speech, you continue getting better
           | transcription of the first 30 seconds, then you switch to a
           | 30 second sliding window.
           | 
           | Maybe I'm missing something, but I don't see the problem
           | here.
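            | 
            | Something like this rough sketch is what I have in mind
            | (untested; assumes the openai-whisper package and float32
            | audio at 16 kHz accumulated from the microphone):
            | 
            |     import numpy as np
            |     import whisper
            | 
            |     model = whisper.load_model("base.en")
            | 
            |     def transcribe_partial(audio_so_far: np.ndarray) -> str:
            |         # pad_or_trim zero-pads (i.e. appends silence) out to the
            |         # model's fixed 30-second window, or trims anything longer
            |         audio = whisper.pad_or_trim(audio_so_far.astype(np.float32))
            |         mel = whisper.log_mel_spectrogram(audio).to(model.device)
            |         result = whisper.decode(model, mel, whisper.DecodingOptions(fp16=False))
            |         return result.text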
        
             | fxtentacle wrote:
             | Yes, that's because Whisper - like pretty much all of them
             | - uses a Transformer encoder with Attention layers. And the
             | Attention layers learn to look into the future.
             | 
             | And yes, what you describe could be done. But no, it won't
             | reduce latency that much, because the model itself learns
             | to delay the prediction w.r.t. the audio stream. That's why
             | ASR-generated subtitles usually need to be re-aligned after
             | the speech recognition step. And that's why there is
             | research such as the FastEmit paper to prevent that, but
             | then it is a trade-off between latency and quality again.
             | 
             | Also, running your "low-latency" model with 1s chunks means
             | you now need to evaluate the AI 30x as often as if you'd be
             | using 30s chunks.
        
               | coder543 wrote:
               | You just said the models pretty much all work the same
               | way, then you said doing what I described won't help. I'm
               | confused. Apple and Google both offer real time, on
               | device transcription these days, so _something_ clearly
               | works. And if you say the models already all do this,
                | then running it 30x as often isn't a problem anyway,
               | since again... people are used to that.
               | 
               | I doubt people run online transcription for long periods
               | of time on their phone very often, so the battery impact
               | is irrelevant, and the model is ideally running (mostly)
               | on a low power, high performance inference accelerator
               | anyways, which is common to many SoCs these days.
        
               | fxtentacle wrote:
               | I meant that most research that has been released in
               | papers or code recently uses the same architecture. But
               | all of those research papers use something different than
               | Apple and Google.
               | 
               | As for running the AI 30x, on current hardware that'll
               | make it slower than realtime. Plus all of those 1GB+
               | models won't fit into a phone anyway.
        
               | coder543 wrote:
               | > Plus all of those 1GB+ models won't fit into a phone
               | anyway.
               | 
               | I don't think that's a requirement here. I've been
               | playing with Whisper tonight, and even the tiny model
               | drastically outperformed Siri dictation for me in my
               | testing. YMMV, of course.
        
         | beastman82 wrote:
         | In my unmeasured empirical observation Google has amazing
         | speech recognition
        
           | jeffbee wrote:
           | I tried feeding the four examples from this announcement into
           | Google as dictation inputs and it just sits there blankly. On
           | the JFK speech test file in the repo, Google understands
           | perfectly. The samples in the announcement are clearly
           | outside the capabilities of anything Google has launched
           | publicly, but I don't know how that translates to overall
           | utility in every day applications.
        
           | The5thElephant wrote:
           | I agree they have the best compared to Apple, Amazon,
           | Microsoft. However I don't think it is as good as what is
           | being shown here by OpenAI.
        
             | Vetch wrote:
             | My experience with the APIs is Google is excellent and
             | Microsoft is slightly better. And the offline model I've
             | been using that's nearly as good as both is facebook's
             | wav2vec2-large-960h-lv60-self.
             | 
             | Don't believe what's on marketing pages, they rarely
             | transfer to the real world. Will have to make time to try
             | it and see. In theory, given task diversity and sheer
             | number of hours, it should be a lot more robust but will
             | wait on evidence before believing any claims on SoTA.
        
               | KingMob wrote:
               | Weird. I started working on an ASR SaaS in my spare time,
               | and at least on the test podcasts, Google was the _worst_
               | : https://www.sammaspeech.com/blogs/post/speech-
               | recognition-ac...
        
       | RockRobotRock wrote:
       | Dude, this is insane. This is so much better than other speech to
       | text libraries I've tried.
        
       | danso wrote:
       | This is an astonishing package. Every AI voice-to-text model I've
       | tried on "The Wire's" famous "fuck" scene [0] usually fails,
       | because the youtube clip's audio quality is bad and it's a scene
       | with virtually no dialogue except breathing and "Fuck". But
       | Whisper returned impressive results [1]
       | 
       | [0] https://www.youtube.com/watch?v=DS6pE88Xg3s
       | 
        | [1]
        |     $ yt-dlp --extract-audio --audio-format mp3 -o wire-fuck.mp3 https://www.youtube.com/watch?v=DS6pE88Xg3s
        |     $ whisper --language en wire-fuck.mp3
        |     [00:00.000 --> 00:02.000]  Oh
        |     [00:13.260 --> 00:15.260]  Fuck
        |     [00:15.260 --> 00:31.260]  Motherfucker
        |     [00:50.700 --> 00:52.700]  Fuck
        |     [00:52.700 --> 00:58.700]  Oh
        |     [00:58.700 --> 01:10.700]  Fuck
        |     [01:28.700 --> 01:55.900]  Fuck
        |     [02:02.340 --> 02:03.700]  Motherfuck.
        |     [02:10.220 --> 02:11.220]  Oh, fuck.
        |     [02:11.780 --> 02:12.780]  Oh, fuck.
        |     [02:25.900 --> 02:27.900]  Fuck, fuck, fuck, fuck, fuck, fuck.
        |     [02:27.900 --> 02:28.900]  Motherfucker.
        |     [02:32.900 --> 02:33.900]  Oh, fuck.
        |     [02:34.900 --> 02:35.900]  Fuck.
        |     [02:35.900 --> 02:36.900]  Oh, fuck.
        |     [02:36.900 --> 02:37.900]  Oh, fuck.
        |     [02:37.900 --> 02:38.900]  Oh, fuck.
        |     [02:48.900 --> 02:49.900]  Motherfucker.
        |     [02:53.900 --> 02:54.900]  Fucking A.
        |     [02:54.900 --> 02:56.900]  Mm hmm.
        |     [02:56.900 --> 03:12.900]  Fuck.
        |     [03:26.900 --> 03:28.900]  Motherfucker.
        |     [03:28.900 --> 03:32.900]  Fuck me.
        |     [03:58.900 --> 04:01.900]  Oh.
        |     [04:28.900 --> 04:34.900]  Fuck.
        
         | owenpalmer wrote:
         | nsfw
        
       | andy_xor_andrew wrote:
       | Hold on, it does not only speech recognition, but also language
       | translation, in the same model?
       | 
       | What an interesting approach. What benefits does this have over
       | having two dedicated models, one for speech-to-text, and another
       | for translation?
       | 
       | It just seems so odd, given the problems of speech-to-text and
       | Spanish-to-English seems so different from one another (in terms
       | of the problem domain). Seems so unusual to have both handled by
       | one model!
       | 
       | Does knowledge of speech-to-text carry over into knowledge of
       | translation? Does knowledge of translation carry over into
       | knowledge of speech-to-text? So weird.
        
         | ByThyGrace wrote:
         | Judging from the chart in their github README, Whisper performs
         | much better in parsing Spanish audio than any other language
         | and that in particular blows my mind. I would have expected
         | English to be at the top of any such model, it being such an IT
         | lingua franca.
         | 
         | Now I wonder if it works equally well with Spanish from Spain
         | (and its different regions) and Spanish from the New World (and
         | in its myriads of different flavours).
        
         | newhaus1994 wrote:
         | My understanding is that multi-modal models are the primary
         | focus of OpenAI right now, due to their stated goal of
         | achieving AGI. This product is probably better thought of as an
         | offshoot of their work to create a fully generalizable model,
         | rather than a specific attempt to provide
         | translation/transcription services.
        
         | beanlog wrote:
         | It sounds useful to me because you can use tone information to
         | help with the translation, which text-to-text translation can't
         | do. But I'm not sure if that's how this model actually works.
        
         | TaylorAlexander wrote:
         | It seems these days that language-oriented models are commonly
         | becoming multilingual by default. There are a lot of common
         | threads when understanding sentence construction between
         | different languages. French and English have different rules
         | but they will still have things like nouns, adjectives,
         | subjects, prepositions, etc. It seems that by training models
         | on many languages you get both a more robust understanding of
         | language, and it saves you the trouble of having to make many
         | more localized models for every language. I also believe that
         | the other languages help the models construct sentences in
         | languages which have very small training sets. If it has a few
         | examples in a rare language as well as good translations to a
         | better-known language, then it can provide good support for the
         | rare language.
         | 
         | We also see in image generation models that multi-modal
         | networks are more powerful than single purpose networks. As we
         | move towards more advanced AI systems I suspect we will see
         | more and more generalizable networks with distinct advantages
         | over separate networks that get plugged together.
        
           | magicalhippo wrote:
            | Would a multilingual model perhaps also be better at
            | understanding non-native speech?
        
             | TaylorAlexander wrote:
             | Good question but I don't know the answer.
        
       | thuttinger wrote:
       | I tried running it in realtime with live audio input (kind of).
       | 
       | If you want to give it a shot, you can find the python script in
       | this repo: https://github.com/tobiashuttinger/openai-whisper-
       | realtime
       | 
        | A bit more context on how it works: the system's default audio
        | input is captured with Python, split into small chunks, and then
        | fed to OpenAI's original transcription function. It tries
        | (currently rather poorly) to detect word breaks and doesn't split
        | the audio buffer in those cases. Given how the model is designed,
        | this isn't the most natural way to use it, but I figured it was
        | worth trying. It works acceptably well.
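        | 
        | The core loop is roughly this (a simplified sketch, not the
        | exact code in the repo; assumes the sounddevice package and a
        | 16 kHz mono input device):
        | 
        |     import numpy as np
        |     import sounddevice as sd
        |     import whisper
        | 
        |     SAMPLE_RATE = 16000   # what Whisper expects
        |     CHUNK_SECONDS = 2
        | 
        |     model = whisper.load_model("base.en")
        |     buffer = np.zeros(0, dtype=np.float32)
        | 
        |     with sd.InputStream(samplerate=SAMPLE_RATE, channels=1, dtype="float32") as stream:
        |         while True:
        |             chunk, _ = stream.read(SAMPLE_RATE * CHUNK_SECONDS)
        |             buffer = np.concatenate([buffer, chunk[:, 0]])
        |             buffer = buffer[-SAMPLE_RATE * 30:]   # keep at most one 30 s window
        |             # transcribe() accepts a raw float32 array sampled at 16 kHz
        |             print(model.transcribe(buffer, fp16=False)["text"])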
        
         | catfan wrote:
        
           | secret-noun wrote:
           | impressive
        
         | kkielhofner wrote:
         | Haven't tried it yet but love the concept!
         | 
         | Have you thought of using VAD (voice activity detection) for
         | breaks? Back in my day (a long time ago) the webrtc VAD stuff
         | was considered decent:
         | 
         | https://github.com/wiseman/py-webrtcvad
         | 
         | Model isn't optimized for this use but I like where you're
         | headed!
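          | 
          | Rough idea of how the VAD gate could look (sketch only;
          | webrtcvad expects 16-bit mono PCM in 10/20/30 ms frames):
          | 
          |     import webrtcvad
          | 
          |     vad = webrtcvad.Vad(2)   # aggressiveness 0 (lenient) to 3 (strict)
          |     SAMPLE_RATE = 16000
          |     FRAME_MS = 30
          |     FRAME_BYTES = SAMPLE_RATE * FRAME_MS // 1000 * 2   # 16-bit samples
          | 
          |     def is_pause(pcm: bytes) -> bool:
          |         """True if no frame in this chunk of 16-bit mono PCM has speech."""
          |         frames = [pcm[i:i + FRAME_BYTES]
          |                   for i in range(0, len(pcm) - FRAME_BYTES + 1, FRAME_BYTES)]
          |         return not any(vad.is_speech(f, SAMPLE_RATE) for f in frames)
          | 
          | Cutting the buffer only when is_pause() is true should help
          | avoid splitting mid-word.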
        
           | thuttinger wrote:
           | Interesting. I'll take a look at this, thanks!
        
             | Curiositry wrote:
             | Perhaps this could be adapted?
             | 
             | https://github.com/mozilla/DeepSpeech-
             | examples/blob/master/m...
        
       | Kirkman14 wrote:
       | I've been trying Whisper on my old setup (Mac Pro 2012 running
       | Mojave, with Radeon RX 580), and it's a pretty amazing tool.
       | 
       | Unfortunately my system is not ideal for today's AI tools.
       | Whisper runs only on the CPU, and it's slow.
       | 
       | I know PyTorch recently added Metal support, but only for M-based
       | Macs. Has anyone found a way to make it work with Intel Macs?
        
       | minimaxir wrote:
       | The model output can be tweaked to produce audio embeddings (akin
       | to BERT for text embeddings and CLIP for image embeddings), which
       | can lead to some _interesting_ applications as the previous two
       | examples have demonstrated.
        
         | FerociousTimes wrote:
         | What do you mean exactly by audio embeddings?
        
           | minimaxir wrote:
           | Represent a given set of audio inputs as a numeric vector,
           | which can then for example be finetuned for other ML/AI
           | problems or placed in an embeddings database for easy ANN
           | search with similar audio clips. In the extreme case it could
           | facilitate better AI audio generation similar to how CLIP can
           | guide a VQGAN.
           | 
           | Although the 30 second minimum input is a bit of a bummer
           | since it may not allow much granularity in the resulting
           | embeddings.
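            | 
            | Something along these lines (a rough sketch, not an official
            | API; it just mean-pools the output of the model's encoder):
            | 
            |     import numpy as np
            |     import torch
            |     import whisper
            | 
            |     model = whisper.load_model("base")
            | 
            |     def audio_embedding(path: str) -> np.ndarray:
            |         audio = whisper.pad_or_trim(whisper.load_audio(path))
            |         mel = whisper.log_mel_spectrogram(audio).to(model.device)
            |         with torch.no_grad():
            |             # encoder output is (1, n_frames, n_state); mean-pool over time
            |             features = model.encoder(mel.unsqueeze(0))
            |         return features.mean(dim=1).squeeze(0).cpu().numpy()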
        
       | lynguist wrote:
       | How can I use this (or something similar) for live translation? I
       | don't mind if there's a 30s delay.
       | 
       | As in I don't want to input a file, I want to input the
       | microphone sound.
        
         | agnos wrote:
         | Would also like to know this. It looks like they're processing
         | the audio file in 30 second chunks, so a naive approach of
         | keeping a buffer of 30-second input stream chunks and just
         | continually writing to an output .mp3 could work...
        
         | blueberrychpstx wrote:
         | Was wondering the same.
         | 
         | I really wish I would have been paying attention in Unix
         | class...
         | 
         | Something like `microphone | chunk 3s | whisper | stdout` would
         | be SO COOL!!! I think that's possible but too lazy to look
         | more.
        
       | spywaregorilla wrote:
       | Hmm are there any noteworthy open sourced speech to speech
       | models? Like transform a spoken line to another voice, copying
       | both the words spoken and the inflections?
        
       | cercatrova wrote:
       | Their Scottish accent example is pretty good, I'd like to see it
       | work on some very strong English accents like this one:
       | https://www.youtube.com/watch?v=nJ7QB3om-QY
        
         | homarp wrote:
         | Detected language: english
         | 
         | [00:00.000 --> 00:05.400] Gordy and County Kerry are
         | investigating the theft of up to 60 sheep on Mount Brandon.
         | 
         | [00:05.400 --> 00:10.400] One of the farmers is offering a
         | reward for information leading to the return of the use,
         | 
         | [00:10.400 --> 00:12.200] which are worth thousands of euro.
         | 
         | [00:12.200 --> 00:14.200] Well, I'm fine with that.
         | 
         | [00:14.200 --> 00:15.200] That's right.
         | 
         | [00:15.200 --> 00:16.200] Do you own them?
         | 
         | [00:16.200 --> 00:17.200] Anyone can say it.
         | 
         | [00:17.200 --> 00:18.200] Fine with that.
         | 
         | [00:18.200 --> 00:22.720] Last Saturday, Mikey Joe O'Shea
         | brought his flock of Scotch sheep down from the mountain
         | 
         | [00:22.720 --> 00:25.320] commonage ahead of lambing.
         | 
         | [00:25.320 --> 00:29.840] He discovered over 50 were missing,
         | allowing for a number of deaths and
         | 
         | [00:29.840 --> 00:30.840] strays.
         | 
         | [00:30.840 --> 00:34.600] Mikey is convinced over 45 sheep have
         | been stolen.
         | 
         | [00:34.600 --> 00:35.600] It was a good night.
         | 
         | [00:35.600 --> 00:36.600] It would be a full moon there.
         | 
         | [00:36.600 --> 00:37.600] It would be a good night.
         | 
         | [00:37.600 --> 00:38.600] It would be bright out.
         | 
         | [00:38.600 --> 00:40.600] There could be anyone going up in the
         | mountains.
         | 
         | [00:40.600 --> 00:41.600] It would be a good night.
         | 
         | [00:41.600 --> 00:43.600] Well, that was 45 sheep missing.
         | 
         | [00:43.600 --> 00:49.600] Mikey and the lambs and everything in
         | the sheep, they counted out a nice bit of money.
         | 
         | [00:49.600 --> 00:52.200] They've been doing the boat in
         | Nassan.
         | 
          | [00:52.200 --> 00:53.200] It's a big one.
          | 
          | [00:53.200 --> 00:54.200] It's a big one.
          | 
          | [00:54.200 --> 00:55.200] It's a big one.
         | 
         | [00:55.200 --> 00:59.000] Mikey's next door neighbor says some
         | of his sheep have also been stolen.
         | 
          | [00:59.000 --> 01:00.000] Come back.
          | 
          | [01:00.000 --> 01:01.000] Come back.
          | 
          | [01:01.000 --> 01:02.000] Come back.
         | 
         | [01:02.000 --> 01:03.000] I've been missing about 10 years.
         | 
         | [01:03.000 --> 01:04.000] It's not all that difficult.
         | 
         | [01:04.000 --> 01:06.320] All they've got to do is have a good
         | dog.
         | 
         | [01:06.320 --> 01:10.560] Have a good dog and go at night, some
         | moonshine night.
         | 
         | [01:10.560 --> 01:11.560] Just put the dog around him.
         | 
         | [01:11.560 --> 01:14.120] Put him on a trailer and walk him.
         | 
         | [01:14.120 --> 01:18.360] And then probably somebody else to
         | pick him up.
         | 
         | [01:18.360 --> 01:29.960] Everybody's doing it north, but he's
         | doing it.
        
           | hegemon8 wrote:
           | Wow!
        
           | cercatrova wrote:
           | Wow that is incredibly impressive. At 0:53 is it translating
           | as well? Didn't sound like English to me.
        
         | mod wrote:
         | Those are Irish.
        
         | angrais wrote:
         | Are you sure? I just ran some of Kimmy's sketches through it
         | and ... The results are garbage.
        
       | biggerChris wrote:
       | We have reached sentient mode.
        
       | howon92 wrote:
       | I just tested it on a few of my YouTube videos in Korean and it's
       | surprisingly good at transcription.
        
       | dom96 wrote:
        | This really makes me want to build an Amazon Echo/Google Nest/etc
       | replacement that's open hardware, open source and most
       | importantly recognises voice completely offline. I find that I
       | don't use these smart devices for much more than setting timers
       | anyway so this seems like an easy project.
       | 
       | I just wonder what system requirements Whisper has and whether
       | there are open source voice recognition models that are
       | specifically built for embedded devices.
        
         | solarkraft wrote:
         | Are you thinking about reimplementing Mycroft?
         | 
          | The Mycroft project has done a lot of cool and important work in the
         | field to ship an actual personal assistant product (stuff like
         | wake word detection).
        
           | dom96 wrote:
           | hah, of course someone had the idea already and executed on
           | it. But yeah, basically that but without the screen (probably
           | would go a long way to decrease the cost, $299 is pretty
           | steep for such a device)
        
             | MayeulC wrote:
             | Well, you can always install Mycroft on a Pi, or on your
             | computer.
             | 
             | Almond is also interesting as a voice assistant, though I
             | think it doesn't perform speech recognition itself.
        
             | sheepybloke wrote:
             | One thing they don't touch much on is the STT, as they use
             | models from third parties. You could definitely do
             | something that utilizes this model and then feeds the
             | tokens to some of their parsing code. I've been working on
             | something similar to this, but burned out around adding the
             | STT portion [0].
             | 
             | [0]: https://github.com/Sheepybloke2-0/trashbot - It was
             | called trashbot because the final implementation was going
             | to look like oscar the grouch in a trashcan displaying the
             | reminders.
        
         | suyash wrote:
          | This is only one side of the coin; you still need really good
          | models for speech synthesis, and then you have to get it all
          | working in near real time, ideally locally on-device.
        
           | ricopags wrote:
           | As far as TTS goes, Mycroft.ai[0] has released a decent
           | offline one.
           | 
           | [0]https://mycroft.ai/
        
         | MacsHeadroom wrote:
         | I really want all this too. The smallest model is ~80mb and the
         | largest is 3gb. Not sure about system requirements yet; but
         | models that small suggest this may be doable locally on a
         | single board computer.
         | 
         | Edit: According to this comment[0] the base model runs in real
         | time on an M1 CPU. The tiny model apparently decodes an audio
         | file twice as fast. These are promising results.
         | 
         | [0] https://news.ycombinator.com/item?id=32927360#32929739
        
           | dom96 wrote:
           | I'd be interested to see how well it performs on something
           | like an RPi. M1 is pretty beefy.
        
           | olao99 wrote:
            | To be more precise, the original comment said "M1 Max", which
            | is in itself significantly beefier than a bare "M1".
        
           | lunixbochs wrote:
           | For an offline (non-streaming) model, 1x realtime is actually
           | kind of bad, because you need to wait for the audio to be
           | available before you can start processing it. So if you wait
           | 10 seconds for someone to finish speaking, you won't have the
           | result until 10 seconds after that.
           | 
           | You could use really small chunk sizes and process them in a
           | streaming fashion, but that would impact accuracy, as you're
           | significantly limiting available context.
        
       | howon92 wrote:
       | I just tried it in a few Korean YouTube videos and it's
       | surprisingly accurate, to an extent where I would've thought it
       | was done by a human.
        
       | TOMDM wrote:
       | Given how robust it seems to be with fast speech, I wonder if you
       | could save cycles by speeding up the audio before feeding it in.
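        | 
        | e.g. something like this (untested sketch; ffmpeg's atempo
        | filter changes tempo without shifting pitch, and values between
        | 0.5 and 2.0 are safest):
        | 
        |     import subprocess
        |     import whisper
        | 
        |     def transcribe_sped_up(path: str, factor: float = 1.5) -> str:
        |         sped = "sped_up.wav"
        |         # time-compress the audio before handing it to the model
        |         subprocess.run(
        |             ["ffmpeg", "-y", "-i", path, "-filter:a", f"atempo={factor}", sped],
        |             check=True,
        |         )
        |         model = whisper.load_model("base")
        |         return model.transcribe(sped, fp16=False)["text"]
        | 
        | Whether accuracy survives the compression is the open question.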
        
       | eatsyourtacos wrote:
        | Can this be used for real-time transcription, or is it too slow
        | for that?
       | 
       | Curious what anyone is using these days for a real-time
       | transcription. It doesn't have to be perfect, but just good
       | enough.
       | 
        | My kids watch some YouTube videos where people make a mod that
        | converts their speech to text, then looks for keywords and spawns
        | a boss in Terraria if you say the wrong keyword, etc.
        | 
        | I made a clone of that with the .NET System.Speech.Recognition
        | library. It... works... but my biggest problems are that #1 it
        | waits until you are done speaking to convert to text on the
        | callback, so there was too much of a delay for it to be fun (the
        | point is that it will be checking a stream of chatter), and #2
        | the recognition is pretty crap. I mean, it's nearly good enough
        | for my silly purpose, but it's still pretty bad.
        
         | blueberrychpstx wrote:
         | If your family uses Apple devices, Apple offers free on-device
         | speech recognition. Only caveat is that it needs to be
         | restarted every minute due to whatever stupid limitation (or
         | bug) they've introduced.
         | 
         | https://developer.apple.com/documentation/speech/recognizing...
         | 
         | Also, see `requiresOnDeviceRecognition`
        
           | [deleted]
        
         | [deleted]
        
         | nshm wrote:
         | Try https://github.com/alphacep/vosk-
         | api/blob/master/csharp/demo...
        
         | jayavanth wrote:
         | thuttinger posted in this thread:
         | https://github.com/tobiashuttinger/openai-whisper-realtime
        
         | whimsicalism wrote:
         | It might require too much work for what you are looking for,
         | but the wav2letter library is the best real-time transcription
         | OSS I have found by a considerable margin.
        
           | davidzweig wrote:
           | Out of interest, did you try Nemo?
           | https://github.com/NVIDIA/NeMo
        
             | whimsicalism wrote:
              | No. I don't think it had streaming capabilities when I was
              | doing this test two years ago, although I see it does now.
        
         | NaturalPhallacy wrote:
          | I tried it out and it's way too slow on my machine, which is no
          | slouch (Ryzen 9 5950/RTX 3080).
         | 
         | It's doing seconds of translation per minute for me at least.
        
         | TaylorAlexander wrote:
         | The base model seems to run faster than real time on my
         | machine. The "medium" model is larger and runs more slowly -
         | roughly real time or maybe slightly slower.
        
         | suyash wrote:
         | Depends if you're trying to run it offline or over the cloud.
        
       | dot1x wrote:
       | That's all good and great, now please do OCR...
        
       | tgtweak wrote:
       | Good to see them releasing model weights - hopefully now that
       | Stable Diffusion is out they will release Dall-E 2 source and
       | weights as well.
        
       | knaik94 wrote:
        | I got some super weird results with the 'medium' model and
        | language Japanese (with --task translate). The song is False
        | Sympathy by Mondo Grosso.
        | 
        | "[01:17.000 --> 01:32.000] Translated by Releska" appears when
        | translating to English. That entire part of the song is
        | instrumental. This line does not appear at all in the plain
        | transcription, only in the opus-format rip.
       | 
       | It shows up in the yt rip in format 251 (opus), but not in format
       | 140 (aac from youtube), nor the flac rip. All three are giving
       | different results.
       | 
        | The translation quality seems tied to bitrate: the same song
        | converts to different words, with the only difference being
        | bitrate and format. Converting my own rip with the same
        | parameters as the YouTube one (opus @140 and then @130) didn't
        | let me reproduce this error.
       | 
        | The model hung for a solid extra minute at the end when
        | translating to English: the last 90-ish seconds of the song took
        | about 60 seconds of real time, while the entire rest took about
        | 90. The same behavior was not observed with plain transcription.
       | 
       | Some of the english words are incorrect but that was expected.
       | The first Japanese "mistake" I found was "Quan tehaEr Ren no"
       | instead of "subeteha hutarino". With the left being what whisper
       | wrote. A single random word "hey" was transcribed/translated to
       | english even though it's the singer elongating the Yuan  while
       | singing the Le Yuan . "Luo chiteyuku Er Ren deXi garetaEr Ren
       | noragu HEY" instead of "Luo chiteiku Suo detsunagareta Er Ren
       | noLe Yuan " .
       | 
       | I am using the official subtitles released on the youtube video.
       | 
        | It's a complex Japanese song with both Japanese and English, and
        | the original transcription took about 20 real-time seconds to
        | produce the first line, 130 seconds for the whole song. It seems to
       | be showing results in 20 second window increments, but this seems
       | to depend on what it considers audio and what it is throwing
       | away.
       | 
        | On my computer I wasn't able to use the large model because I ran
        | out of VRAM (I have 8 GB; not sure how much more it'd require), so
        | I ran it with medium.
       | 
        | The song is False Sympathy by Mondo Grosso. The MV is suggestive,
        | in case that matters. I grabbed a fresh audio rip from YouTube
        | because I didn't want to take it out of my CD case.
       | 
       | https://www.youtube.com/watch?v=B6Y-WsgpzlQ
       | 
       | It is translating this version differently from the director's
       | cut version. I ripped both as opus.
       | 
       | There is something weird about how it is handling the opus
       | encoded version, as I find the same "Translated by Releska" in a
       | wav version transcoded from the opus.
        
         | adeptima wrote:
          | Japanese output will contain a lot of tiny mistakes. However, the
          | output as a whole is still good enough. Like 95%-plus good enough.
          | 
          | I found a lot of mistakes in 3-4 character kanji compounds... and
          | I guess most native Japanese speakers make similar mistakes from
          | time to time too, which I guess is why they pop so many keywords
          | up on screen with all kinds of highlighting to avoid
          | second-guessing.
        
       | Gazoche wrote:
       | Pretty cool, and it seems to work on AMD GPUs as well. I've just
       | tried it on my RX6800 with the ROCm build of PyTorch.
        
       | amrrs wrote:
       | Here's a live demo on Hugging Face Spaces if you want to try -
       | https://huggingface.co/spaces/Amrrs/openai-whisper-live-tran...
        
         | coder543 wrote:
         | I've tried speaking to that demo several times... I used the
         | built in feature to record from microphone, and I played back
         | the samples to make sure they were audible and clear.
         | 
         | Sometimes it outputs the words "thank you" (which I did not
         | say), sometimes it outputs a period. It never once output
         | anything I said. It seems completely broken.
         | 
         | EDIT: apparently something about the combination of
         | Safari+HF+Whisper was not working. I tried another Whisper demo
         | on HF and had the same results. Switching to Chrome made it
         | work flawlessly... I have no idea what kind of codec
         | incompatibility was happening.
        
         | clemnt wrote:
         | this is amazing! got it working in French too
        
       | TaylorAlexander wrote:
       | Hey this looks great! I like to record audio notes while driving
       | in my car after work, to kind of decompress my thoughts from the
       | day. But I never go back and listen as they can be long and
       | meandering. Sometimes in the audio log I will sum up my thoughts,
       | but this might be 20 minutes in and hard to find. I really wish I
       | had transcriptions so I could easily scan the full contents. I
       | have tried Mozilla Deepspeech (I don't want a cloud solution) and
       | I was surprised to find that I could not get Deepspeech to
       | reliably transcribe them. There is a bit of road noise, though I
       | think for a human listener they are easy to understand. It looks
       | like this one might actually do the trick!
       | 
       | EDIT: Tried it and it worked great! It is very easy to use. I
       | just did the pip install line in the readme and was ready to go.
       | You literally just run the one pip install line, and then you run
       | the program in the format "whisper my_audio.wav" and it goes.
       | Really nice job OpenAI!
        
         | zhynn wrote:
         | I do this too! I have been doing it for about a year now, and
         | haven't ever run into someone else that does this kind of
         | audio-journaling. Would you be up for comparing notes sometime
         | about how it is working out for you? I am finding that it is
          | an extremely effective form of self-care, but with lots of
         | personal caveats. I would be so interested to hear your
         | experience.
        
           | blueberrychpstx wrote:
           | Count me in!! Working on tools actually to turn these
           | transcriptions into something more social
        
           | tekacs wrote:
           | I do this too, and I've built some software for it just for
           | myself.
           | 
           | I'd love to chat and hear about how you use this! My email is
           | in my profile, or I'm @tekacs on Twitter (and everywhere). :)
        
           | TaylorAlexander wrote:
           | Oh cool! Yeah I have stopped doing it lately as I was not
           | really using them (I would like to use them for making rough
           | notes for future youtube video scripts), though in general it
           | does seem like good self care too even if I don't review
           | them. That said I just tried the base model on one of my
           | voice logs and it was pretty good! Trying the medium model
           | now and it seems basically perfect. So I will have to start
           | doing these logs more!
           | 
           | Anyway I am pretty terrible with email but short exchanges
           | can work for me, or maybe we can connect over signal. Send me
           | a message to my email in my profile and I would be happy to
           | sync up!
        
         | Snitch-Thursday wrote:
         | Google's recorder app for android will let you record audio
         | files and make some transcriptions, right on the device.
        
           | olao99 wrote:
           | Google's recorder app is NOT available for most phones. Only
           | Pixels and a couple of other selected handsets
        
           | Tenoke wrote:
            | I just tested it and it was pretty mediocre, at least with my
            | accent. I could definitely benefit from a decent app for quick
            | note recording: press a button -> transcribe -> upload to
            | gdrive, with a good UI for later grepping.
        
             | TaylorAlexander wrote:
             | Was this with the default base model, or the medium or
             | large model? This can be specified with the --model flag.
        
               | Tenoke wrote:
               | I meant the 'Google's recorder app' from the parent
               | comment and not Whisper.
        
               | TaylorAlexander wrote:
               | Ah right, sorry got my comment threads mixed up! Someone
               | else was asking about performance with accented English
               | speakers in another comment.
        
           | capableweb wrote:
           | Is that application actually doing on-device transcription?
           | Under "Data safety" on the Google Play page it says "This app
           | may share these data types with third parties: Audio" which
           | doesn't exactly instill confidence that my audio will 100%
            | always stay on my device. It also says "Data is encrypted in
            | transit", but if data stays on the device, why does it have to
            | be "encrypted in transit"? There should be no transit at all.
        
             | bruckie wrote:
             | Yes, it works completely offline, including transcription
             | and recognition of music. There's an optional cloud sync
             | feature, which I assume is the reason for the notice on
             | Google Play.
             | 
             | (Work for Google, don't speak for them.)
        
               | capableweb wrote:
                | Thanks. Who's the third party that might get access to
               | the audio? First party would be me, second party would be
               | Google and then the third?
        
               | zed1726 wrote:
        
               | bruckie wrote:
               | I think it's just Google for backup, or other apps via
               | Android's standard sharing sheet. You can read the
                | details here:
                | https://support.google.com/pixelphone/answer/9516618?hl=en
        
         | petercooper wrote:
         | I'll probably explore using this, but I've used an app called
         | Just Press Record to do what you say. Runs on Apple Watch too,
         | so you can tap a complication at any time in the day, speak,
         | and you get a transcript on your phone, etc.
        
       | anigbrowl wrote:
       | Oh nice - I have an immediate use case for this. This looks
       | accessible enough that the sci-fi dream of instantaneous audio
       | translation is suddenly within reach.
        
       | petercooper wrote:
       | Just tested this on some developer podcasts which usually fail
       | hard given they're full of technical jargon, brand names, etc.
       | Whisper is a revolution! It's picking up terms like Heroku,
       | DigitalOcean, GitHub, ECS, AWS, etc. and capitalizing properly -
       | something nothing else did unless you provided a whole pile of
       | guiding vocabulary.
        
         | ma2rten wrote:
         | Did these podcasts have transcripts? You might be inadvertently
         | evaluating it on data that it was trained on, which is
         | basically cheating. Even if not, it might be trained on similar
         | podcasts. Judging how good these kinds of models are is really
         | hard.
        
           | petercooper wrote:
           | No transcripts, no. And recent episodes, within the past
           | couple of weeks, so probably not part of the training either.
        
           | WiSaGaN wrote:
           | True. The test should only be done on the material released
           | _after_ the model.
        
       | code51 wrote:
       | First off, it seems that the model can easily run on M1/M2 with
        | minor modification. However, the `aten::_index_put_impl_` operator
        | is currently not supported on MPS, and the fallback always slows
        | things down quite a lot.
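        | 
        | For reference, the MPS attempt looks roughly like this (a sketch;
        | PYTORCH_ENABLE_MPS_FALLBACK=1 routes unsupported ops like the one
        | above to the CPU instead of erroring out):
        | 
        |     import os
        |     os.environ["PYTORCH_ENABLE_MPS_FALLBACK"] = "1"  # set before importing torch
        | 
        |     import torch
        |     import whisper
        | 
        |     device = "mps" if torch.backends.mps.is_available() else "cpu"
        |     model = whisper.load_model("base", device=device)
        |     print(model.transcribe("clip.wav", fp16=False)["text"])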
       | 
       | Second, is there a bug with how the script processes incoming
       | audio segments? For a short 4 second clip, what I got was:
       | 
       | > [00:00.000 --> 00:03.760] Okay, Eunice, travel plans. I need to
       | be in New York on Monday, L.A. on Tuesday, New York on Wednesday,
       | L.A. on Thursday. You're knocking Friday. Got it?
       | 
       | > [00:03.760 --> 00:28.760] Got it.
       | 
       | However the final segment should have been shy of 1 second. It
       | mistakenly thinks the last segment was 25 seconds long and makes
       | you wait for processing.
        
       | Jnr wrote:
       | Cool!
       | 
        | I am one of the top contributors to the tiny Mozilla Common Voice
        | dataset for my language. The dataset is very small compared to
        | those for other popular languages, and none of the other datasets
        | mentioned contribute anything in that language to Whisper's
        | training.
        | 
        | And even with so little data to train on, it still works
        | surprisingly well.
        
         | catfan wrote:
         | [zalgo redacted]
        
           | dang wrote:
           | Hey - can you please not zalgo on HN? It messes up the
           | threads. I've redacted it from your posts now.
        
         | archon1410 wrote:
         | Where do they mention what datasets they've used? I've tried
         | looking at the paper but can't find it.
        
           | archon1410 wrote:
           | Nevermind: I found it. It's on page 19 and 20 of the paper,
           | under Appendix A ("Evaluation Datasets").
        
       | jdmoreira wrote:
        | Looking forward to seeing if this works well with foreign accents
        
         | mminer237 wrote:
         | They have an example in the post with a very thick Scottish
         | accent. You should listen to it. It's pretty impressive.
        
       | localy wrote:
       | Are there any published benchmarks available outlining how this
       | compares to other open source ASR software, such as Coqui.ai?
        
       | NaturalPhallacy wrote:
       | This is pretty incredible! https://i.imgur.com/03UFGc8.gif
        
       | jjwiseman wrote:
       | I'm seeing some weird bugs. For example, in one 30 minute mp3,
       | about 6 minutes in it decided that someone said "2200." And then
       | exactly 5.000 seconds later, "2200". And every 5.000 seconds
       | after that, for the next 24 minutes. (No one actually repeated
       | "2200" for 24 minutes.)
       | 
       | A second run gave better results, but in most runs I do see
       | instances where phrases repeat from 2-20 times.
        
       | bickett wrote:
       | Hard to keep up with all the great things. The AI community is
       | really moving quick right now.
        
       | aidenn0 wrote:
        | For those on NixOS, here's a quick and dirty flake.nix that will
        | let you make a venv in which to "pip install".
        | 
        | Just put it in a flake.nix, and "nix develop" followed by
        | "virtualenv ./venv; . ./venv/bin/activate; pip install
        | git+https://github.com/openai/whisper.git"
        | 
        |     {
        |       description = "Python 3.9 development environment";
        |       outputs = { self, nixpkgs }:
        |         let
        |           system = "x86_64-linux";
        |           pkgs = import nixpkgs { inherit system; };
        |         in {
        |           devShells.${system}.default = pkgs.mkShell {
        |             buildInputs = [
        |               pkgs.ffmpeg
        |               pkgs.python39
        |               pkgs.python39Packages.pip
        |               pkgs.python39Packages.numpy
        |               pkgs.python39Packages.pytorch
        |               pkgs.python39Packages.virtualenv
        |             ];
        |           };
        |         };
        |     }
        
         | aidenn0 wrote:
         | This should, in theory, work with CUDA; my GPU doesn't have
         | enough RAM to do it (it runs out at 2.9GiB allocated, I have
         | 4GiB, but am running a compositing desktop, which chews up
         | about 600MiB; not sure where the other ~400MiB went)
         | 
         | [edit]
         | 
         | I confirmed CUDA worked with the "small" model, which used
         | 3.3GB of GPU ram, and resulted in _much_ poorer recognition
         | than the  "medium" model on my CPU (but it ran at least two
          | orders of magnitude faster).
          | 
          |     {
          |       description = "Python 3.9 development environment";
          |       outputs = { self, nixpkgs }:
          |         let
          |           system = "x86_64-linux";
          |           pkgs = import nixpkgs {
          |             inherit system;
          |             config.allowUnfree = true;
          |             config.cudaSupport = true;
          |           };
          |         in {
          |           devShells.${system}.default = pkgs.mkShell {
          |             buildInputs = with pkgs; [
          |               cudatoolkit
          |               linuxPackages.nvidia_x11
          |               cudaPackages.cudnn
          |               libGLU libGL
          |               xorg.libXi xorg.libXmu freeglut
          |               xorg.libXext xorg.libX11 xorg.libXv xorg.libXrandr zlib
          |               ncurses5 stdenv.cc binutils
          |               ffmpeg
          |               python39
          |               python39Packages.pip
          |               python39Packages.numpy
          |               python39Packages.pytorch-bin
          |               python39Packages.virtualenv
          |             ];
          |             shellHook = ''
          |               export LD_LIBRARY_PATH="${pkgs.linuxPackages.nvidia_x11}/lib"
          |             '';
          |           };
          |         };
          |     }
        
           | magicalhippo wrote:
           | CUDA worked fine with large on my 2080Ti FWIW. The speedup is
            | ridiculous, as expected. My Ryzen 3800X took almost an hour to
            | transcribe a minute's worth of speech, while the 2080Ti does
            | it in like 10-20 seconds.
        
             | aidenn0 wrote:
             | How much GPU ram did it use?
        
               | magicalhippo wrote:
               | I'm on Windows, using Task Manager, the dedicated GPU
               | memory went from 1GB before run to about 9.8GB for the
               | most time during run, peaking at 10.2GB. So pretty close
               | to the 11GB limit of my 2080Ti it seems.
        
       | BasilPH wrote:
       | Any opinions on what this means for speech-to-text companies like
        | rev.ai and assembly.ai?
       | 
        | We've tested open source solutions for S2T, like Kaldi, but the
       | quality was not good enough. However, one of the main advantages
       | of a service like assembly.ai to me was that they offer sentence
        | splitting in the form of punctuation, and speaker detection, which
       | Kaldi does not.
       | 
       | So I guess I answered my own question to some degree: A S2T
       | service is more than just S2T. We already see assembly.ai add
        | more and more features (like summarisation, PII redaction, etc.)
       | that are a value-add to plain S2T.
       | 
       | Still, curious to hear what your take on that is.
        
         | nshm wrote:
          | You can apply the public punctuation model from Vosk on top of
          | Kaldi output, and you can also get speaker labels with existing
          | open source software.
         | 
          | On a quick video transcription test this model is more accurate
          | than AssemblyAI and Rev AI. It will be harder for them to sell
          | pure ASR now. Some more business-oriented applications will
          | still be important though, for example ASR as part of a
          | call-center analytics solution or as part of a medical ERP
          | system.
         | 
          | The value of automatic summarization is small; without AI it is
          | very hard to get right, since you need to be an expert in the
          | field to understand what is important.
        
           | phren0logy wrote:
           | Rev AI will also create a transcription separated by multiple
           | speakers, which it doesn't appear Whisper can do (yet). I
           | expect that Whisper will overtake the alternatives soon,
           | given that it's open source, but today it's not there yet.
        
       | adeptima wrote:
        | Japanese results look pretty impressive!
       | 
        | Took a news clip titled "14 sperm whales washed ashore, Australia"
        | (September 21, 2022):
        | https://www.youtube.com/watch?v=bZkNIzeRBk4
       | 
       | Extracted audio with youtube-dl -f bestaudio
       | https://www.youtube.com/watch\?v\=bZkNIzeRBk4
       | 
       | Converted into [00:00.000 --> 00:13.000] osutorariaNan Bu noDao
       | de, Zhen tsuXiang kuzira14Dong gaHai An niDa chiShang gerareteSi
       | ndeirunogaJian tsukari, Zhuan Men Jia gaDiao Cha notameYuan Di Ru
       | rishimashita.  [00:13.000 --> 00:25.000] Yuan Di
       | medeianiyorimasuto, osutorariaNan Bu nokinguDong de, 19Ri , Shao
       | nakutomo14Dong noZhen tsuXiang kuziragaHai An niDa chiShang
       | gerareteSi ndeirunogaJian tsukarimashita.  [00:25.000 -->
       | 00:31.000] hotondogaRuo iosutowoJian rare, Zhuan Men Jia gaXian
       | Chang niZhong mukiDiao Cha niDang tatsuteimasu.  [00:31.000 -->
       | 00:41.000] kuziranoSi Hai haDa kikuYun ndariMai
       | metarisurukotogaNan shiitame, Zi Ran niFen Jie sarerunowoDai
       | tsuFang Zhen gaJian Tao sareteimasu.  [00:41.000 --> 00:52.000]
       | mata, Si Hai woJu i, samegaHai niJi maruKe Neng Xing
       | gaarutoshite, Yuan Di Dong Ju hasahuanadoniZhou Wei niJin
       | dukanaiyouniHu bikaketeimasu.  [00:52.000 --> 01:02.000] Yi Fang
       | , 21Ri nihatasumaniaDong deoyoso230Dong nokuziragaBang Bian niDa
       | chiShang geraretaZhuang Tai deJian tsukarimashita.  [01:02.000
       | --> 01:07.000] oyosoBan Shu gamadaSheng kiteiruMo Yang deJi Zhu
       | Huo Dong gaJin merareteimasu.  [01:07.000 --> 01:23.000] Jian
       | tsukatsutanoha, gondokuziranoZhong Jian toJian rareteimasu.
        
         | knaik94 wrote:
         | Did you try translating them to english? I want to see if you
         | get a similar error as me with a random phrase "Translated by
         | Releska" showing up.
        
           | lynguist wrote:
            | It's called hallucination. Because the model is trained on
            | weakly supervised web data, such errors can occasionally
            | happen. The model picks up that such phrases occur in
            | translations and inserts them even if they do not appear in
            | the source. This is described in the paper.
        
             | knaik94 wrote:
             | I came across it during a silent/instrumental portion in
             | the song I was testing. I asked only because I am curious
             | how frequently the error might show up, I don't expect it
             | to be very common. It's looking at phrase level instead of
             | word level timestamps which is going to make it hard to
             | tokenize music. I asked simply because the parent comment
             | also tested on Japanese.
        
         | gzer0 wrote:
         | Shocked at how good the results are, and how easy of an
         | installation it is.
         | 
          | Here are the exact steps to follow to get it running on Ubuntu
          | 22.04 via WSL and yt-dlp:
          | 
          |     1. pip install git+https://github.com/openai/whisper.git
          |     2. yt-dlp -f 'ba' -x --audio-format mp3 https://www.youtube.com/watch/?v\=bZkNIzeRBk4
          |     3. renamed the file to test.mp3
          |     4. whisper test.mp3 --language Japanese --task translate --model large
         | 
         | Note: the large model will download a ~3Gb file
        
           | NaturalPhallacy wrote:
           | I did something similar (my ytdl is ytdlp too). You don't
           | even have to grab just the audio, it'll take a webm:
           | https://i.imgur.com/03UFGc8.gif
           | 
           | Amazing work.
        
             | adeptima wrote:
             | Because ffmpeg is used under the hood
             | 
             | https://github.com/openai/whisper/blob/main/requirements.tx
             | t
             | 
             | it should process most formats.
        
           | adeptima wrote:
           | The "--model large" option produces much better results, at
           | the cost of higher resource consumption.
        
       | tullie wrote:
       | Great to see OpenAI finally being open :)
        
       | simmanian wrote:
       | Could someone tell me whether it's possible to somehow feed data
       | into this project to improve its translation and transcription
       | capabilities on our own?
        
       | nicholasjarnold wrote:
       | This is so cool! I was just speaking to a non-technical family
       | member about privacy concerns around using "OK Google" and the
       | like. They responded inquiring about "private" alternatives, to
       | which my answer was "I'm not aware of good ones that give you
       | that level of accuracy and convenience."
       | 
       | Perhaps this development along with continued optimization and
       | device compute power increases will lead us into a near-future
       | where things like Mycroft devices and cellphones could have
       | local-only speech-to-text and translation capabilities which are
       | accurate even with environmental background noise variations
       | encountered IRL.
       | 
       | Great work OpenAI team!
        
       | runlevel1 wrote:
       | I ran it on some fire department radio recordings from scanners
       | on Broadcastify. It did remarkably well.
       | 
       | For reference, GCP's Speech-to-Text didn't detect any speech from
       | this clip -- even when using the enhanced phone model.
        
       | mwlp wrote:
       | Super impressive. I tested it on a Japanese streamer whose
       | enunciation isn't exactly perfect and it did a decent job:
       | https://www.youtube.com/watch?v=ROiOU1scaNA
       | [00:00.000 --> 00:06.500]  Since the last one started, the number
       | of times I've eaten has decreased.       [00:06.500 -->
       | 00:11.000]  If I get too carried away with the last one, I'll get
       | hungry and do it.       [00:11.000 --> 00:14.500]  I don't have
       | time to eat.       [00:15.500 --> 00:18.000]  I'm going to eat
       | now.       [00:20.000 --> 00:23.000]  It's going to take about 10
       | minutes from here.       [00:23.000 --> 00:31.000]  It's been a
       | while since I've had my last meal.       [00:31.000 -->
       | 00:36.000]  I feel like I'm losing myNu Zi Li .       [00:36.000
       | --> 00:39.000]  I have to go back to my original self.
       | [00:39.000 --> 00:44.000]  I have to get ready and go to bed.
       | [00:44.000 --> 00:46.000]  It's not good.       [00:46.000 -->
       | 00:51.000]  I've been drinking a lot lately, so I'm going home.
       | [00:51.000 --> 00:53.000]  I have to get my nails done this fall.
       | [00:53.000 --> 00:54.000]  Halloween nails.       [00:54.000 -->
       | 00:57.000]  Halloween, Halloween, Halloween.       [00:57.000 -->
       | 00:59.000]  I'm going to the beauty salon today.       [00:59.000
       | --> 01:02.000]  I'm going to get my nails done the day after
       | tomorrow.       [01:02.000 --> 01:10.000]  I used to look at a
       | lot of clothes, but I stopped looking at them.       [01:10.000
       | --> 01:12.000]  I'm going crazy.       [01:12.000 --> 01:22.000]
       | My stomach's stopped in the middle of summer.
        
         | alach11 wrote:
         | How long until this gets implemented in Twitch? Real-time
         | subtitles for any stream in the language of your choice?! That
         | would be huge.
        
         | adeptima wrote:
         | translation is not the strongest part. transcription looks very
         | good.
        
         | magicalhippo wrote:
         | It's struggling with Norwegian. Which I guess isn't shocking.
         | The large model performs a fair bit better than the small,
         | though neither is "good".
         | 
         | Though I assume the amount of Norwegian it has been exposed to
         | is fairly limited, so in that light I'm actually impressed as
         | well.
         | 
         | I tried it on a news segment from the radio[1], this is the
         | large model output:                   [00:14.000 --> 00:17.200]
         | En skamlos krenking av FN pakten.         [00:17.200 -->
         | 00:24.000]  USAs president og verdensledere svarer pa den
         | russiske presidentens atomtrusler og krigsmobilisering.
         | [00:25.500 --> 00:29.400]  Arbeidsklaer som er ment til a vaere
         | til begge kjonn, har det med a vaere tilpasset.
         | [00:29.400 --> 00:33.400]  Men hvordan ville det gatt, om det
         | var motsatt?         [00:34.100 --> 00:38.900]
         | Dyrevernsorganisasjon vil ha digital merking av regnstyr,
         | [00:38.900 --> 00:44.900]  men naeringen selv insisterer pa den
         | gamle tradisjonsrike maten med rissing av kniv.
         | [00:45.600 --> 00:51.400]  Mange stromselskaper er positive til
         | a tilby kundene fastpris pa strom, og det arevis.
         | [00:51.400 --> 00:59.900]  Da risikerer de a matte betale mye i
         | nettopp aretsvis, sier aktorer som aldri tilbyr fastpris.
         | [00:59.900 --> 01:21.900]  Dette er onsdagens Dagsnytten. Jeg
         | heter Espen As.
         | 
         | For reference, here's what he actually said, from the source[1]
         | itself:                   * En skamlos krenking av FN-pakten.
         | USAs president og verdensledere svarer pa den russiske
         | presidentens atomtrusler og krigsmobilisering.         *
         | Arbeidsklaer som er ment a vaere til begge kjonn, er som regel
         | tilpasset ... menn. Hvordan hadde det gatt om det var motsatt?
         | * Dyrevernsoganisasjon vil ha digital merking av reinsdyr, men
         | naeringen selv insisterer pa den gamle tradisjonsrike maten med
         | rissing av kniv.         * Mange stromselskaper er positive til
         | a tilby kundene fastpris pa strom - og det i arevis.         -
         | Da risikerer de a matte betale mye i nettopp; arevis, sier
         | aktor som aldri tilbyr fastpris         Dette er onsdagens
         | Dagsnytt 18 - jeg heter Espen Aas.
         | 
         | The translation didn't fare that well though:
         | [00:14.000 --> 00:17.000]  A shameless violation of the UN
         | treaty.         [00:17.000 --> 00:24.000]  The US president and
         | world leaders respond to the Russian president's nuclear
         | threats and war mobilization.         [00:24.000 --> 00:33.000]
         | Work clothes that are meant to be for both genders have to be
         | suitable, but how would it be if it was the other way around?
         | [00:34.000 --> 00:44.000]  The animal welfare organization will
         | have a digital marking of reindeer, but the industry itself
         | insists on the old traditional way of tearing a knife.
         | [00:45.000 --> 00:51.000]  Many electricity companies are
         | positive in offering customers fixed electricity prices, and
         | that is annual.         [00:51.000 --> 00:58.000]  Then they
         | risk having to pay a lot in just a year, says an actor who has
         | never offered fixed prices.         [00:58.000 --> 01:20.000]
         | This is Wednesday's Dagsnytt 18. My name is Espen As.
         | 
         | For reference, here's Google Translate's attempt, which is
         | pretty good:                   * A shameless violation of the
         | UN Charter. The US president and world leaders respond to the
         | Russian president's nuclear threats and war mobilization.
         | * Work clothes intended for both sexes are usually adapted to
         | ... men. How would it have gone if it had been the other way
         | around?         * Animal welfare organizations want digital
         | marking of reindeer, but the industry itself insists on the
         | old, traditional way of marking with a knife.         * Many
         | electricity companies are positive about offering customers a
         | fixed price for electricity - and for years.         - Then
         | they risk having to pay a lot in precisely; for years, says a
         | player who never offers a fixed price         This is
         | Wednesday's Dagsnytt 18 - my name is Espen Aas.
         | 
         | [1]:
         | https://radio.nrk.no/podkast/dagsnytt_atten/l_5ce3e323-97a3-...
         | (not sure if it's available outside of Norway)
        
           | perlgeek wrote:
            | Everything (and everyone, including myself :D ) seems to
            | struggle with Norwegian; it seems the corpus size is simply
            | too small. And/or maybe the market is.
           | 
           | Deepl didn't do any Norwegian last I looked, even though it
           | does most other Germanic languages (including Danish and
           | Swedish).
           | 
           | Duolingo doesn't have a Norwegian class for Germans either,
           | though they do have one with English as the source language.
        
           | olao99 wrote:
           | How are you getting the transcription of the NRK episode? I
           | am learning Norwegian and often struggle to find reliable
           | transcriptions for audio where the text exactly matches the
           | audio (often subtitles are heavily edited compared to what's
           | actually being said)
        
             | magicalhippo wrote:
             | The stuff I quoted was listed as an abstract of sorts for
             | the episode. I know NRK is very good at providing subtitles
             | for their TV productions, but as you say they're
             | abbreviated.
             | 
             | I'm guessing maybe audio books along with the actual books
             | would be the best source for such? I mean there's Mozilla
             | Voice, but it's quite limited in the Norwegian department
             | and perhaps not quite as interesting as an audio book would
             | be.
        
           | magicalhippo wrote:
           | Re-reading the transcription, I guess I was a bit harsh by
           | saying it's not "good". It gets most of it right, but it
           | keeps messing up some key words. Like "regnstyr" (not a word)
           | rather than "reinsdyr" (reindeer), or "Dagsnytten" rather
           | than "Dagsnytt 18".
           | 
           | It also didn't handle the hanging "... menn", instead
           | thinking it was the start of the following sentence. Almost
           | everyone would understand it was the end of the sentence
           | based on the context.
           | 
           | The double-A vs A is not an issue as it's the same letter,
           | double-A is the older form.
           | 
           | The small model was considerably worse than the large one
           | though.
        
       | kiwih wrote:
       | Given this, are there good (and available/open source) models for
       | text to speech? Last time I tried, everything still sounded
       | extremely robotic and/or was a pain to set up and run. It would
       | be fun to set up a pipeline where the two processes
       | 'communicate'.
        
         | obscur wrote:
         | Measuring performance in rounds of successful Chinese whisper
         | 
         | (irony)
        
       | Bayko wrote:
       | So I guess we can easily use this to generate subtitles?? Which
       | would be nice! Cause ummm some of the movies that I download from
       | the internet arrrrrr! don't have subtitles available
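       | 
       | Looks like yes: the result comes back as timestamped segments,
       | so writing an .srt file is only a few lines. A rough sketch
       | (file names made up, and the timestamp helper is hand-rolled):
       | 
       |       import whisper
       | 
       |       def srt_time(t):
       |           # seconds (float) -> "HH:MM:SS,mmm"
       |           h, rem = divmod(int(t), 3600)
       |           m, s = divmod(rem, 60)
       |           ms = int((t % 1) * 1000)
       |           return f"{h:02}:{m:02}:{s:02},{ms:03}"
       | 
       |       model = whisper.load_model("medium")
       |       result = model.transcribe("movie.mkv", task="translate")
       | 
       |       with open("movie.srt", "w", encoding="utf-8") as f:
       |           for i, seg in enumerate(result["segments"], 1):
       |               f.write(f"{i}\n"
       |                       f"{srt_time(seg['start'])} --> "
       |                       f"{srt_time(seg['end'])}\n"
       |                       f"{seg['text'].strip()}\n\n")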
        
       | pen2l wrote:
       | Neat, https://github.com/openai/whisper - they have open-sourced
       | it, even the model weights, so they are living up to their name
       | in this instance.
       | 
       | The 4 examples are stunningly good (the examples have speakers
       | with heavy accents, speaking in foreign language, speaking with
       | dynamic background noise, etc.), this is far and away better than
       | anything else I've seen. Will be super curious to see other folks
       | trying it out and seeing if it's as robust as it seems, including
       | when confronted with audio speech with natural tics and uhhh's
       | and uhmm's and everything in-between.
       | 
       | I think it's fair to say that AI transcription accuracy is now
       | decidedly superior to the average human's; what the implications
       | of this are, I'm not sure.
        
         | anigbrowl wrote:
         | It was already better. I edit a podcast and have > a decade of
         | pro audio editing experience in the film industry, and I was
         | already using a commercial AI transcription service to render
         | the content to text and sometimes edit it as such (outputting
         | edited audio).
         | 
         | Existing (and affordable) offerings are so good that they can
         | cope with shitty recordings off a phone speaker and maintain
         | ~97% accuracy over hour-long conversations. I'm sure it's been
         | an absolute godsend for law enforcement and other people who
         | need to gather poor-quality audio at scale, though much less
         | great for the targets of repressive authority.
         | 
         | Having this fully open is a big deal though - now that level of
         | transcription ability can be wrapped as an audio plugin and
         | just used wherever. Given the parallel advances in resynthesis
         | and understanding idiomatic speech, in a year or two I probably
         | won't need to cut out all those _uuh like um y'know_ by hand
         | ever again, and every recording can be given a noise reduction
         | bath and come out sounding like it was recorded in a room full
         | of soft furniture.
        
           | adamgordonbell wrote:
           | I've not found that to be the case.
           | 
           | For technical content, I use Rev.com and provide a glossary
           | and real humans do the transcript. Other AI transcription
            | services get lots wrong because the context often matters.
            | I've never found AI to handle words like "TCP/IP", "FAT disk
            | format", or "Big Endian" well so far.
           | 
           | I'm interested to test out whisper on this one.
           | 
           | https://corecursive.com/063-apple-2001/
        
           | deegles wrote:
           | There's already software that can imitate a person's voice,
           | so we have all the pieces already to do speech-to-text, clean
           | up with GPT-3, and back to text-to-speech in the original
           | person's voice. Maybe with a style transfer to keep the
           | person's inflections etc the same?
        
             | Karuma wrote:
             | I think something similar already exists. See this, for
             | example: https://koe.ai/recast/
             | 
             | Although I don't know if they're using anything similar to
             | what you suggest. Very cool idea, anyway!
        
           | biomcgary wrote:
           | Since you work on podcasts, do any open source transcription
            | tools currently identify the speaker in the output? This
           | would be particularly helpful for interviews.
        
             | nico wrote:
             | Not sure about open source, but in general, automated
             | transcription systems need a separate track for each
             | different speaker. So for example, for a phone call with
             | one person on each end, you need two separate channels
             | (recording systems usually split them left/right on one
             | stereo file).
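             | 
             | With Whisper you could try exactly that. A rough sketch,
             | assuming each speaker sits on their own channel of a
             | stereo recording (file names invented, and the ffmpeg
             | split could be done several other ways):
             | 
             |       import subprocess
             |       import whisper
             | 
             |       # Split the stereo call into one mono file per side.
             |       subprocess.run(
             |           ["ffmpeg", "-y", "-i", "call.wav",
             |            "-filter_complex",
             |            "channelsplit=channel_layout=stereo[L][R]",
             |            "-map", "[L]", "left.wav",
             |            "-map", "[R]", "right.wav"],
             |           check=True)
             | 
             |       model = whisper.load_model("small")
             |       segs = []
             |       sides = [("A", "left.wav"), ("B", "right.wav")]
             |       for who, path in sides:
             |           for s in model.transcribe(path)["segments"]:
             |               segs.append(
             |                   (s["start"], who, s["text"].strip()))
             | 
             |       # Interleave the two sides by start time.
             |       for start, who, text in sorted(segs):
             |           print(f"{who} [{start:7.1f}s] {text}")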
        
           | solarmist wrote:
           | Any recommendations for particular services?
        
             | anigbrowl wrote:
             | I use a service called sonix.ai. It's paid but I think they
             | have a free tier or trial period, and it's not very
             | expensive. I'm excited about this new OpenAI thing because
             | I'd rather do it on my own hardware than send it to the
             | cloud, but this company has earned its commercial success.
        
           | nonoesp wrote:
           | I'm not sure if you've tried Descript, but their ML-based
           | "Studio Sound" filter makes bad audio sound like it was
           | recorded and edited nicely.
        
           | solarmist wrote:
           | That is an exciting possibility. Being able to fix bad setups
           | and missed takes automagically. It's always been possible,
           | just expensive and time consuming for moderate improvements.
        
           | thfuran wrote:
           | >~97% accuracy over hour-long conversations. I'm sure it's
           | been an absolute godsend for law enforcement
           | 
            | 97% accuracy means roughly three or four errors per minute
            | of speech, at a typical 120-150 spoken words per minute.
            | That seems potentially extremely problematic for
           | something like law enforcement use where decisions with
           | significant impact on people's day and/or life might be made
           | on the basis of "evidence".
        
             | gs17 wrote:
             | Yeah, I tried to use automated transcription for a research
             | project and we had to do it all manually because the few
             | errors (I would say it did pretty well given our recording
             | quality) were often dropping words like "not", which
             | changed the whole meaning of a sentence! It was a useful
             | assistance during transcription, but I really hope they
             | would verify it was correct before arresting anyone based
             | on it.
        
             | anigbrowl wrote:
             | No it isn't. That just means 2-3% of your content needs to
             | be double-checked by a person at the audio level, saving
             | huge amounts of time - equally true of human transcription,
              | in which individual words are often [UNINTELLIGIBLE].
             | 
             | Would you want to review this fully before going into
             | court, absolutely - because you'd want to play the
             | recording to a jury for emotional impact. Can you rely on
             | it when you want to quickly read through hours of
             | conversation and make decisions about whether to invest
             | further resources (which might just mean another hour of
             | listening back to the original audio)? Also absolutely.
             | Bear in mind that a lot of these errors have little to no
             | semantic impact, being on the same level as typos or
             | misspellings in a written communication.
             | 
             | Bear in mind too that if law enforcement (honest or not) is
             | so interested in you that they're willing to record your
             | conversations, your day is already ruined, you just don't
             | know it yet. The change here is one of scale rather than
             | quality.
        
               | wging wrote:
               | Doesn't it mean 100% of your content needs to be double-
               | checked? You can't easily identify which 2-3% of your
               | content has errors. I'm aware that errors are more likely
               | when the model is less confident of its predictions, but
               | that shouldn't be enough.
               | 
               | (edit for clarification: errors are not always something
               | like "[UNINTELLIGIBLE]", where the system knows it
               | doesn't know; they can also be misrecognitions that the
               | system believes in with high confidence.)
        
               | u8 wrote:
               | I had to do a lot of manual transcription in Journalism
               | school. Using a tool like Descript saved HOURS of my
                | life. Generally it was 80% accurate, but going over a
               | two-hour-long recording again at 3x speed while reading
               | over the transcript, fixing errors from memory or pausing
                | took a five-hour job down to 30-40 minutes. Either way,
               | somebody is going to have to listen to the recording.
               | This just removes a layer of grunt work.
        
               | 6gvONxR4sf7o wrote:
               | > I'm aware that errors are more likely when the model is
               | less confident of its predictions, but that shouldn't be
               | enough.
               | 
               | Suppose 90% of the errors are in the 10% where the model
               | is least confident. Then you can review just 10% of your
               | content and take a 2% error rate down to 0.2% error rate.
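               | 
               | With Whisper specifically you could approximate this
               | with the avg_logprob it reports for each segment
               | (treating that as a confidence proxy is an assumption).
               | A sketch that flags the least-confident tenth for human
               | review:
               | 
               |       import whisper
               | 
               |       model = whisper.load_model("medium")
               |       result = model.transcribe("interview.mp3")
               |       segs = result["segments"]
               | 
               |       # Lowest average log-probability first.
               |       segs.sort(key=lambda s: s["avg_logprob"])
               |       review = segs[:max(1, len(segs) // 10)]
               | 
               |       for s in review:
               |           t = f"{s['start']:.1f}-{s['end']:.1f}s"
               |           print("CHECK", t, s["text"])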
        
               | woah wrote:
               | You double check things that you think are important, in
               | this case, passages that will be used as evidence in
               | court.
        
               | guelo wrote:
               | Maybe you could run the text through a grammar checker to
               | identify the errors.
        
               | thfuran wrote:
               | That might work if people were required to speak
               | grammatically.
        
               | NaturalPhallacy wrote:
               | For real. The way people normally speak, with
               | backtracking, repetition, restarting sentences, or
               | stopping mid sentence and starting a new one with
               | entirely different nouns or entire subjects is perfectly
               | normal in synchronous conversation and isn't jarring, but
               | written down as is, it's like 40% noise.
        
               | worthless-trash wrote:
                | For a good example of this, read ANY of Trump's speeches
               | transcribed.
        
               | NaturalPhallacy wrote:
               | I mean if you want to make it unnecessarily political,
               | Biden's are worse:
               | https://www.youtube.com/watch?v=3bWM1zsnTJc
        
               | worthless-trash wrote:
                | Oh no no, I wasn't trying to be political, it's just one
                | that I read... and wow, you're right!
        
               | gzer0 wrote:
               | To be fair, you chose a video that displays an
               | amalgamation of the biggest gaffes of 2021 for Biden.
               | 
               | "During his term as President of the United States,
               | Donald Trump made tens of thousands of false or
               | misleading claims. The Washington Post's fact-checker had
               | tallied the number as 30,573 by January 2021, an average
               | of about 21 per day by the end of his presidency."
               | [1][2][3][4]
               | 
               | I think it's fair to say there would be a 100 hour long
               | plus video / documentary if they were all compiled into
               | one. lovely!                 - [1] Fact Checker (January
               | 20, 2021). "In four years, President Trump made 30,573
               | false or misleading claims". The Washington Post.
               | Archived from the original on January 20, 2021.
               | - [2] Kessler, Glenn (January 23, 2021). "Trump made
               | 30,573 false or misleading claims as president. Nearly
               | half came in his final year". The Washington Post.
               | Archived from the original on January 24, 2021. Retrieved
               | January 24, 2021.            - [3] Elfrink, Tim (August
               | 14, 2020). "'Do you regret at all, all the lying you've
               | done?': A reporter's blunt question to Trump goes
               | unanswered". The Washington Post. Retrieved August 14,
               | 2020.
               | 
               | [4] https://en.m.wikipedia.org/wiki/Veracity_of_statement
               | s_by_Do...
        
               | donkarma wrote:
        
               | TheCapeGreek wrote:
               | Having done audio transcription in college as a side gig,
               | it takes a lot longer than it sounds. Even at a decent
               | 100wpm you'll take about 5 minutes to type out 1 minute
               | of audio.
               | 
               | Not having to pause + rewind will save a ton of time for
               | that 3%.
        
               | vivegi wrote:
               | You can also use multiple transcription engines and then
               | use mismatches among the text streams to narrow down the
               | % of content that needs to be reviewed. This is quite
               | similar to multi-voting OCR for document images.
               | 
               | The principle is that the engines have different failure
               | modes (hopefully) and therefore the 2-3% error rate of
               | each engine is in different areas of the audio. The key
               | underlying assumption is that the events are mutually
               | exclusive.
               | 
               | With 3 engines, you can use something like 2-of-3 stream
               | matches to override the stream that mismatches.
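               | 
               | A toy version of the 2-of-3 idea, sidestepping the
               | alignment step (which in practice is the hard part,
               | since the engines rarely emit the same number of
               | words):
               | 
               |       from collections import Counter
               | 
               |       def vote(a, b, c):
               |           # a, b, c: word lists already aligned 1:1
               |           merged, disputed = [], []
               |           for i, triple in enumerate(zip(a, b, c)):
               |               best = Counter(triple).most_common(1)[0]
               |               word, hits = best
               |               merged.append(word)
               |               if hits < 2:   # no 2 engines agree
               |                   disputed.append(i)
               |           return merged, disputed
               | 
               |       text, check = vote(
               |           "the cat sat on the mat".split(),
               |           "the cat sat on a mat".split(),
               |           "the cat sat on the mat".split())
               |       # text -> the cat sat on the mat, check -> []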
        
               | anigbrowl wrote:
               | By the time you're prosecuting someone in court, yes of
               | course you double, triple, quadruple check everything.
               | That's why lawyers get paid the big bucks (for now...).
               | But yes you can identify which content probably has
               | errors and flag it as such.
               | 
               | Look, I have decades of experience dealing with human
               | speech, and not just as an editor - I can trace the human
               | voice from neural impulses in Broca's region through the
               | physiology of vocal production, mechanical transduction
               | into electrical signals, discrete fourier transforms of
               | the resultant waveforms into spectral information and
               | back again, the reproduction of altered signals from
               | time-aligned speakers to create a sense of
               | spatialization, how those are processed in the human ear,
               | and how the cilia are connected by nerves back to your
               | brain. I'm a good enough editor that I can recognize many
               | short words by sight of a waveform, or make 10 edits in a
               | row by sight and know it will sound good on playback.
               | 
               | So when I say that machine transcription is as good as
               | human realtime transcription now, I say so with the clear
               | expectation that those decades of craft are very close to
               | being rendered obsolete. I absolutely expect to hand off
               | the mechanical part of editing to a machine within 2
               | years or so. It's already at the stage where I edit some
               | interviews as text, like in a word processor, and then
               | export the edited document as audio and it's Good Enough
               | - not for every speaker, but more than half the time.
               | 
               | NPR and a lot of commercial broadcasters cut their
               | material this way already, because you can get the same
               | result from 30 minutes of reading and text editing that
               | would require 3 hours of pure audio editing with no
               | transcription.
        
               | yourapostasy wrote:
               | _> So when I say that machine transcription is as good as
               | human realtime transcription now..._
               | 
               | Would you go as far as to assert machine transcription
               | can be used as an objective benchmark of a speaker's
               | verbal legibility?
               | 
               | It is fraught with political and interpersonal dynamics
               | to approach someone even privately one on one today and
               | gently suggest their career would get a huge boost if
               | they hired a voice coach to help improve their verbal
               | communication delivery. So even when I don't directly
               | mention their accent, it becomes a very sensitive subject
               | with many.
               | 
               | However, if audio professionals like you can point to a
               | system and say the raw biomechanics and acoustic physics
               | of the world dictate that this is as physically and
               | psychometrically good as audio parsing of human speech
               | gets regardless whether the system was biologically
               | evolved or ML evolved, the conversation can be couched
               | even more objectively.
               | 
               | I enable recording and voice transcription in every
               | meeting I can (ostensibly for DE&I but really for my own
               | selfish purposes), and already observe in myself I have
               | to work hard to overcome a tendency to gloss over
               | speakers who don't transcribe well when I review meeting
               | transcripts to jot down any key information I might have
               | missed taking notes upon during the meeting.
               | 
               | Note that I'm perfectly aware that my foreign language
               | verbal skills are nowhere near the English skills of
               | those I have tried to help. If the _lingua franca_ of the
               | coding world switched to Urdu tomorrow, then I'd hire
               | help to learn and polish my spoken Urdu, like I went to a
               | speech coach when learning public speaking because I can
               | always use help in the many skills I lack.
        
               | frognumber wrote:
               | What tools do you use to do this? I once hacked together
               | an editor like this maybe a decade ago -- edit speech as
               | text from OCR -- and sorely need one now.
               | 
               | Alignment of video to text is a big problem for me too.
        
               | boundlessdreamz wrote:
               | This can be done via https://www.descript.com/ You can
               | edit video/audio by editing the transcript.
               | 
               | You can even add/modify words that weren't originally
               | there https://www.descript.com/overdub
        
               | etienne618 wrote:
               | Presumably you can use the 97% that is correctly
               | transcribed to rapidly filter out the relevant content.
               | This is likely to be only a small portion of the total
               | content. Then you check 100% of that.
        
               | datalopers wrote:
               | If you know which 2-3% are the false positives, you have
               | a very lucrative business model.
        
               | MonkeyMalarky wrote:
               | When doing validation, I find it will often be the same
               | errors repeated again and again in a transcription. Like
               | it will fail on someone's or something's name (that is
               | rare / unique) and map it onto a known similar sounding
               | word.
        
               | gnramires wrote:
               | I think an [UNINTELLIGIBLE] indication would be a great
               | addition to automatic transcription systems.
        
               | inanutshellus wrote:
               | It'd [UNINTELLIGIBLE score="92%" alternatives="pro-
               | rabble; pourable"]probably[/UNINTELLIGIBLE] be useful to
               | make a markup-based output... though you'd probably find
               | it gave you more info than you wanted.
        
               | anigbrowl wrote:
               | It already exists. The commercial product I use most is
               | called sonix.ai and I think they have a free tier or
               | trial period. It has shortcomings but it's shockingly
               | good, despite having some limitations.
        
               | yencabulator wrote:
               | Google Voice voicemail transcription _used_ to do this,
               | with varying levels of gray. It seems that feature is
               | gone, now.
        
               | thfuran wrote:
               | >equally true of human transcription, in which individual
                | words are often [UNINTELLIGIBLE].
               | 
               | ML systems somewhat notoriously do not necessarily make
               | the same sorts of errors that a human would. And I'd
               | expect a large portion of the errors to be transcribing
                | the wrong words rather than indicating that a word
               | couldn't be transcribed. That sort of error means that
               | you can't really get away with manually reviewing just 3%
               | of the audio.
        
               | notahacker wrote:
               | ML tending to make _weird_ mistakes rather than subtle
               | ones that make sense in context like human transcribers
               | is likely to make them easier to spot.
               | 
               | And there are humans in the loop too, and an enormous
                | amount of redundancy in the questions and answers, so even
               | plausible false transcriptions will get picked up on if
               | they matter. Nobody gets sent to jail simply because the
               | transcription process - human or machine - accidentally
               | substitutes "I did it" in place of "I didn't" midway
               | through a two hour interview.
        
               | BartjeD wrote:
               | The thing is that 'Likely' is very far away from
               | 'always'. There is no guarantee the mistake will be easy
               | to spot.
               | 
               | For entertainment purposes AI transcription is awesome.
               | 
               | For serious business applications the ability to
               | recognize mistakes will continue to be a field to which
               | serious attention is given. It would be interesting to
                | see an AI process double-check itself, and also run a
               | logic check on whether the transcription makes sense. So
               | that it can report sections flagged as incongruous or of
               | dubious reliability.
        
               | iroh2727 wrote:
               | +1. There is a widespread "metric fallacy" or "task
               | fallacy" going around. Models of course optimize for
               | metrics, so they tend to perform well on those related
               | metrics.
               | 
               | Humans, however, are not simply metric optimizers. Though
               | it's always in the interest of those corporations
               | producing metric optimizers (i.e. models) to paint humans
               | as such, so their models shine in comparison. They want
               | humans to look like bad machines, so it looks like they
               | should be automated. Not to say they shouldn't in many
               | cases, just that there's a clear one-sidedness in all
               | corporate PR (and funded research, especially that
               | research which is also PR).
               | 
               | All this to say that yes I agree with you. And if we
               | humans don't want our unsustainable economic growth to
               | turn us even more into machines (as our bureaucratic
               | creep has done quite well thus far), we should fight such
               | rhetoric that aims to paint humans simply as machines or
               | task-doers.
        
             | golem14 wrote:
             | One would think that the few crucial bits of information
             | gleaned are listened to manually, and the machine
             | translation is not the only thing the judge or a jury sees.
        
               | thfuran wrote:
               | You have absolutely ruined someone's day way before
               | they're sitting in front of a jury.
        
               | formerly_proven wrote:
               | Stuff like that is a very good tell that someone has zero
               | experience with law enforcement.
        
             | CTDOCodebases wrote:
             | I imagine a certain percentage of a given population is on
             | a voice call at any one time.
             | 
             | 1. Set up a computer with voice recognition software that
             | flags certain patterns.
             | 
             | 2. Connect computer to voice call communication network.
             | 
             | 3. Configure computer to switch between calls every x
             | number of seconds.
             | 
             | Think of it like a system to generate leads for law
             | enforcement that can be integrated with other systems to
             | produce the best quality leads.
        
               | NaturalPhallacy wrote:
               | This is called "a fishing expedition" and is _wildly_
               | unconstitutional in the US.
               | 
               | > _The right of the people to be secure in their persons,
               | houses, papers, and effects, against unreasonable
               | searches and seizures, shall not be violated, and no
               | Warrants shall issue, but upon probable cause, supported
               | by Oath or affirmation, and particularly describing the
               | place to be searched, and the persons or things to be
               | seized._
        
               | CTDOCodebases wrote:
               | Are you sure about that? [0]
               | 
               | Besides I wasn't talking about the USA when I said this.
               | I was remembering a conversation I once had with a person
               | who worked as a technician in a telephone exchange.
               | 
               | [0] - https://en.wikipedia.org/wiki/Jewel_v._NSA
        
               | jjoonathan wrote:
               | Yes, it is wildly unconstitutional, but in practice don't
               | the courts endorse the asinine "it's not a search unless
               | we find something" argument from the NSA?
               | 
               | Power always just finds a way to rationalize what it
               | wants to do.
        
               | kurisufag wrote:
               | see: Operation PRISM
        
             | Thorentis wrote:
             | Not really. Imagine that they do simple keyword matching on
             | the text. Anything that's missed (part of the 97%) the
             | criminals get away with. Anything that matches in the 3% is
             | then checked by a human (by listening to the audio at that
             | time stamp). So you only need to manually check the 3%, and
             | even then only if something you're interested in is found.
        
             | j-krieger wrote:
             | I've worked with similar technology in the law enforcement
             | space and the software is never used to make decisions. You
             | can make out critical timestamps in conversations and a law
             | enforcement officer will always manually confirm the
             | software's assessments.
        
               | JohnFen wrote:
               | Given that law enforcement has made similar claims about
               | technology use in the past that turned out to be false, I
               | have no faith in this claim.
        
             | hadlock wrote:
             | Microsoft announced their voice transcription technology a
             | couple years ago and were also touting ~97-98% accuracy
             | which was actually _better_ than human transcription error
             | rates. The errors are usually in part people garbling their
             | own speech, or they move their head while talking and the
             | microphone misses a syllable. Anything in that error bar
             | would probably fall under  "reasonable doubt"
        
               | kyriakos wrote:
                | If it's anything like Microsoft Teams transcription, I
               | doubt the 97%+ accuracy.
        
         | knaik94 wrote:
         | It seems far from good with mixed-language content, especially
         | with English and Japanese together. The timestamps are far
         | from perfect, and it's nowhere close to human for the more
         | ambiguous translations that depend on the context of a word.
         | It's far below what anyone who spoke either language would
         | consider acceptable. Maybe it's unfair to use music, but music
         | is the most realistic test of whether it's superior to the
         | average human.
        
           | quickthrower2 wrote:
           | Some music is hard for even people to make out the lyrics to.
        
         | soheil wrote:
         | Their name reminds me of the company McDonald's uses to supply
         | their beef, called 100% Pure Beef Inc., so they can say 100%
         | Pure Beef on their menu.
        
           | space_fountain wrote:
            | This seems not to be true for McDonald's:
           | https://www.snopes.com/fact-check/mcdonalds-100-beef/
        
             | cutierust wrote:
        
             | soheil wrote:
             | This article seems very suspect to me. This is the main
              | reason they give for why the claim is false:
             | 
             | "While this is a fascinating premise, there's nothing to
             | it: McDonald's hamburger patties in the U.S. are made with
             | 100% USDA-inspected beef. They are cooked and prepared with
             | salt, pepper and nothing else; no preservatives, no
             | fillers.
             | 
             | McDonald's of Australia's "Make Up Your Own Mind" web site
             | said the following of the rumor in its Top FAQs section:
             | Is it true that McDonald's created a company called "100%
             | Australian Beef" just so they can say that in their
             | advertising?              No."
             | 
             | So if I'm McDonald's and want to squash a negative story
             | why not throw a few bucks at the pinnacle of journalism
             | that is Snopes? (formerly Urban Legends Reference Pages)
        
               | space_fountain wrote:
               | This isn't exactly a hard story to fact check. There is 0
               | evidence for this in either the reddit thread or really
                | anywhere. If they were willing to lie about the company
                | name, why not just lie about the beef in their burgers?
                | It would be equally scandalous.
        
               | soheil wrote:
                | The company name could be 100% legit; there is nothing
                | stopping you from forming a company with that name and
                | not even selling beef.
        
               | sam_goody wrote:
               | It definitely happens.
               | 
               | There are at least two companies that have branded [..]
               | Kosher Gelatin(tm). One of them makes gelatin that is
               | considered non-kosher by all of the major kashrus
               | agencies.
               | 
               | "Kosher Gelatin(r)", when in the ingredients, just means
               | the product contains pork.
        
               | samatman wrote:
               | I believe that you believe this, but you got had. Pretty
               | funny though.
        
               | mrtranscendence wrote:
               | For what it's worth, I've spent a few minutes googling
               | and can't find any story that corroborates this. The only
               | US trademark I can find around "kosher gelatin" is by the
               | brand Kolatin, which is apparently certified Kosher.
        
               | jsight wrote:
                | You are right, it could be. The problem is that it's the
               | kind of thing that would be almost impossible to disprove
               | if it were false. So you can always raise doubts about a
               | supposed disproof.
               | 
               | But it'd be really easy to prove if it were true and
                | no one has offered proof. And there've been plenty of
               | people who've looked for such proof, afaict.
               | 
               | My default assumption in such cases is that it is likely
               | false.
        
               | jefftk wrote:
               | If this was more than an urban legend someone would be
               | able to dig up a company with this name and some
               | indication that McD was working with them.
        
               | pessimizer wrote:
               | Something being possible to do isn't enough evidence for
               | rational people to believe that it happened. From my
               | perspective, it's possible that you're Iron Mike Tyson,
               | or that you died after your last comment and this one was
               | posted by the assassin who killed you.
        
               | soheil wrote:
               | What? I never said it's evidence that it did happen,
               | please don't make things up. I just pointed out the
               | evidence provided to refute the claim is possibly
               | invalid.
        
               | pessimizer wrote:
               | You haven't offered any evidence is the point.
        
               | soheil wrote:
               | Because I'm not trying to prove that it did or not, but
                | rather to draw a parallel between that and OpenAI's
                | name. For all I care it could be an urban legend, but
                | who cares, that's not the point.
        
               | [deleted]
        
               | whichfawkes wrote:
               | In the US, for a while I remember we had billboards
               | advertising McDonald's burgers as being "1 <hamburger>
               | <hamburger>% beef". Because the hamburgers were of course
               | circular, it looked kind of like "100%".
               | 
               | I remember thinking that surely an image of a hamburger
               | does not legally constitute a zero.
        
           | leobg wrote:
           | Seems like this is an urban legend.
           | 
           | https://www.reddit.com/r/IsItBullshit/comments/2rztov/isitbu.
           | ..
        
             | soheil wrote:
             | This seems to be primarily based on the referenced Snopes
             | article https://news.ycombinator.com/item?id=32929237
        
           | amelius wrote:
           | If consumer laws are so easily circumvented then I have
           | little respect for those making these laws.
        
         | [deleted]
        
         | bambax wrote:
         | The French version is a little contrived. The speaker is a
         | native speaker, but the text is obviously the result of a
         | translation from English to French, not idiomatic French.
         | 
         | I will try to put the code to the test, see how it goes.
        
           | octref wrote:
           | I'm interested in building something with this to aid my own
           | French learning. Would love to read your findings if you end
           | up posting it somewhere like twitter/blog!
        
             | bambax wrote:
             | Tried again with Blaise Pascal -- the famous fragment of a
             | letter where he says he's sorry he didn't have enough time
             | to make it shorter.
             | 
             | Original:
             | 
             | > _Mes reverends peres, mes lettres n'avaient pas accoutume
             | de se suivre de si pres, ni d'etre si etendues. Le peu de
             | temps que j'ai eu a ete cause de l'un et de l'autre. Je
             | n'ai fait celle-ci plus longue que parce que je n'ai pas eu
             | le loisir de la faire plus courte. La raison qui m'a oblige
             | de me hater vous est mieux connue qu'a moi. Vos reponses
             | vous reussissaient mal. Vous avez bien fait de changer de
             | methode ; mais je ne sais si vous avez bien choisi, et si
             | le monde ne dira pas que vous avez eu peur des
             | benedictins._
             | 
             | Transcription:
             | 
             | > Mes reves errent peres, mais l'detre navais pas accoutume
             | de se suivre de si pres ni d'detre si etendu. Le peu de
             | temps que j'sais eu a ete cause de l'de l'de l'de autre.
             | J'sais n'detre plus longue que parce que j'sais pas eu le
             | loisir de la faire plus courte. La raison qui m'sa obligee
             | de me hater vous est mieux connue qu'moi. Vos reponses vous
             | reussissaient mal. Vous avez bien fait de changer de
             | methode, mais je ne sais pas si vous avez bien choisi et si
             | le monde ne dira pas que vous avez eu peur des benedictes.
             | 
             | Here there are many more mistakes, so many that the
             | beginning of the text is unintelligible. The language from
             | the 17th century is probably too different. Still on the
             | "medium" model, as the large one crashes the Colab (not
             | sure how to select a beefier machine.)
             | 
             | Still fascinating and exciting though.
        
               | wazoox wrote:
               | Depends on the way you're pronouncing it maybe. To be
               | intelligible IMO it must be read differently from a
               | modern text, with well sounding liaisons, and all vowels
               | very distinct: "un" sounds differently from "in", "a"
               | clearly differs from "a", "ai" and "e" from "e" and for
               | instance the "e" in "etendues" must be pronounced, though
               | not loudly.
               | 
               | My test gives that, much better than yours:
               | 
               |  _Mes *reverants* peres, mes lettres n 'avaient pas
               | accoutume de se suivre de si pres ni d'etre si etendues.
               | Le peu de temps que j'ai eu a ete cause de l'un et de
               | l'autre. Je n'ai fait celle aussi plus longue que parce
               | que je n'ai pas eu le loisir de *l'af*faire plus courte.
               | La raison qui m'a oblige de me *ra*ter vous est mieux
               | connue qu'a moi. Vos reponses vous reussiss*ez* mal. Vous
               | avez bien fait de changer de methode. Mais je ne sais si
               | vous avez bien choisi et si le monde ne dira pas que vous
               | avez eu peur des benedict*eurs*._
        
             | bambax wrote:
             | I'm playing with a Colab posted in this thread
             | (https://news.ycombinator.com/item?id=32931349), and it's
             | incredibly fun and accurate!
             | 
             | I tried the beginning of L'etranger (because you seem to be
             | a fan of Camus ;-)
             | 
             | Here's the original:
             | 
             | > _Aujourd'hui, maman est morte. Ou peut-etre hier, je ne
             | sais pas. J'ai recu un telegramme de l'asile : << Mere
             | decedee. Enterrement demain. Sentiments distingues. >> Cela
             | ne veut rien dire. C'etait peut-etre hier._
             | 
             | > _L'asile de vieillards est a Marengo, a quatre-vingts
             | kilometres d'Alger. Je prendrai l'autobus a deux heures et
             | j'arriverai dans l'apres-midi. Ainsi, je pourrai veiller et
             | je rentrerai demain soir. J'ai demande deux jours de conge
             | a mon patron et il ne pouvait pas me les refuser avec une
             | excuse pareille. Mais il n'avait pas l'air content. Je lui
             | ai meme dit : << Ce n'est pas de ma faute. >> Il n'a pas
             | repondu. J'ai pense alors que je n'aurais pas du lui dire
             | cela. En somme, je n'avais pas a m'excuser. C'etait plutot
             | a lui de me presenter ses condoleances._
             | 
             | Here's the transcription:
             | 
             | > Aujourdhui, maman est morte, peut etre hier, je ne sais
             | pas. J''ai recu un telegramme de l''asile. Mere decedee,
             | enterrement demain, sentiment distingue. Cela ne veut rien
             | dire. C''etait peut etre hier.
             | 
             | > L''asile de Vieillard est a Maringot, a 80 km d''Alger.
             | Je prendrai l''autobus a deux heures et j''arriverai dans
             | l''apres midi. Ainsi, je pourrai veiller et je rentrerai
             | demain soir. J''ai demande deux jours de conge a mon patron
             | et il ne pouvait pas me les refuser avec une excuse
             | pareille. Mais il n''avait pas l''air content. Je lui ai
             | meme dit, ce n''est pas de ma faute. Il n''a pas repondu.
             | J''ai alors pense que je n''aurais pas du lui dire cela. En
             | somme, je n''avais pas a m''excuser. C''etait plutot a lui
             | de me presenter ses condoleances.
             | 
             | Except for the weird double quotes instead of the single
             | apostrophe ('), it's close to perfect, and it only uses the
             | "medium" model.
             | 
             | This is extremely exciting and fun! Happy to try other
             | texts if you have something specific in mind!
        
             | bambax wrote:
             | Last try for tonight with Baudelaire.
             | 
             | Original:                   Trois mille six cents fois par
             | heure, la Seconde         Chuchote Souviens-toi !- Rapide,
             | avec sa voix         D'insecte, Maintenant dit Je suis
             | Autrefois,         Et j'ai pompe ta vie avec ma trompe
             | immonde !              Remember ! Souviens-toi ! prodigue !
             | Esto memor !         (Mon gosier de metal parle toutes les
             | langues )         Les minutes, mortel folatre, sont des
             | gangues         Qu'il ne faut pas lacher sans en extraire
             | l'or !
             | 
             | Transcription:
             | 
             | > Trois mille six cents fois par heure, la seconde chuchote
             | << Souviens toi >>, rapide, avec sa voix d''insecte,
             | maintenant dit << Je suis autrefois >>, et j''ai pompe ta
             | vie avec ma trompe immonde. << Remember, souviens toi,
             | prodigue, est au memoire, mon gosier de metal, parle toutes
             | les langues, les minutes, mortelles folatres, sont des
             | gangs qu''il ne faut pas lacher sans en extraire l''or. >>
             | 
             | Not bad! Far from perfect but it's a difficult text.
             | Interesting that it works better with Baudelaire than
             | Pascal.
        
           | pen2l wrote:
           | Interesting, I'm a non-native French speaker, the original
           | French piece struck me as being entirely normal (but maybe it
           | was just the perfect French accent that swayed me). Can you
           | please point out what he said which wasn't idiomatic or
           | naturally-worded French?
        
             | bambax wrote:
             | Little details. The second sentence is really bizarre:
             | 
             | > _Nous etablissons que l 'utilisation de donnees d'un tel
             | nombre et d'une telle diversite est la raison pour laquelle
             | le systeme est a meme de comprendre de nombreux accents..._
             | 
             | It doesn't sound natural at all. An idiomatic formulation
             | would be more along the lines of:
             | 
             |  _Le recours a un corpus [de donnees] si riche et varie est
             | ce qui permet au systeme de comprendre de nombreux accents_
             | (With  'corpus', 'donnees' is implied.)
             | 
             | Of course this is just an example, and I'm sure other
             | French speakers could come up with a different wording, but
             | "donnees d'un tel nombre et d'une telle diversite" sounds
             | really wrong.
             | 
             | This is also weird and convoluted:
             | 
             | > _Nous distribuons en tant que logiciel libre le code
             | source pour nos modeles et pour l 'inference, afin que
             | ceux-ci puissent servir comme un point de depart pour
             | construire des applications utiles_
             | 
             | It should at least be "le code source DE nos modeles" and
             | "servir DE point de depart", and "en tant que logiciel
             | libre" should placed at the end of the proposition (after
             | 'inference').
             | 
             | Also, "construire" isn't used for code but for buildings,
             | and "applications utiles" is unusual, because "utiles"
             | (useful) is assumed. "...pour le developpement de nouvelles
             | applications" would sound more French.
        
               | [deleted]
        
               | aGHz wrote:
               | That's interesting, as a quebecois I don't agree with any
               | of this. The only thing that raised an eyebrow was "est a
               | meme de", but if turns out it's just another way of
               | saying "capable de", I guess it's simply not a common
               | idiom around here. Aside from that, I found the wording
               | flowed well even if I personally would've phrased it
               | differently.
        
               | slim wrote:
                | Mystery solved. It was a quebecois.
        
               | mazork wrote:
               | Gonna have to agree with the other reply, as a french-
               | canadian, except for "servir comme un point de depart"
               | which should be "servir de point de depart", that all
               | sounds perfectly fine.
        
               | bambax wrote:
               | If this is actually "good" or even acceptable French
               | Canadian, then it's a different language from French (and
               | the blog post should mention it).
               | 
               | I kind of doubt it though -- the speaker doesn't have a
               | Canadian accent (which is hard to miss), and in my
               | (admittedly limited) experience, French Canadian isn't
               | that different from French.
        
               | OrangeMusic wrote:
                | How funny to see that to French people, Quebec French
                | sounds like machine-translated English :)
        
             | _plg_ wrote:
             | At the start, the "Nous etablissons" part, for example. You
              | wouldn't write that if you were starting from scratch in
              | French.
        
               | otikik wrote:
               | That's the first thing that I discovered when I visited
               | Paris for the first time.
               | 
               | No one says "Nous", there, ever. Perhaps the politicians,
               | while giving a speech. Everyone else uses the more
               | informal "On".
               | 
               | I felt duped by my French classes.
        
             | not_math wrote:
             | You can see from the transcript where the model made some
             | errors, for example:
             | 
             | > We distribute as a free software the source code for our
             | models and for the inference [...]
             | 
             | Should be
             | 
             | > We are open-sourcing models and inference code [...]
             | 
             | Another example
             | 
             | > We establish that the use of such a number of data is
             | such a diversity and the reason why our system is able
             | [...]
             | 
             | Should be
             | 
             | > We show that the use of such a large and diverse dataset
             | leads to improved robustness [...]
        
         | DLeychIC wrote:
         | try it out here: https://huggingface.co/spaces/openai/whisper
        
         | Workaccount2 wrote:
         | Can't wait to see twelve new $49.99/mo speech parser services
         | pop up in the next few weeks.
        
           | quickthrower2 wrote:
           | Make hay before Google gives away free hay.
           | 
            | That said, there is value in integrating this into other
            | things.
        
             | quickthrower2 wrote:
              | This has been running on my laptop all day for a 15 min
              | mp3! Definitely not cheap to run, then (I can't imagine how
              | much the AWS compute would cost).
        
         | darepublic wrote:
         | > Neat, https://github.com/openai/whisper - they have open-
         | sourced it, even the model weights, so they are living up to
         | their name in this instance.
         | 
          | Perhaps it will encourage people to add voice commands to their
          | apps, which can then be sent to GPT-3
        
         | pabs3 wrote:
         | Is the training dataset and code open too?
        
         | suyash wrote:
          | More of this is welcome; they should live up to their name and
          | original purpose and share their other models (code, weights,
          | datasets) with the open source community as well.
        
       | Simorgh wrote:
       | I've been experimenting with voice-interfaces where typing is
       | replaced by talking, but I find it hard to transition users to
       | voice - we 'seem' to prefer typing to talking.
       | 
       | I wonder if this will change.
        
         | ironlake wrote:
         | Personally, I would rather type than talk when interacting with
         | a computer. The only time I use voice interfaces are when the
         | physical interface is so poor it's just easier to use voice.
         | Apple TV devices are an example of this.
        
       | shpx wrote:
       | We shouldn't call this open source. The model definition + the
       | data is the source code. The model weights are a compilation
       | artifact.
       | 
       | > The source code must be the preferred form in which a
       | programmer would modify the program. [...] Intermediate forms
       | such as the output of a preprocessor or translator are not
       | allowed.
       | 
       | > https://opensource.org/osd
       | 
       | If I asked a programmer from OpenAI to modify the model to better
       | support Japanese speakers from Hokkaido, their "preferred form"
       | of the model's source code would include the 680,000 hours of
       | audio used to train the model.
       | 
       | Yes that means that there are almost no open source models and
       | yes it's awesome that they released this and made the weights
       | available. Just don't call it open source.
        
         | nl wrote:
         | This isn't really true.
         | 
         | You can do a lot with weights and no training data - for
         | example you can pull the end layer off it and use it as a
         | feature extractor.
         | 
          | And to modify it for Japanese speakers you'd fine-tune the
         | existing model on additional data. If you wanted to modify the
         | model you can (sometimes, depending on what you want to do)
         | modify an existing architecture by removing layers, adding
         | replacements and fine tuning.
         | 
          | I don't quite know what the right analogy for a trained model is.
         | In many ways it is more valuable than the training data because
         | the compute needed to generate it is significant. In other ways
         | it is nice to be able to inspect the data.
         | 
         | > The source code must be the preferred form in which a
         | programmer would modify the program.
         | 
          | As a machine learning programmer I'd much prefer the weights to
          | the raw data. It's not realistic for me to use that training
          | data in any way with the compute I have access to.
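          | 
          | For example, a minimal sketch of using the released weights as a
          | feature extractor (the loading helpers are the ones from the
          | repo's README; the clip path and the encoder-only call are my own
          | assumptions about how you'd wire it up):
          | 
          |   import torch
          |   import whisper
          | 
          |   model = whisper.load_model("base")
          | 
          |   # load ~30s of audio and convert it to a log-Mel spectrogram
          |   audio = whisper.pad_or_trim(whisper.load_audio("clip.wav"))
          |   mel = whisper.log_mel_spectrogram(audio).to(model.device)
          | 
          |   # run only the encoder; the output is a sequence of audio embeddings
          |   # you can feed into your own classifier or similarity search
          |   with torch.no_grad():
          |       features = model.encoder(mel.unsqueeze(0))
          | 
          |   print(features.shape)  # roughly (1, 1500, 512) for the base model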
        
         | rvz wrote:
          | Yes. It's just like calling the release of compiled, closed binary
          | blobs 'open source' even when the source needed to reproduce the
          | compiled output is unavailable.
         | 
         | > If I asked a programmer from OpenAI to modify the model to
         | better support Japanese speakers from Hokkaido, their
         | "preferred form" of the model's source code would include the
         | 680,000 hours of audio used to train the model.
         | 
         | Precisely. These 'users' lifting the model can't do it
         | themselves. You will still be contacting OpenAI for support or
         | to add support for another language and they will be the ones
         | able to modify the model.
         | 
         | > Just don't call it open source.
         | 
          | That is true; it is still closed source, and already we are
          | seeing the hype squad apologising for OpenAI because they
          | 'open sourced' a closed model that you can't modify yourself.
         | 
         | OpenAI is still business as usual and nothing has changed.
        
           | MacsHeadroom wrote:
           | >You will still be contacting OpenAI for support or to add
           | support for another language and they will be the ones able
           | to modify the model.
           | 
            | This isn't quite correct. The model weights are all you need
            | to fine-tune the model on your own audio.
           | 
           | Without the original training set this still isn't open
           | source. But you aren't powerless to modify the model without
           | the original training set.
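            | 
            | A rough sketch of what one fine-tuning step could look like with
            | just the released weights (the repo only ships inference code,
            | so the tokenizer handling and loss below are my own guess at a
            | training loop, not something OpenAI provides):
            | 
            |   import torch
            |   import whisper
            |   from whisper.tokenizer import get_tokenizer
            | 
            |   model = whisper.load_model("small")
            |   tokenizer = get_tokenizer(multilingual=True, language="ja", task="transcribe")
            |   optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)
            | 
            |   def train_step(audio_path, transcript):
            |       mel = whisper.log_mel_spectrogram(
            |           whisper.pad_or_trim(whisper.load_audio(audio_path))
            |       ).to(model.device).unsqueeze(0)
            |       tokens = torch.tensor(
            |           [list(tokenizer.sot_sequence) + tokenizer.encode(transcript) + [tokenizer.eot]],
            |           device=model.device,
            |       )
            |       # teacher forcing: the decoder sees tokens[:-1] and predicts tokens[1:]
            |       logits = model(mel, tokens[:, :-1])
            |       loss = torch.nn.functional.cross_entropy(logits.transpose(1, 2), tokens[:, 1:])
            |       loss.backward()
            |       optimizer.step()
            |       optimizer.zero_grad()
            |       return loss.item()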
        
         | pabs3 wrote:
         | The Debian deep learning team's machine learning policy would
         | call this a "toxic candy" model:
         | 
         | https://salsa.debian.org/deeplearning-team/ml-policy
         | 
         | BTW, wouldn't you take the existing model and do additional
         | Hokkaido Japanese speaker training on top of it, rather than
         | retraining the model from scratch?
        
       | [deleted]
        
       | sergiotapia wrote:
       | Does this work with multiple speakers?
       | 
       | I want to build a tool that takes a video and generates subtitles
       | for it, then I want to index the subtitles and let people search
       | for a specific quote to scrub to that part of the video using
       | automatically generated urls.
       | 
       | This is for a specific fandom of a ton of content, lots of dirty
       | audio mostly recorded in a gym setting with multiple people
       | speaking.
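        | 
        | (For the indexing part: the Python API already returns timestamped
        | segments, which is most of what's needed. Sketch below; the file
        | name is a placeholder, and note that Whisper itself doesn't label
        | who is speaking, so separating multiple speakers would need an
        | extra diarization step.)
        | 
        |   import whisper
        | 
        |   model = whisper.load_model("medium.en")
        |   result = model.transcribe("gym_session_014.mp4")
        | 
        |   # each segment carries start/end times in seconds plus the text,
        |   # which maps directly onto subtitle cues and a search index
        |   index = [
        |       {"start": s["start"], "end": s["end"], "text": s["text"]}
        |       for s in result["segments"]
        |   ]
        | 
        |   def find(quote):
        |       return [s for s in index if quote.lower() in s["text"].lower()]
        | 
        |   print(find("personal record"))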
        
         | 867-5309 wrote:
          | pretty sure such a tool made the HN front page a few months ago
        
       | isoprophlex wrote:
       | Really incredible to see that their multilingual audio-to-English
        | approach is viable. I'm super excited about this, and it's great
        | to see OpenAI actually open something up, for once.
       | 
       | Skimming the codebase I can't immediately see code to do
       | additional training.
       | 
       | Being able to fine-tune the model to a specific language or case
       | (eg. teach it specifically about some technical topic that might
        | not be so prevalent in the current training set) would be majorly
       | disruptive to current SOTA in "callcenter analytics" tech.
       | Especially when combining Whisper with GPT3.
        
       | samstave wrote:
       | AI speech recognition FN scares the heck out of me...
       | 
       | for so many reasons.
       | 
       | But one that really pisses me off is not being able to turn it
       | off on the iphone, and the fact that aside from "hidden cameras
       | in my airBnB" -- soon we will have to worry about secret
       | listening machines EVERYWHERE
        
         | jfoster wrote:
         | Also, based on their demo, this model seems like it might have
         | comprehension well above the level of a typical human.
         | 
         | Anyway, it's out there now. No way to turn back.
        
         | ma2rten wrote:
         | We will see an explosion of AI capabilities in the next couple
         | of years. This will have a huge impact on our lives, much of it
         | good but some of it also bad.
        
           | samstave wrote:
           | "Good" for ensuring you're a compliant consumer - bad if
           | you're an individual person
        
         | wongarsu wrote:
         | "Secret listening machines everywhere" was a pretty big thing
         | in East Germany. It's also the central theme of the movie The
         | Lives of Others.
         | 
         | Of course, the ability to scale this more cheaply (throwing
         | more compute at it, instead of more people) is somewhat scary,
         | but it's not really introducing a new capability. Especially
         | since you still have to do something with the transcript. An
         | AirBnB landlord who reads the transcript of what you said could
         | as well have listened to the recording.
        
           | ALittleLight wrote:
           | I think it's a new capability to add good speech to text,
           | search, and models that can understand and process text. You
           | have microphones recording speech everywhere, models turning
           | that speech into easily searchable text, and something like
           | GPT-3 reading all the speech and raising red flags for any
           | transgressive idea you please.
        
             | samstave wrote:
              | Yes, and once there is AI searching for "dissenters", we
              | shall soon have "speech police" or tickets or some form of
              | authoritarian punitive action powered by this.
        
               | zappy42 wrote:
               | "John Spartan, you have been fined one credit for
               | violation of the Verbal Morality Statute."
        
           | jffry wrote:
           | I'd argue that cheap, pervasive, always-on surveillance with
           | a backlog of searchable transcriptions is a qualitatively
           | different capability.
        
             | samstave wrote:
             | Exactly.
             | 
             | We are entering the next era...
             | 
             | The Kurzweil podcast appearance on Lex Fridman is nuts and
             | while I love kurzweil, holy crap even with my distopian
             | outlook he makes it even worse when you listen to even half
             | of it...
        
             | samstave wrote:
             | Exactly - imagine when we get to the point where,
             | regardless of your "crime", your punishment is 'augmented'
             | by the " _thing that you said in the past_ " AND when it
             | starts to be able to connect to APIs of your
             | social/whatever accounts and AI-Auto-Cancel you....
             | 
             | Basically digital assassination.
        
       | gareth_untether wrote:
        | I'm thinking of releasing a plugin for Unity that can be used to
        | match a phrase to an action. Seeing Whisper makes me think I
        | should include a way to use voice and not just text.
        
       | nothrowaways wrote:
       | Great project, not so great package name.
        
       | aidenn0 wrote:
       | I just threw a random rock MP3 at it, and a first readthrough
       | shows no transcription errors; this is quite good.
       | 
       | Now I just want OCR that's even 50% as good as this...
        
         | aidenn0 wrote:
         | Ran a few other songs through it and found one obvious
         | mistranscription:
         | 
         | "He's the bedroom cosmic rocker" (should be "He's the veteran
         | cosmic rocker" in _Veteran Cosmic Rocker_ by The Moody Blues)
         | 
         | I also noticed that it's a little on the conservative side for
         | detecting speech; all songs were missing at least part of one
         | line.
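          | 
          | (If it helps, transcribe() exposes a threshold for that
          | voice-activity decision; nudging it is a sketch of how you might
          | trade missed lines for more hallucinated ones. The parameter name
          | and its default are my reading of the repo's transcribe.py.)
          | 
          |   import whisper
          | 
          |   model = whisper.load_model("medium.en")
          |   result = model.transcribe(
          |       "some_song.mp3",
          |       no_speech_threshold=0.8,  # default is 0.6; higher is less eager to call a chunk "silence"
          |   )
          |   print(result["text"])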
        
           | aidenn0 wrote:
            | Ran it on _Juicy_ by The Notorious B.I.G. and the results were
            | considerably worse than with the mix of prog-rock and British
            | Invasion music I had tried before, though at least some of
            | that is due to the number of proper nouns in that song.
           | 
           | It took about 1000 CPU-minutes for this 5 minute song on my
           | Ryzen 2700 with 12 OpenMP threads (about 100 minutes wall-
           | clock).
        
             | antegamisou wrote:
              | Here's the output of
              | 
              |   whisper never-gonna-give-you-up.mp3 --language English --model small
              | 
              |   [00:00.000 --> 00:27.000]  We're no strangers to love You know the rules and so do I
              |   [00:27.000 --> 00:35.000]  I feel commitments while I'm thinking of You wouldn't get this from any other guy
              |   [00:35.000 --> 00:43.000]  I just wanna tell you how I'm feeling Gotta make you understand
              |   [00:43.000 --> 00:47.000]  Never gonna give you up Never gonna let you down
              |   [00:47.000 --> 00:53.000]  Never gonna run around and desert you Never gonna make you cry
              |   [01:00.000 --> 01:09.000]  We've known each other for so long Your heart's been aching but you're too shy to say
              |   [01:09.000 --> 01:17.000]  Inside we both know what's been going on We know the game and we're gonna play it
             | 
             | It was running for quite a long time (20 minutes) on my
             | admittedly low-budget specs.
             | 
             | Note that I did not omit 00:53.000 -> 01:00.000.
             | 
             | Shouldn't there be some type of unintelligible warning
             | since it wasn't able to transcribe that part?
        
               | aidenn0 wrote:
               | Model small is about as good at recognizing lyrics as an
               | untrained Newton was at recognizing handwriting.
               | 
               | Here's a comparison of _Basket Case_ by Greenday:
               | 
               | Small:                   [00:00.000 --> 00:05.000]  Do
               | you have the time to listen to me whine
               | [00:05.000 --> 00:10.000]  About nothing and everything
               | I'll have once?         [00:11.000 --> 00:16.000]  I am
               | one of those melodramatic fools         [00:16.000 -->
               | 00:20.000]  Neurotic to the bone, no doubt about it
               | [00:23.000 --> 00:27.000]  Sometimes I give myself the
               | creeps         [00:27.000 --> 00:32.000]  Sometimes my
               | mind plays tricks on me         [00:32.000 --> 00:38.000]
               | It all keeps headed up, I think I'm pregnant
               | [00:38.000 --> 00:43.000]  And I'm just paranoid, I'm
               | just stuck         [00:47.000 --> 00:52.000]  I went to a
               | shrink to have a life like my dreams         [00:52.000
               | --> 00:57.000]  She says it's like a sex that's bringing
               | me down         [00:57.000 --> 01:03.000]  I went to a
               | whore, he said my life's a bore         [01:03.000 -->
               | 01:08.000]  Choked with my widest buzz that's bringing
               | her down         [01:10.000 --> 01:14.000]  Sometimes I
               | give myself the creeps         [01:15.000 --> 01:19.000]
               | Sometimes my mind plays tricks on me         [01:19.000
               | --> 01:25.000]  It all keeps headed up, I think I'm
               | pregnant         [01:25.000 --> 01:30.000]  And I'm just
               | paranoid, I'm just stuck         [01:30.000 -->
               | 01:48.000]  Grasping to control, it's all I better hold
               | on         [02:08.000 --> 02:12.000]  Sometimes I give
               | myself the creeps         [02:13.000 --> 02:17.000]
               | Sometimes my mind plays tricks on me         [02:18.000
               | --> 02:23.000]  It all keeps headed up, I think I'm
               | pregnant         [02:23.000 --> 02:30.000]  And I'm just
               | paranoid, I'm just stuck         [02:53.000 -->
               | 03:13.000]  Thanks for watching!
               | 
               | Medium:                   [00:00.000 --> 00:05.000]  Do
               | you have the time to listen to me whine
               | [00:05.000 --> 00:10.000]  About nothing and everything
               | all at once?         [00:11.000 --> 00:16.000]  I am one
               | of those melodramatic fools         [00:16.000 -->
               | 00:20.000]  Neurotic to the bone, no doubt about it
               | [00:23.000 --> 00:27.000]  Sometimes I give myself the
               | creeps         [00:27.000 --> 00:32.000]  Sometimes my
               | mind plays tricks on me         [00:33.000 --> 00:36.000]
               | It all keeps adding up         [00:36.000 --> 00:39.000]
               | I think I'm cracking up         [00:39.000 --> 00:41.000]
               | Am I just paranoid?         [00:41.000 --> 00:43.000]  Am
               | I just sad?         [00:47.000 --> 00:50.000]  I went to
               | a shrink         [00:50.000 --> 00:53.000]  To analyze my
               | dreams         [00:53.000 --> 00:58.000]  She says it's
               | lack of sex that's bringing me down         [00:58.000
               | --> 01:01.000]  I went to a whore         [01:01.000 -->
               | 01:04.000]  He said my life's a bore         [01:04.000
               | --> 01:09.000]  So quit my whining cause it's bringing
               | her down         [01:10.000 --> 01:14.000]  Sometimes I
               | give myself the creeps         [01:16.000 --> 01:20.000]
               | Sometimes my mind plays tricks on me         [01:20.000
               | --> 01:23.000]  It all keeps adding up         [01:23.000
               | --> 01:26.000]  I think I'm cracking up
               | [01:26.000 --> 01:28.000]  Am I just paranoid?
               | [01:28.000 --> 01:30.000]  Am I just sad?
               | [01:40.000 --> 01:44.000]  Grasping to control
               | [01:44.000 --> 01:50.000]  So I better hold on
               | [02:07.000 --> 02:11.000]  Sometimes I give myself the
               | creeps         [02:11.000 --> 02:16.000]  Sometimes my
               | mind plays tricks on me         [02:16.000 --> 02:19.000]
               | It all keeps adding up         [02:19.000 --> 02:22.000]
               | I think I'm cracking up         [02:22.000 --> 02:24.000]
               | Am I just paranoid?         [02:24.000 --> 02:52.000]  Am
               | I just sad?         [02:54.000 --> 02:58.000]  Thanks for
               | watching!
        
           | macrolocal wrote:
           | For what it's worth, even the large model balks on Easy
           | (Aesop Rock), eg.
           | 
           | "Fountainheads spittle sniglets quicker than quidditch
           | seekers snatch golden snitches."
           | 
           | becomes
           | 
           | "Stirred up out mids bittles, snicklets, cricket and
           | quidditch seekers net golden snitches."
           | 
              | ¯\_(ツ)_/¯
        
             | aidenn0 wrote:
             | Large was not obviously better than medium when I tried it.
             | My impression was that it tended to fit more to a language
             | model than the sounds heard, which corrected some errors
             | and introduced some others, but I didn't try a lot of songs
             | because large won't run on my GPU.
        
       | sjsdaiuasgdia wrote:
       | I was comparing a batch of transcriptions between these models
       | and vosk, and noticed that the medium.en model produces some
       | weird results compared to the others. I've seen a number of loops
       | with one word or a small sequence of words repeating several
       | times. It seems more prone to output that reads like nonsense
       | than the others.
       | 
       | More troubling is a short audio clip that got a few full
       | sentences back, several times the text length that comes back
       | from the other models or vosk. The content of the sentences is
       | extremely far from the audio content. The best alignment I can
       | find is the first word of medium.en's interpretation is somewhat
       | phonetically similar to the audio.
       | 
       | The small.en model doesn't show these behaviors, at least in this
       | data set.
        
         | nshm wrote:
          | The whole value of this model is in the 680,000 hours of training
          | data, and to benefit from that you need the large model, not the
          | smaller ones. The smaller versions just don't have enough
          | capacity to represent the training data properly.
        
       | powera wrote:
       | My first take: it is slow.
       | 
       | The "base" model (supposedly 16x faster than the large one) takes
       | more than the audiofile playback time on my machine to do
       | transcriptions.
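        | 
        | If anyone wants to reproduce this, a quick timing sketch (the file
        | name is a placeholder; fp16=False just avoids the half-precision
        | warning on CPU):
        | 
        |   import time
        |   import whisper
        | 
        |   for name in ["tiny.en", "base.en", "small.en"]:
        |       model = whisper.load_model(name)
        |       start = time.time()
        |       model.transcribe("sample.mp3", fp16=False)
        |       print(name, round(time.time() - start, 1), "seconds")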
        
         | fitznd wrote:
         | I'm seeing even worse. On my M1 Max 2021 macbook pro, I tried
         | transcribing a 30 minute video file and left it on overnight
         | and it was only half way through. I feel like something could
         | be wrong with my setup but I'm only using the defaults.
        
       | archibaldJ wrote:
        | Is this practical to use on the "edge" (for voice control)? Would
        | love to know if anyone has a rough idea of how fast or slow this
        | would be on an M1 Mac or a V100.
        
       | LoveMortuus wrote:
       | This could be used to make some really cool RPG games!
        
       | funhighway wrote:
       | Would be nice to give more details about the provenance and
       | construction of the training data.
        
       | [deleted]
        
       | rlt wrote:
       | As a casual observer I get the sense that OpenAI and others are
       | very rapidly creating building blocks of something much bigger...
        
       | StevenWaterman wrote:
       | That example at the top of the page (speed talking) blew me away.
       | He started talking, I was stunned for a minute, then realised
       | yes, it really was English, and I just burst out laughing.
       | 
       | That's so, so far beyond the previous state-of-the-art, it's
       | absurd.
        
         | NaturalPhallacy wrote:
          | It's a Micro Machines ad from the '80s. He talked like that in
         | all of them!
         | 
         | As for speed, to a computer we don't talk very fast, not even
         | that guy.
         | 
         | I wonder if it could handle Rap God by Eminem....Let's find
         | out!
        
           | dreamer7 wrote:
           | Did you find out :D?
        
             | arpankapoor wrote:
             | I did! There are a few places it transcribes incorrectly,
             | but overall I'm very impressed. Here's the first ~30
              | seconds:
              | 
              |   [00:00.000 --> 00:09.000]  Look, I was going to go easy on you, not to hurt your feelings, but I'm only going to get this one chance.
              |   [00:09.000 --> 00:11.000]  Something's wrong, I can feel it.
              |   [00:11.000 --> 00:17.000]  It's just a feeling I've got, like something's about to happen, but I don't know what.
              |   [00:17.000 --> 00:21.000]  If that means what I think it means, we're in trouble, big trouble.
              |   [00:21.000 --> 00:24.000]  Had to be as bananas as you say, I'm not taking any chances.
              |   [00:24.000 --> 00:26.000]  You're just one to die for.
              |   [00:26.000 --> 00:32.000]  I'm beginning to feel like a rap god, rap god. All my people from the front to the back nod, back nod.
        
             | NaturalPhallacy wrote:
             | It was doing it _slowly_ , but hadn't got to the insane bit
             | when I killed it to try and get it working with CUDA, so I
             | had to do some digging and it turns out I need a version of
             | pytorch with CUDA enabled, and so I had to go and install
              | Anaconda, and now conda is stuck trying to "solve" my
             | environment to install pytorch with CUDA.
             | 
             | So...probably?
             | 
             | Pre-post edit: I can't get it to work.
             | 
             | I've installed pytorch with cuda via pip3, installed the
             | nVidia toolkit and it doesn't see it:
             | 
              | >>> import torch
              | >>> torch.cuda.is_available()
              | False
             | 
             | I've wasted like an hour and a half on it now. I'm not a
             | python dev, and don't have any ML experience so this was
             | just for fun and now it's not anymore.
        
               | mlboss wrote:
                | Try running the pytorch/pytorch Docker image. But you will
                | need the NVIDIA container runtime installed. I am sure
                | somebody will soon release a Docker image for this too.
        
               | forgingahead wrote:
               | Welcome to every single Python ML project - dependency
               | hell will quickly kill any enthusiasm one may have for
               | trying out projects. It really feels archaic to have
               | these issues with such cutting edge technology.
        
               | MayeulC wrote:
               | You can blame CUDA quite a bit for that. Proprietary, you
               | need to sort out which driver you need, plus an nvidia
               | GPU...
               | 
               | I tried compiling pytorch with vulkan support, but there
               | are a few LDFLAGS that are wrong. I'll try to solve that
               | some time later.
               | 
               | One piece of advice: use distribution packages! Arch
               | provides pytorch-cuda, and has PKGBUILDS as well.
               | 
                | For reproducibility, I wish we were all on Nix/Guix, but
               | that's not the case (and CUDA+HW dependency would make it
               | complicated).
        
               | forgingahead wrote:
               | CUDA is not the problem, the problem is crappy code being
               | released on Github where basic things like
               | requirements.txt are missing, never mind an earnest
               | attempt to provide details about the environment that the
               | code was running on. This is on top of code that has lots
               | of hard-coded references to files and directories, plus
               | also many python libraries just breaking compatibility
               | with each other on point releases.
               | 
               | I can't find a source now, but I remember reading some
               | code where the maintainer had to change a huge chunk of
               | code because the point change for a dependency library
               | literally flipped either how the library handled
               | height/width or BGR channels (I can't remember which one
               | but it was preposterous) from the 2.5.4 to the 2.5.5
               | version. There is no reason for doing that - it breaks
               | everything just for grins and giggles.
               | 
               | Python itself is also a problem, but that's a rant for
               | another day. Ah, how I wish Ruby had become the defacto
               | language of choice for ML/Deep Learning!
        
       | londons_explore wrote:
       | @dang Can we change the link to the github here[1]?
       | 
       | It seems to describe the project better for a technical audience.
       | 
       | [1]: https://github.com/openai/whisper
        
       | toss1 wrote:
       | Like every model I've seen there is something like this:
       | 
       | >>A decoder is trained to predict the corresponding text...
       | 
       | Prediction of expected text in the context of the previous text.
       | 
       | While this is valuable in casual transcription, it can be
       | extremely dangerous in serious contexts.
       | 
       | From personal experience, having given a deposition with an "AI"
       | transcription, it will literally reverse the meanings of
       | sentences.
       | 
       | This is because it produces the _EXPECTED_ output in a context,
       | and _NOT THE ACTUAL OUTPUT_.
       | 
       | Like a speaker that clips the output, these types of systems
       | 'clip' the really valuable information out of a transcription.
       | Worse yet, this is a completely silent failure, as the transcript
       | _LOOKS_ really good.
       | 
       | Basic info theory shows that there is more information contained
       | in 'surprising' chunks of data than in expected ones. These
       | systems actively work to substitute 'expected' speech to
       | overwrite 'surprising' speech.
       | 
       | The transcript I got was utter trash, multiple pages of errata I
        | had to submit when the norm is a couple of lines. And as I
       | said, some literally reversed the meaning in a consequential way,
       | and yet completely silently.
       | 
       | This kind of silent active failure mode is terrifying. Unless it
       | is solved, and I see no way to solve it without removing ALL
       | predictive algos from the system, these types of systems must not
       | be used in any situation of serious consequence, at least not
       | without real redundancy and backup.
        
         | lunixbochs wrote:
         | Do you have a demo audio clip for this? I'd be interested to
         | see how it looks in practice.
        
           | toss1 wrote:
           | Sorry, I don't have anything available.
           | 
           | One item I remember was that I said "Dr Kemeny" in relation
           | to Dartmouth College (he was a famous mathematician, invented
           | the BASIC programming language and was president of the
           | college). It replaced those instances with "Jack Kennedy".
           | 
           | In another instance, I said that "Evidently, you have a
           | reading comprehension problem.". It replaced it with
           | "Evidently, I have a ...", completely reversing the meaning.
           | 
            | There were zero problems with the microphones or audio, and it
           | was not rushed or mumbled talk. There were 80+ other examples
           | over a few hours of talking, and some from other speakers.
           | And those were just the obvious ones I could catch.
           | 
           | Another massive problem with this technology is that a human
           | stenographer can notice when s/he missed something and didn't
           | hear and ask the speaker to repeat or clarify what was said,
           | and will often during a pause request clarification on
           | spelling of names, addresses, etc. In contrast, this "AI"
           | technology just barges ahead ASSuming that it knows what it
           | is doing and inserts literally whatever sounds good in the
           | transcript, completely silent that it doesn't have a clue.
           | 
           | Having seen this up close, I'm of the strong opinion that
           | anyone foisting this software on the market without huge
           | warnings that this is not usable for any critical functions
           | is, basically a fraud. They know or certainly should know
           | that these failures not only exist but are common and
           | systemic, yet they barge along like it is OK. It is not.
        
         | Tomis02 wrote:
          | I've been saying this for years. Current "AI" algorithms are
         | fundamentally flawed because they rely on a statistical
         | approach. This works moderately well for some use cases but it
         | will rarely give you 100% confidence. Good luck with self-
         | flying planes or self-running nuclear power plants.
        
           | toss1 wrote:
           | >>Current "AI" algorithms are fundamentally flawed because
           | they rely on a statistical approach.
           | 
           | YES! The old joke about "Artificial Stupidity" is actually
           | more true than anyone realized.
           | 
           | These statistical so-called-AI systems actually work to
           | actively REMOVE or sanitize out any unexpected information,
           | making it all conform with the EXPECTED results from the
           | training set.
           | 
           | This not only REMOVES the most high-information 'surprising'
           | or unexpected nuggets, it actively HIDES them. When something
           | unexpected comes up, it gets force fit into the expected
           | prediction algorithms and output as if it were good.
           | 
           | I'm not saying that there are no useful things that can be
           | done with this technology -- there is a LOT of mundane work
           | out there to be done.
           | 
           | But, we will never get this type of "AI" saying "Huh, that's
           | odd, I wonder why that is?", which is exactly the kind of
           | observation that leads a prepared and fertile mind to great
           | discoveries.
        
       | sowbug wrote:
       | I knew there was a reason why I kept my MP3 library even after
       | subscribing to Spotify. Now piping everything through whisper. So
       | far the generated lyrics are reasonable, though it thinks the REM
       | song says "Linnie Bruce is not afraid."
       | 
       | No surprise that it appears to have successfully transcribed all
       | the recordings of Harvard Sentences I could find.
       | https://en.wikipedia.org/wiki/Harvard_sentences
        
       | Havoc wrote:
        | This could be really cool for Mycroft/Rhasspy etc.
        
       | sva_ wrote:
        | It seems like Stability AI's release has led to some real disruption
       | in the ML field regarding open source, and this doesn't seem to
       | be limited to image generation. Excited to see what comes next.
        
       | hijp wrote:
       | Anyone get it running on m1 mac?
       | 
       | I keep getting `ModuleNotFoundError: No module named
       | 'setuptools.command.build'`
        
         | simmanian wrote:
         | I got it working inside a docker container on my M1 MBP. FWIW,
         | I'm having my $180 tinyminimicro PC run a translation task
         | while my M1 MBP runs a transcription task with the same audio
         | input. So far, the PC is actually outputting results a lot
         | faster than the MBP. Interesting results.
        
         | kif wrote:
         | I got requirements installed, but then when running the Python
         | example, I get:
         | 
         | RuntimeError: "slow_conv2d_cpu" not implemented for 'Half'
        
           | kif wrote:
           | Probably need to pass some kind of options when initializing.
           | The command itself works fine, just shows a warning:
           | warnings.warn("FP16 is not supported on CPU; using FP32
           | instead")
        
             | mewse-hn wrote:
             | using this in the sample code worked for me:
             | 
             | >>> options = whisper.DecodingOptions(fp16=False)
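              | 
              | For context, that plugs into the lower-level example from the
              | README roughly like this (CPU-only sketch):
              | 
              |   import whisper
              | 
              |   model = whisper.load_model("base")
              |   audio = whisper.pad_or_trim(whisper.load_audio("audio.mp3"))
              |   mel = whisper.log_mel_spectrogram(audio).to(model.device)
              | 
              |   # fp16=False keeps everything in float32, which is what the CPU kernels expect
              |   options = whisper.DecodingOptions(fp16=False)
              |   result = whisper.decode(model, mel, options)
              |   print(result.text)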
        
         | dceddia wrote:
         | Yep, I had this too. `pip3 install -U pip setuptools` took care
         | of it. (If you get an error about pip3, try `pip` instead)
        
           | hijp wrote:
           | I'm really new to pip, but does this look ok?
           | 
           | (after running the command for setuptools) Defaulting to user
           | installation because normal site-packages is not writeable
           | Requirement already satisfied: pip in
           | /Users/xxx/Library/Python/3.9/lib/python/site-packages
           | (22.2.2) Requirement already satisfied: setuptools in
           | /Users/xxx/Library/Python/3.9/lib/python/site-packages
           | (65.3.0)
           | 
           | ---- after trying whisper installation: x Getting
           | requirements to build wheel did not run successfully. | exit
           | code: 1 +-> [20 lines of output] Traceback (most recent call
           | last): File "/Users/xxx/Library/Python/3.9/lib/python/site-
           | packages/pip/_vendor/pep517/in_process/_in_process.py", line
           | 363, in <module> main() File
           | "/Users/xxx/Library/Python/3.9/lib/python/site-
           | packages/pip/_vendor/pep517/in_process/_in_process.py", line
           | 345, in main json_out['return_val'] =
           | hook(*hook_input['kwargs']) File
           | "/Users/xxx/Library/Python/3.9/lib/python/site-
           | packages/pip/_vendor/pep517/in_process/_in_process.py", line
           | 130, in get_requires_for_build_wheel return
           | hook(config_settings) File "/Library/Developer/CommandLineToo
           | ls/Library/Frameworks/Python3.framework/Versions/3.9/lib/pyth
           | on3.9/site-packages/setuptools/build_meta.py", line 154, in
           | get_requires_for_build_wheel return self._get_build_requires(
           | File "/Library/Developer/CommandLineTools/Library/Frameworks/
           | Python3.framework/Versions/3.9/lib/python3.9/site-
           | packages/setuptools/build_meta.py", line 135, in
           | _get_build_requires self.run_setup() File "/Library/Developer
           | /CommandLineTools/Library/Frameworks/Python3.framework/Versio
           | ns/3.9/lib/python3.9/site-packages/setuptools/build_meta.py",
           | line 150, in run_setup exec(compile(code, __file__, 'exec'),
           | locals()) File "setup.py", line 2, in <module> from
           | setuptools_rust import Binding, RustExtension File "/private/
           | var/folders/lj/7x6d3dxd3cbdtt484k6xsmyh0000gn/T/pip-build-
           | env-ieaydl8r/overlay/lib/python3.9/site-
           | packages/setuptools_rust/__init__.py", line 1, in <module>
           | from .build import build_rust File "/private/var/folders/lj/7
           | x6d3dxd3cbdtt484k6xsmyh0000gn/T/pip-build-env-
           | ieaydl8r/overlay/lib/python3.9/site-
           | packages/setuptools_rust/build.py", line 23, in <module> from
           | setuptools.command.build import build as CommandBuild # type:
           | ignore[import] ModuleNotFoundError: No module named
           | 'setuptools.command.build' [end of output]
           | note: This error originates from a subprocess, and is likely
           | not a problem with pip.
           | 
           | error: subprocess-exited-with-error
        
             | dceddia wrote:
             | Nope, that doesn't look good! I honestly just googled the
             | error and installing setuptools fixed it for me, but I
             | barely know anything about the Python ecosystem so I'm
             | really just fumbling around here.
        
               | hijp wrote:
               | haha same, thanks
        
             | mvexel wrote:
             | Not quite sure if this is related, but since there's a
             | bunch of statements in there referencing rust: I had to
             | install the rust compiler on my Mac (`brew install rust` if
             | you use homebrew). This is not mentioned in the
             | installation instructions.
        
         | Smaug123 wrote:
         | I'm still not successfully using the GPU, but it's working
         | decently quickly (with the base model - it's incredibly slow to
         | use the Large model) using just the CPU. I'm going to have to
         | check what magic stable-diffusion is doing to enable the GPU :(
        
           | dceddia wrote:
           | There's a --device flag you can pass. I've been trying to get
           | `--device cuda` to work on my Windows machine and it's saying
           | that torch wasn't compiled with CUDA. Trying to figure out
           | what's going on there.
           | 
           | And on the M1, supposedly PyTorch has support for hardware
           | acceleration using MPS (Metal Performance Shaders, announced
           | here https://pytorch.org/blog/introducing-accelerated-
           | pytorch-tra...) but when I tried `--device mps` it blew up
           | with an error "input types 'tensor<1x1280x3000xf16>' and
           | 'tensor<1xf32>' are not broadcast compatible".
        
             | magicalhippo wrote:
             | > I've been trying to get `--device cuda` to work on my
             | Windows machine and it's saying that torch wasn't compiled
             | with CUDA.
             | 
             | I struggled with the same. Here's what worked for me:
             | 
             | Use pip to uninstall pytorch first, should be "pip
             | uninstall torch" or similar.
             | 
             | Find the CUDA version you got installed[1]. Go to PyTorch
             | get started page[2] and use their guide/wizard to generate
             | the pip string, and run that. I had to change pip3 to pip
             | FWIW, and with Cuda 11.6 installed I ended up with "pip
             | install torch torchvision torchaudio --extra-index-url
             | https://download.pytorch.org/whl/cu116".
             | 
             | After that I could use --device cuda, and the difference
             | was immense. On my 2080Ti it went from roughly an hour for
             | a minute with large model, to 10-20 seconds.
             | 
             | [1]: https://stackoverflow.com/a/55717476
             | 
             | [2]: https://pytorch.org/get-started/locally/
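              | 
              | A quick way to confirm the reinstall actually took before re-
              | running whisper (plain PyTorch, nothing whisper-specific):
              | 
              |   import torch
              | 
              |   print(torch.__version__)          # should end in +cu116 for the CUDA 11.6 wheel
              |   print(torch.cuda.is_available())  # True when the GPU build and driver line up
              |   if torch.cuda.is_available():
              |       print(torch.cuda.get_device_name(0))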
        
             | Smaug123 wrote:
             | Yep, same for me, on M1 after enabling MPS (with
             | `model.to("mps")`) it just either SIGSEGV or SIGABRTs every
             | time with that line. The extremely unclean nature of the
             | abort is making it hard to debug :(
        
               | dceddia wrote:
               | I noticed the size seems to correspond to the model. With
               | a large model, the error is tensor<1x1280x3000xf16>. With
               | tiny, it's tensor<1x384x3000xf16>, and with medium it's
               | tensor<1x1024x3000xf16>. It also seems like a bad thing
               | that those are f16's but the "expected" data is f32.
        
               | Smaug123 wrote:
               | I'm giving up for the night, but
               | https://github.com/Smaug123/whisper/pull/1/files at least
               | contains the setup instructions that may help others get
               | to this point. Got it working on the GPU, but it's...
               | much much slower than the CPU? Presumably due to the
               | 'aten::repeat_interleave.self_int' CPU fallback.
               | 
               | Also hitting a nice little PyTorch bug:
               | 
               | > File "/Users/patrick/Documents/GitHub/whisper/whisper/d
               | ecoding.py", line 388, in apply logits[:,
               | self.tokenizer.encode(" ") + [self.tokenizer.eot]] =
               | -np.inf
               | 
               | > RuntimeError: dst_.nbytes() >= dst_byte_offset INTERNAL
               | ASSERT FAILED at "/Users/runner/work/pytorch/pytorch/pyto
               | rch/aten/src/ATen/native/mps/operations/Copy.mm":200,
               | please report a bug to PyTorch.
        
       | faizsn wrote:
       | Faizan
        
       | nik_s wrote:
       | I just tested the model [1] using an RTX3090, trying to translate
       | a french text I found here [2].
       | 
       | Some observations:
       | 
       | - The full translation of the 6:22 minute video takes about 22
       | seconds (17x real time)
       | 
       | - It recognizes the language by default (and did a good job to
       | recognize it was french audio)
       | 
       | - MIT License [3]!
       | 
       | - The quality of the transcription is good, but not perfect.
       | 
       | - The quality of the translation (if you don't consider
       | transcription errors as a translation error) is generally very
       | good.
       | 
       | ---
       | 
       | The transcription:
       | 
       | > Bonjour a tous, <error>j'suis</error> espere que vous allez
       | bien, c''est ENTI. Et aujourd', <error>aujourd',</error> on se
       | retrouve <error>un peu physique</error> pour parler de la termo
       | dynamique. Vous ne vous inquietez pas, ca va bien se passer. On
       | va y aller ensemble, <error>etre a par exemple,</error> je vous
       | accompagne a travers une serie de videos pour vous expliquer les
       | principes de base en termo dynamique. Et bah, c''est parti, on va
       | y aller tranquillement. Lidee, c''est vous puissiez comprendre la
       | termo dynamique dans son ensemble. Donc, je vais vraiment prendre
       | mon temps pour <error>couplisser</error> bien comprendre les
       | notions,
       | 
       | The translation:
       | 
       | > Hello everyone, I hope you're doing well, it's NT and today we
       | find ourselves a little physical to talk about the thermo
       | dynamic. Don't worry, it's going well, we're going to go together
       | and be the same. I'm going to accompany you through a series of
       | videos to explain the basic principles in thermo dynamic. Well,
       | let's go, <error>we're going to go quietly</error>. The idea is
       | that you can understand the thermo dynamic <error>in sound
       | together</error>. So I'm really going to take my time to
       | understand the notions,
       | 
       | ---
       | 
       | All in all very happy that OpenAI is publishing their models. If
       | Stable Diffusion is any guide, people will hack some crazy things
       | with this.
       | 
       | [1] https://github.com/openai/whisper [2]
       | https://www.youtube.com/watch?v=OFLt-KL0K7Y [3]
       | https://github.com/openai/whisper/blob/main/LICENSE
        
         | seszett wrote:
         | > _dans son ensemble_
         | 
         | > _in sound together_
         | 
         | That's hilarious and honestly, incredibly bad. "Dans son
         | ensemble" is a very common idiom (meaning "as a whole") while
         | "in sound together" has to be pretty rare. "Son" means
         | "his/hers/its" as well as "sound", and the former meaning is
         | probably more common in general so I have no idea how this
         | result could arise.
         | 
         | "Termo" also doesn't exist in French, it's "thermo", so the
         | transcript even makes orthographic errors.
         | 
         | And I forgot about "couplisser" which is also a hilarious made-
         | up word that sounds like it could mean something, but doesn't!
         | _Edit_ Google finds exactly one reference of this, in a patent
         | with a typo on the word  "coulisser".
         | 
         | I'm still impressed by the transcript quality since it covers
         | many languages, but the translation part is quite poor.
        
         | StevenWaterman wrote:
         | Was this with the `base` model? `large` is running ok on a P100
         | in colab, but is about 4% the speed of `base.en`. Certainly
         | seems like some of these models will be fast enough for real-
         | time.
        
         | NaturalPhallacy wrote:
         | How did you get it to use the GPU?
         | 
         | I have it running right now and it's not touching the GPU.
        
           | ramblerman wrote:
           | --device "cuda"
        
             | NaturalPhallacy wrote:
             | My version of pytorch didn't have CUDA. I had to install
             | conda to get it, and now it's currently installing.
             | 
             | Whatever the default version that `pip install
             | git+https://github.com/openai/whisper.git` grabbed didn't
             | include it by default.
        
         | joshcryer wrote:
         | It also runs well on a CPU and seems to have proper memory
         | management. Wonderful timing because I was using DeepSpeech for
         | some audio recordings and it required me to script up a
         | splitter to make the files into .wav and then do snippets of 10
         | seconds each. Everything about this just works out of the box.
         | On a core i5 I'm getting about 30 seconds every minute.
         | Transcriptionist jobs just turned into editor jobs. I love how
          | it drops the disfluencies in the audio as well, because it was
          | trained on transcription work, and that is one of the first
          | things you learn to do (drop the uhs and ums and huhs etc.,
          | unless it is a strict verbatim transcription).
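          | 
          | For comparison, the whole split-into-snippets pipeline collapses
          | to a few lines here (paths are placeholders):
          | 
          |   import whisper
          | 
          |   model = whisper.load_model("small.en")
          |   # transcribe() does the 30-second windowing internally,
          |   # so a long mp3 goes in as-is, no manual splitting
          |   result = model.transcribe("long_interview.mp3", fp16=False)
          |   with open("long_interview.txt", "w") as f:
          |       f.write(result["text"])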
        
         | solarmist wrote:
         | Is it translation or transcription? Or both?
         | 
         | Both, wow. This is really interesting.
        
           | StevenWaterman wrote:
           | Both, the blog covers it in detail. Pass in audio in any
           | language, and get an English transcription out.
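            | 
            | Concretely, it's just a task switch: `whisper audio_fr.mp3
            | --task translate` on the CLI, or in Python (the file name is a
            | placeholder):
            | 
            |   import whisper
            | 
            |   model = whisper.load_model("medium")  # the multilingual model, not an *.en variant
            |   french = model.transcribe("audio_fr.mp3")                     # French text out
            |   english = model.transcribe("audio_fr.mp3", task="translate")  # English text out
            |   print(french["text"])
            |   print(english["text"])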
        
           | nik_s wrote:
           | It can do both - I've edited my original post to show the
           | translation task.
        
       | gok wrote:
       | Comparing this model's word error rates to the state of the art
       | [1] on a few common test sets:
        |                            Whisper    SoTA
        | LibriSpeech test-clean        2.7%    1.8%
        | LibriSpeech test-other        5.6%    2.9%
        | Switchboard                  13.1%    4.9%
        | CallHome                     15.8%    9.5%
       | 
       | The authors do explicitly state that they're trying to do a lot
       | of fancy new stuff here, like be multilingual, rather than
       | pursuing just accuracy.
       | 
       | [1] https://github.com/syhw/wer_are_we
        
         | lunixbochs wrote:
         | I suspect Whisper is more robust than other "SOTA" models, but
         | this release is likely leaving a fair bit of accuracy on the
         | table considering the amount of resources OpenAI is capable of
         | throwing at training it.
         | 
         | Comparing the readily available test sets from the paper to
         | some of my personal robust models (for the Talon models, this
         | is greedy decoding, no language model):
          |                      Talon  Talon  Talon  Whisper  wav2vec 2.0
          |                        28M   300M     1B    Large         960h
          | librispeech clean     3.21   2.52   2.40      2.7          2.7
          | librispeech other     8.21   6.56   5.63      5.6          6.2
          | common voice         13.88  11.65   8.86      9.5         29.9
          | tedlium               7.51   6.55   5.47      4.0         10.5
         | 
         | I have a battery of more difficult tests on hand (including
         | adversarial tests, and diverse accent-specific metrics). I'll
         | look at running these tests on each of the Whisper model sizes
         | and following up with a larger comparison.
        
           | allanrbo wrote:
           | Talon was the first thing that came to my mind when I saw
           | this news. Would be nice if it could benefit from Whisper.
           | (Big fan of your work on Talon!)
        
           | ma2rten wrote:
           | I'm looking forward to your comparison. It's really hard to
           | make sense of how good this model actually is without being
           | an expert in the area.
        
           | nshm wrote:
            | It is interesting that in Table 2 they compare with wav2vec2
            | instead of NeMo Conformer (which is more accurate).
        
         | StevenWaterman wrote:
         | One of the things they point out is that the SoTA on e.g.
          | LibriSpeech is _only_ good at LibriSpeech, and doesn't
          | generalise as well.
         | 
         | > Because Whisper was trained on a large and diverse dataset
         | and was not fine-tuned to any specific one, it does not beat
         | models that specialize in LibriSpeech performance, a famously
         | competitive benchmark in speech recognition. However, when we
         | measure Whisper's zero-shot performance across many diverse
         | datasets we find it is much more robust and makes 50% fewer
         | errors than those models.
        
           | lunixbochs wrote:
           | My own experience agrees: the generally available "SOTA"
           | models are not especially robust, and can be _extremely_ bad
           | (>50% absolute error rate) at some tasks. I'll post some
           | preliminary numbers in a sibling comment and look into
           | running my full set of tests on Whisper.
           | 
           | It looks like Whisper is probably leaving a lot of accuracy
           | on the table, but initially it does seem to be a lot more
           | robust than general "SOTA" models.
           | 
           | For a quick comparison, Silero's accuracy charts are kind of
           | nice because they post results for a large variety of
           | datasets. Scroll down to the EN V6 xlarge EE model (not the
           | xlarge CE) [1]
           | 
           | [1] https://github.com/snakers4/silero-models/wiki/Quality-
           | Bench...
        
       | wodenokoto wrote:
       | Is it also a translation model? All the example transcripts are
       | in English, regardless of the language of the purportedly
       | transcribed audio.
       | 
       | The description makes it sound like it is a model for
       | transcribing English audio.
       | 
       | > We've trained and are open-sourcing a neural net called Whisper
       | that approaches human level robustness and accuracy on English
       | speech recognition.
        
       | michelb wrote:
        | Quite a high error rate on a very clearly spoken Dutch audio
        | clip, but way better than anything else I have tried.
        
       | LanternLight83 wrote:
        | Hoping to see this put to use in open source voice assistants,
        | e.g. Mycroft.
        
       | sn41 wrote:
       | Most of the comments here are about law enforcement. I would like
       | to point out that it might be a boon for dictation software. This
       | may make it easier to dictate text/code etc. in any environment.
        
       | liminalsunset wrote:
       | I really wish I had this about half a year ago when I was
       | building a tool to automatically turn online school lectures into
       | searchable, clickable transcripts (kind of like YouTube or EdX
       | transcripts).
       | 
        | I was originally using Adobe Premiere Pro's speech to text to do
        | it, and wrote Python to convert its output to the Hyperaudio
        | format on GitHub. With this, I can skip that whole step
        | entirely, and it's fully open source, too.
       | 
       | App idea:
       | 
       | Build an app that takes a video and uses Hyperaudio or a similar
       | project to add a clickable and searchable transcript (clicking in
       | transcript seeks video)
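        | 
        | Since Whisper's transcribe() returns timestamped segments, a
        | rough sketch of that app (assuming the openai-whisper package;
        | file and model names are placeholders) could be:
        | 
        |     import html
        |     import whisper
        | 
        |     model = whisper.load_model("small")
        |     result = model.transcribe("lecture.mp4")
        | 
        |     # Each segment has "start"/"end" times in seconds plus its
        |     # "text", which is all a click-to-seek transcript needs.
        |     spans = []
        |     for seg in result["segments"]:
        |         spans.append(
        |             "<span onclick='document.querySelector(\"video\")"
        |             ".currentTime={:.2f}'> {}</span>".format(
        |                 seg["start"], html.escape(seg["text"])))
        | 
        |     # Clicking a span seeks the <video> element to that segment.
        |     page = "<video src='lecture.mp4' controls></video>\n<p>{}</p>"
        |     page = page.format("\n".join(spans))
        |     with open("transcript.html", "w") as out:
        |         out.write(page)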
        
         | resoluteteeth wrote:
          | You could already do speech recognition in a fully open
          | source way quite easily with Vosk, although Whisper may be
          | more accurate.
        
       | txtai wrote:
       | Check out this notebook for an example on how to run Whisper as a
       | txtai pipeline in Python or as an API service:
       | https://colab.research.google.com/github/neuml/txtai/blob/ma...
        
       | z3t4 wrote:
        | Why not make a demo that you can try out in the browser via
        | navigator.mediaDevices.getUserMedia? Of course you will get
        | good results if you demo using samples from the training set.
        
       | synergy20 wrote:
        | Is there a high quality text-to-speech equivalent of this
        | project?
        
         | bergenty wrote:
          | Seriously, when I first landed on the page without reading
          | anything else I thought it was text to speech with the "Micro
          | Machines" example and I was floored. The speech to text is
          | obviously mind blowing too.
        
       | throwamon wrote:
       | Is it feasible to use this for Talon-like voice-driven computer
       | usage?
        
         | lunixbochs wrote:
         | If the Whisper models provide any benefits over the existing
         | Talon models, and if it's possible to achieve any kind of
         | reasonable interactive performance, I will likely integrate
         | Whisper models into Talon.
         | 
         | Talon's speech engine backend is modular, with Dragon, Vosk,
         | the WebSpeech API, and Talon's own engine all used in different
         | ways by users.
        
         | FloatArtifact wrote:
          | Maybe; a number of speech recognition engines have been
          | integrated into https://github.com/dictation-toolbox/dragonfly
        
       | dubeye wrote:
       | I know a manual transcription company, which is still seeing
       | modest growth from existing clients who also use ASR, so it's not
       | quite there yet
        
       | londons_explore wrote:
       | I wonder how much the 30 second window is impacting performance?
       | 
       | Anecdotally, I feel like there are plenty of times that I need
       | context from more than 30 seconds ago to understand some
       | technical jargon that's being discussed.
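        | 
        | For what it's worth, the released transcribe() loop does carry
        | some context between windows: by default it conditions each
        | 30-second window on the text decoded so far. A rough sketch
        | (file and model names are placeholders):
        | 
        |     import whisper
        | 
        |     model = whisper.load_model("small")
        | 
        |     # condition_on_previous_text=True (the default) feeds the
        |     # text decoded so far in as the prompt for the next
        |     # 30-second window, which helps keep jargon and spellings
        |     # consistent; False decodes every window cold.
        |     result = model.transcribe("long_talk.mp3",
        |                               condition_on_previous_text=True)
        |     print(result["text"])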
        
       | pmarreck wrote:
       | So it's 100% better than Siri's speech dictation, I see
        
       | eugenhotaj wrote:
       | Now someone just needs to pipe the output into stable diffusion.
        
       | chrisstanchak wrote:
       | Hold on to your papers
        
       | smusamashah wrote:
        | How well does it do on technical, domain-specific speech? For
        | example, I have audio recordings of a senior explaining some
        | very technical aspects of our software. Will it understand the
        | technical terms in that speech?
        | 
        | I guess I will need to download it and run it on those
        | recordings to see how accurate it is.
        
       | emcq wrote:
        | Be wary of using this model - its licensing seems sketchy.
        | Several of the datasets used for training, like WSJ and
        | TED-LIUM, have clear non-commercial clauses. I'm not a lawyer,
        | but releasing the model as "MIT" seems dubious, and hopefully
        | OpenAI paid for the appropriate licenses during training, as
        | they are no longer a research-only non-profit.
        
         | nshm wrote:
          | I think they didn't use WSJ for training, only for
          | evaluation. The paper includes WSJ under "Evaluation
          | datasets".
        
         | jefftk wrote:
         | This is a big dispute right now: OpenAI and other AI companies
         | generally take the position that models learning from data does
         | not make the output of the models a derivative work of that
            | data. For example, GitHub Copilot uses all publicly available
         | GitHub code regardless of license, and
         | DALLE-2/StableDiffusion/etc use lots of non-free images. I
         | don't think this has been challenged in court yet, and I'm very
         | curious to see what happens when it is.
        
           | petercooper wrote:
           | I think it might be even less problematic with something like
           | Whisper than with DALLE/SD? Merely consuming data to train a
           | system or create an index is not usually contrary to the law
           | (otherwise Google wouldn't exist) - it's the _publication_ of
            | copyright content that's thorny (and is something you can
           | begin to achieve with results from visual models that include
           | Getty Photos logo, etc.)
           | 
           | I think it'd be a lot harder to make a case for an accurate
           | audio to text transcription being seen to violate the
           | copyright of any of the training material in the way a visual
           | could.
        
             | jefftk wrote:
              | They're not just training a system but publishing the
              | trained system.
        
           | emcq wrote:
           | This is even slightly more direct: access to WSJ data
           | requires paying LDC for the download, and the pricing varies
           | depending on what institution / license you're from. The cost
           | may be a drop in the bucket compared to compute, but I don't
           | know that these licenses are transferable to the end product.
           | We might be a couple court cases away from finding out but I
           | wouldn't want to be inviting one of those cases :)
        
           | bscphil wrote:
           | > models learning from data does not make the output of the
           | models a derivative work of that data
           | 
           | Most of the debate seems to be happening on the question of
           | whether _everything_ produced by models trained on
           | copyrighted work represents a derivative work. I argue that
           | at the very least _some_ of it does; so the claim said to be
           | made by the AI companies (see quote above) is clearly a false
           | one.
           | 
           | We're in a weird place now where AI is able to generate "near
           | verbatim" work in a lot of cases, but I don't see an obvious
           | case for treating this any differently than a human
           | reproducing IP with slight modifications. (I am not a
           | lawyer.)
           | 
           | For example, copyright law currently prevents you from
           | selling a T-shirt with the character Spider-Man on it. But
           | plenty of AI models can give you excellent depictions of
           | Spider-Man that you could put on a T-shirt and try to sell.
           | It's quite silly to think that any judge is going to take you
           | seriously when you argue that your model, which was trained
           | on a dataset that included pictures of Spider-Man, and was
           | then asked to output images using "Spider-Man" as a search
           | term, has magically circumvented copyright law.
           | 
           | (I think there's a valid question about whether models
           | represent "derivative work" in the GPL sense specifically,
           | but I'm using the idea more generally here.)
        
             | jefftk wrote:
             | That's right: the model is definitely capable of creating
             | things that are clearly a derivative work of what they were
              | trained on. But this still leaves three questions:
             | 
             | * Does the model require a copyright license? Personally I
             | think it's very likely a derivative work, but that doesn't
             | necessarily mean you need a license. The standard way this
             | works in the US is the four factors of fair use
             | (https://copyright.columbia.edu/basics/fair-use.html) where
             | Factor 1 is strongly in favor of the model being
             | unrestricted while 2-4 are somewhat against (and in some
             | cases 4 is strongly against).
             | 
             | * Is all output from the model a derivative work of all of
             | the input? I think this is pretty likely no, but unclear.
             | 
             | * Does the model reliably only emit derivative works of
             | specific inputs when the user is trying to get it to do
             | that? Probably no, which makes using one of these models
             | risky.
             | 
             | (Not a lawyer)
        
       | zeagle wrote:
        | It would be exceptional to get a healthy competitor to
        | Microsoft/Nuance's Dragon monopoly on voice recognition in
        | healthcare. At a couple thousand bucks a license, plus the more
        | recent SaaS subscription trend, there is a lot of money to be
        | made in that space.
        
       | darkpicnic wrote:
       | I just wrote a script with Hazel to automatically transcribe my
       | voice notes to txt. It handles punctuation extremely well. What a
       | wonderful contribution!
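        | 
        | For anyone without Hazel, a rough sketch of the same idea as a
        | plain Python polling loop (folder path and model size are
        | placeholders):
        | 
        |     import time
        |     from pathlib import Path
        | 
        |     import whisper
        | 
        |     WATCH_DIR = Path("~/VoiceNotes").expanduser()
        |     model = whisper.load_model("base")
        | 
        |     while True:
        |         for audio in WATCH_DIR.glob("*.m4a"):
        |             txt = audio.with_suffix(".txt")
        |             if txt.exists():
        |                 continue  # already transcribed
        |             result = model.transcribe(str(audio))
        |             txt.write_text(result["text"])
        |         time.sleep(30)  # poll for new voice notes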
        
         | pbassham wrote:
         | Exactly what I was planning to do! Want to share yours with me?
        
       | braindead_in wrote:
       | Why build a separate model when you can integrate it right into
       | GPT?
        
       | harry8 wrote:
        | Can you plug this into a computer on your premises to get
        | speech recognition without Amazon, Apple, or Google's cloud (or
        | any other cloud) involvement?
        | 
        | Right now I decline all speech recognition because I don't want
        | Orwellian listening devices in my house or pocket, and I
        | haven't seen an answer. (Also, I haven't been bothered enough
        | about speech command interfaces to do a load of research - lazy
        | me.)
        
         | fragmede wrote:
         | Yes, after the download of the model weights (from
         | https://openaipublic.azureedge.net/) it's an entirely offline
         | process.
        
       | abidlabs wrote:
       | Here [1] is a video tutorial on building a web UI that accepts
       | microphone input and runs it through Whisper for speech
       | transcription
       | 
       | [1]
       | https://www.youtube.com/watch?v=ywIyc8l1K1Q&ab_channel=1litt...
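        | 
        | The short version, as a rough sketch (assuming the gradio and
        | openai-whisper packages, with the 2022-era gr.Audio microphone
        | API):
        | 
        |     import gradio as gr
        |     import whisper
        | 
        |     model = whisper.load_model("base")  # placeholder model size
        | 
        |     def transcribe(audio_path):
        |         # Gradio passes a path to the recorded audio file.
        |         return model.transcribe(audio_path)["text"]
        | 
        |     gr.Interface(
        |         fn=transcribe,
        |         inputs=gr.Audio(source="microphone", type="filepath"),
        |         outputs="text",
        |     ).launch()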
        
         | amrrs wrote:
         | Thank you for sharing!
        
       ___________________________________________________________________
       (page generated 2022-09-22 23:02 UTC)