[HN Gopher] Whisper - open source speech recognition by OpenAI
___________________________________________________________________
Whisper - open source speech recognition by OpenAI
Author : _just7_
Score : 1577 points
Date : 2022-09-21 16:16 UTC (1 day ago)
(HTM) web link (openai.com)
(TXT) w3m dump (openai.com)
| wongarsu wrote:
| > About a third of Whisper's audio dataset is non-English, and it
| is alternately given the task of transcribing in the original
| language or translating to English. We find this approach is
| particularly effective at learning speech to text translation and
| outperforms the supervised SOTA on CoVoST2 to English translation
| zero-shot.
|
| That's intriguing. You can just set the model to transcribe
| everything into English, no matter which language the speaker is
| using, and it just works. Given that many people are much better
| at understanding English than at speaking it, this might make
| voice interfaces much more accessible without much work.
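| A minimal sketch of what that looks like with the Python API,
| going by the task option (model size and file name here are just
| placeholders):
|
|     import whisper
|
|     # Any multilingual checkpoint (i.e. not a ".en" one) can do this.
|     model = whisper.load_model("medium")
|
|     # task="translate" asks for English output no matter what language
|     # is spoken; task="transcribe" keeps the original language instead.
|     result = model.transcribe("speech_in_any_language.wav", task="translate")
|     print(result["text"])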
| FloatArtifact wrote:
| This would be a cool thing to integrate into Dragonfly
| https://github.com/dictation-toolbox/dragonfly
| synkarius wrote:
| It would. I wonder how this compares with Kaldi, one of the two
| open source speech recognition engines that Dragonfly currently
| supports.
| rexreed wrote:
| I'd love to find a way to test this with longer audio but I don't
| have GPU resources and not exactly sure how to load that into the
| Colab. Is anyone planning on hosting or sharing a model that can
| be used by others to test longer form audio (for podcast
| transcription)?
| londons_explore wrote:
| I've never seen transcription and translation combined into a
| single step like this before...
|
| Have I been living under a rock, or is this new?
|
| I assume it should help performance, because it means emphasis,
| timing and tone can be used to inform the translation. Helps make
| better guesses about information missing from the source
| language.
| jerpint wrote:
| I recorded myself speaking French and was able to translate
| decently well on my laptop. Very impressive!
| jfoster wrote:
| It seems like OpenAI are finally living up to their name for once
| with this release? Anything I'm missing?
|
| From what I can gather:
|
| 1. Includes model weights. I can't find the URL, but they
| reference them enough and have a CLI tool, so I presume I just
| haven't found them yet.
|
| 2. Includes code: https://github.com/openai/whisper
|
| 3. Released under MIT License:
| https://github.com/openai/whisper/blob/main/LICENSE
| thesausageking wrote:
| It's one model and in a non-strategic area where there are
| existing open source projects (Kaldi, DeepSpeech, ...).
|
| For a company that raised $1B, that's not exactly living up to
| their name and original mission.
| blagie wrote:
| Yes. The same is true of many products from many companies.
|
| I feel bad about GPT-3 and DALL-E being released under the
| terms they were, but I don't feel bad about this. I'm not
| going to condemn OpenAI for the good things they did, but I
| will hold them accountable for bad things or good ones they
| didn't do.
|
| I'd given up on OpenAI being open or ethical, but this is a
| start. It took them down from "evil super-villain" status to
| mere villain.
| whimsicalism wrote:
| > It's one model and in a non-strategic area where there are
| existing open source projects (Kaldi, DeepSpeech, ...).
|
| I can already tell this is much better than any of the
| existing open source projects with the exception of the wav2*
| sequence of projects and potentially nvidia's nemo.
| thesausageking wrote:
| Kaldi is an open, pluggable framework and is a ton more
| flexible and powerful than this. It's used by hundreds of
| teams, including a number of consumer tech companies you've
| heard of. They're not going to move to this over it.
|
| Especially because ASR is a living organism. You have to
| constantly update your language model as new people, ideas,
| and words move into the normal lexicon. As people start
| talking about "COVID", "metaverse", "king charles", or
| whatever new things that happen, these need to be added to
| your language model. You need these updates monthly at a
| minimum and OpenAI didn't release the raw data which means
| you can't retrain it even if you wanted to spend the
| time/resources to.
|
| So, this is an interesting research project and helpful for
| small teams and side projects, but it's unlikely it makes
| any real impact on the industry.
| whimsicalism wrote:
| Kaldi just is not fast or high quality enough compared to
| other modern alternatives like wav2letter. I appreciate
| that it is more flexible than this, it certainly is - but
| I am not so sure about "powerful."
| [deleted]
| StevenWaterman wrote:
| (Model weights from
| https://github.com/openai/whisper/blob/main/whisper/__init__...
| )
|
| "tiny.en": "https://openaipublic.azureedge.net/main/whisper/mod
| els/d3dd5..."
|
| "tiny": "https://openaipublic.azureedge.net/main/whisper/models
| /65147..."
|
| "base.en": "https://openaipublic.azureedge.net/main/whisper/mod
| els/25a85..."
|
| "base": "https://openaipublic.azureedge.net/main/whisper/models
| /ed3a0..."
|
| "small.en": "https://openaipublic.azureedge.net/main/whisper/mo
| dels/f953a..."
|
| "small": "https://openaipublic.azureedge.net/main/whisper/model
| s/9ecf7..."
|
| "medium.en": "https://openaipublic.azureedge.net/main/whisper/m
| odels/d7440..."
|
| "medium": "https://openaipublic.azureedge.net/main/whisper/mode
| ls/345ae..."
|
| "large": "https://openaipublic.azureedge.net/main/whisper/model
| s/e4b87..."
| mmastrac wrote:
| Large is 3GB to save everyone a click. Tiny is 72MB.
| anigbrowl wrote:
| That's unexpectedly lightweight - enough to run in some
| phones.
| yencabulator wrote:
| However, https://github.com/openai/whisper#available-
| models-and-langu... says requires ~1 GB VRAM.
| solarmist wrote:
| This kind of model is harder to abuse, so I guess it passed
| their internal checks much more easily.
|
| I can understand not releasing GPT-3, even if I disagree with
| the decision.
| ignoramous wrote:
| > _This kind of model is harder to abuse, so I guess it
| passed their internal checks much more easily._
|
| The version I choose to believe: _stability.ai_ ate DALL-E
| for lunch, and that woke them up.
| solarmist wrote:
| This is probably also true.
| jfoster wrote:
| True. The potential of GPT-3 to cause internet mayhem was/is
| significant. I would argue that the mere act of announcing it
| was still a catalyst for an eventual GPT-3-like model being
| released. In revealing it, they established a target for what
| open source models could aim to achieve, and simultaneously
| got bad actors thinking about ways to abuse it.
| zarzavat wrote:
| It was a credible argument when GPT-3 was released. But now
| there are open models that are as capable as GPT-3 and that
| mayhem has not materialized, with the possible exception of
| GPT-4chan. They could release it now under a non-commercial
| license, if they cared to.
| jfoster wrote:
| Can you provide an example of an open model as capable as
| GPT-3?
|
| I know there's some "mini-GPT" type models around, but
| they don't seem nearly as capable.
| dwohnitmok wrote:
| > I can understand not releasing GPT-3, even if I disagree
| with the decision.
|
| Why do you disagree?
| bigyikes wrote:
| I don't see how GPT-3 is any more dangerous than Stable
| Diffusion, Photoshop, that fake news website the crazy
| person you're friends with on Facebook really likes, or any
| of the number of other tools and services that can be used
| to generate or spread fake information.
| jfoster wrote:
| All of your examples are limited in some way, but GPT-3
| wouldn't have any meaningful limits.
|
| Stable Diffusion: Marks images as AI-generated.
| (invisible watermark, but still, it's there)
|
| Photoshop: Requires time & effort from a human.
|
| Fake news website: Requires time & effort from a human.
| xkapastel wrote:
| I wouldn't really say Stable Diffusion marks images as
| AI-generated. There's a script in the Stable Diffusion
| repository that will do that, but it's not connected to
| the model itself in a meaningful way. I use Stable
| Diffusion a lot and I've never touched this script.
|
| https://github.com/CompVis/stable-
| diffusion/blob/69ae4b35e0a...
| capableweb wrote:
| What "script" are you using for doing txt2img? The
| watermark function is automatically called when you use
| the CLI in two places, https://github.com/CompVis/stable-
| diffusion/blob/69ae4b35e0a... and
| https://github.com/CompVis/stable-
| diffusion/blob/69ae4b35e0a...
|
| Trivial to remove, I give you that. But AFAIK, the
| original repository + most forks put the watermark
| automatically unless you've removed it on your own.
| serf wrote:
| >Trivial to remove, I give you that. But AFAIK, the
| original repository + most forks put the watermark
| automatically unless you've removed it on your own.
|
| almost all of the 'low-vram' variant forks either have an
| argument to turn off the watermark (it saves a bit of
| memory) or come with it disabled altogether.
| nullc wrote:
| It would be pretty trivial to have an invisible watermark
| in GPT3 output-- though you don't really need one: just
| score text with gpt3 to find out if it was likely gpt3
| generated or not.
| spullara wrote:
| SD only does that if you don't delete the line of code
| that does it...
| [deleted]
| mmh0000 wrote:
| Because why should the wealthy and connected be the only
| ones -allowed- to have access to such life-improving
| technology?
| solarmist wrote:
| Two reasons. First, someone else will release something
| similar. Second, I didn't see a related push from them to
| work with others in the industry to do something productive
| towards safety with the time they got by delaying
| availability of these kinds of models. So it felt
| disingenuous.
| moyix wrote:
| Several groups already have. Facebook's OPT-175B is
| available to basically anyone with a .edu address (models
| up to 66B are freely available) and Bloom-176B is 100%
| open:
|
| https://github.com/facebookresearch/metaseq
|
| https://huggingface.co/bigscience/bloom
| solarmist wrote:
| Yup. I meant when it had just come out.
| bredren wrote:
| This is dropping right in the middle of Interspeech 2022.
|
| I don't believe OpenAI has anyone presenting at the conference,
| so presumably this was timed to coincide with that and get buzz
| at the conference.
|
| Curious how this model compares with foss STT from the startup
| Coqui.
| Tistron wrote:
| It understands my Swedish attempts at English really well with
| the medium.en model. (Although, it gives me a funny warning:
| `UserWarning: medium.en is an English-only model but receipted
| 'English'; using English instead.`. I guess it doesn't want to be
| told to use English when that's all it can do.)
|
| However, it runs very slowly. It uses the CPU on my macbook,
| presumably because it hasn't got a NVidia card.
|
| Googling about that I found
| [plaidML](https://github.com/plaidml/plaidml) which is a project
| promising to run ML on many different gpu architectures. Does
| anyone know whether it is possible to plug them together somehow?
| I am not an ML researcher, and don't quite understand anything
| about the technical details of the domain, but I can understand
| and write python code in domains that I do understand, so I could
| do some glue work if required.
| revskill wrote:
| It's actually better than Google Meet's subtitle system.
| blueberrychpstx wrote:
| This is absolute garbage python as I am neither a python
| developer, nor a good developer. I was trying to play around with
| real time transcriptions. However, it does work!
|
|     * recording
|     * done recording
|     Recording saved to file.wav
|     Press enter to transcribe
|     /Users/laptop/Development/Personal/Public/pythonProject1/venv/lib/python3.9/site-packages/whisper/transcribe.py:70: UserWarning: FP16 is not supported on CPU; using FP32 instead
|       warnings.warn("FP16 is not supported on CPU; using FP32 instead")
|     Detected language: english
|     Goodbye, I need to go pick up my wife.
|     Press enter to start recording
|
| Any improvements welcome here.
|
| ```
| # Records a short clip from the default microphone, saves it to
| # file.wav, then transcribes it with Whisper.
| import wave
|
| import pyaudio
| import whisper
|
|
| def record_microphone(seconds):
|     CHUNK = 1024
|     FORMAT = pyaudio.paInt16
|     CHANNELS = 1
|     RATE = 44100
|     RECORD_SECONDS = seconds
|     WAVE_OUTPUT_FILENAME = "file.wav"
|
|     p = pyaudio.PyAudio()
|     stream = p.open(format=FORMAT,
|                     channels=CHANNELS,
|                     rate=RATE,
|                     input=True,
|                     frames_per_buffer=CHUNK)
|
|     print("* recording")
|     frames = []
|     for _ in range(0, int(RATE / CHUNK * RECORD_SECONDS)):
|         data = stream.read(CHUNK)
|         frames.append(data)
|     print("* done recording")
|
|     stream.stop_stream()
|     stream.close()
|     p.terminate()
|
|     # Write the captured frames out as a 16-bit mono WAV file.
|     wf = wave.open(WAVE_OUTPUT_FILENAME, 'wb')
|     wf.setnchannels(CHANNELS)
|     wf.setsampwidth(p.get_sample_size(FORMAT))
|     wf.setframerate(RATE)
|     wf.writeframes(b''.join(frames))
|     wf.close()
|     return WAVE_OUTPUT_FILENAME
|
|
| if __name__ == '__main__':
|     seconds = 5
|     # Load the model once up front rather than on every loop iteration.
|     model = whisper.load_model("base")
|     while True:
|         print("Press enter to start recording")
|         input()
|         filename = record_microphone(seconds)
|         print("Recording saved to " + filename)
|         print("Press enter to transcribe")
|         input()
|         result = model.transcribe(filename)
|         print(result["text"])
| ```
| yawnxyz wrote:
| Oh man I remember LOVING Micro Machines as a kid.
|
| But also, this tool seems much better than Otter.ai, which gets
| every third word wrong when transcribing microbiology recordings
| alexb_ wrote:
| Combine the translation + transcription with voice synthesis, and
| once compute power allows for this to be miniaturized we will be
| able to have babel-fish technology in real life.
| no1youknowz wrote:
| This is awesome. But I really want the other way around: to be
| able to give it text and hear the speech, i.e. TTS (text to
| speech).
|
| As a language learner, the ability to create my own sentences
| (based on existing ones I have, changing a word here or there)
| would be amazing.
|
| How long till we have this, I wonder. I know I could use a
| service to do this currently, but I'd prefer something running
| locally.
|
| Hopefully someone in the OpenAI team reads this. :)
| freedomben wrote:
| Likewise, TTS is what I really want. My goal is to be able to
| create audio books from text. I've been using Amazon Polly and
| it's acceptable quality, but I would be ecstatic to be able to
| do it locally on my own hardware.
| visarga wrote:
| Check out NaturalReader. It has hundreds of amazing voices, a
| system for highlighting text as it is being read, works on
| books (pdf) and webpages, and is available on phones and in
| browsers on all platforms. So I could have the same voice on
| Mac, Linux and iPhone.
| TaylorAlexander wrote:
| I suspect this is coming. I mean we do have decent text to
| speech systems already, but in this vein of "we used neural
| networks and now it's very very good" you can imagine that with
| something like GPT-3, to extend it they could use this speech
| to text system so you could speak to it for input, and then a
| natural progression is that it can use text to speech to return
| the output, so you just have a voice oriented conversational
| system.
|
| So I think TTS is a logical part of the system. I also think
| that there are peculiarities of voice interaction that aren't
| captured in text training datasets, so they would need to do
| some fine tuning on actual voice conversation to make it feel
| natural.
|
| All in due time I suppose.
| visarga wrote:
| A full NLP system would include speech recognition, TTS, a
| large language model, and a vector search engine. The LM
| should be multi modal, multi language and multi task, "multi-
| multi-model" for short haha. I'm wondering when we'll have
| this stack as default on all OSes. We want to be able to
| search, transcribe, generate speech, run NLP tasks on the
| language model and integrate with external APIs by intent
| detection.
|
| On the search part there are lots of vector search companies
| - Weaviate, Deepset Haystack, Milvus, Pinecone, Vespa, Vald,
| GSI and Qdrant. But it has not become generally deployed on
| most systems, people are just finding out about the new
| search system. Large language models are still difficult to
| run locally. And all these models would require plenty of RAM
| and GPU. So the entry barrier is still high.
| TaylorAlexander wrote:
| Ah very interesting thank you. I'm not familiar with
| research in to vector search, I'll look that up.
|
| But yeah you make a good point about LLMs being too large
| to run on a normal PC. I do somewhat suspect that we might
| see some rapid acceleration in the size of neural network
| processors as large models begin to offer more utility. I
| think for now they have limited appeal but we're already
| seeing things like Tesla's Dojo make large leaps in
| capability to rapidly process complex networks.
|
| In five to ten years we may see built in accelerators come
| standard in most computers capable of running very complex
| models. Already Apple provides ever more powerful
| accelerators in their phones. You could imagine Adobe
| offering real time diffusion models as part of Photoshop,
| among other things.
| noreally_ wrote:
| A notebook is available to try with your microphone on Colab
| here: https://colab.research.google.com/drive/1nBZ-
| pDIaIi3N1DIIXvJ...
|
| I'm surprised by the quality on non-English languages, given that
| 80+% of the training data is English, and the rest is split
| between tens of languages.
| bambax wrote:
| Thanks! I played with this in French and posted the results as
| replies to this comment:
| https://news.ycombinator.com/item?id=32928643
|
| It's sometimes close to perfect, and sometimes goes off the
| rails; I think that maybe the model tries to establish some sort
| of consistency for each sentence; if it starts wrong for the
| first few words of a sentence, it can't build the rest properly.
|
| But it's super fun.
| berberous wrote:
| How do you get this to translate instead of just transcribe?
| paraschopra wrote:
| Just specify language and record an audio in another
| language.
|
| >result = model.transcribe("audio.wav", language="english")
| berberous wrote:
| That actually seems to set the language for it to
| transcribe (as opposed to it guessing), with the following
| triggering a translation to English:
|
| result = model.transcribe("audio.wav", task="translate")
|
| But your post helped me figure out the above, so thank you!
| tekacs wrote:
| To be more specific than the above:
|
| 1. Make sure you're using a model that isn't suffixed with
| `.en` (`base`, not `base.en`).
|
| 2. Use `model.transcribe(your_input_audio, language='Japanese',
| task='translate')` ... with the appropriate input language.
| goffi wrote:
| Really interesting, I can see a ton of potential uses.
|
| 2 questions:
|
| 1) how does it compare to state-of-the-art FOSS solutions? I'm
| thinking of DeepSpeech or Vosk
|
| 2) would it be somehow possible to associate timestamps with the
| recognized words? That would be amazing for things such as audio
| editing or skipping to a particular location in a video
| nshm wrote:
| You rightly mentioned timestamps. There are many other
| important properties of a good ASR system, like vocabulary
| adaptability (whether you can introduce new words), streaming,
| confidences, or latency of the output. Compared to Vosk models,
| this model cannot work in a streaming manner, so it's not very
| suitable for real-time applications.
|
| But in general the model is robust and accurate and trained on
| the amount of speech we never dreamed about in Vosk. We will
| certainly benefit from this model as a teacher (together with
| others like gigaspeech models). I recently wrote about it
| https://alphacephei.com/nsh/2022/06/14/voting.html
| goffi wrote:
| > goffi
|
| for 2), it's actually written in the description: "phrase-level
| timestamps", so it should be possible (phrase level is neat for
| skipping to a specific location in a video, but maybe not for
| audio editing).
| catfan wrote:
| IceWreck wrote:
| Is there a list of system requirements somewhere ? Can it run on
| cheaper low memory GPUs ? maybe CPUs ?
| yjftsjthsd-h wrote:
| On my ancient desktop it happily fell back to running on CPU
| just fine.
| StevenWaterman wrote:
| Their models range from 70 MB to 3 GB. The largest model is
| smaller than the optimised stable diffusion. Not sure what the
| inference speed is like, haven't tried it myself yet.
| IceWreck wrote:
| I just tested it myself. Its fast enough on colab, couple of
| seconds but not sure if its fast enough to transcribe
| realtime audio yet.
| lynguist wrote:
| "small" runs in realtime on Macbook Air M1 CPU.
| MacsHeadroom wrote:
| Colab is using one of the larger models. Tiny probably runs
| in realtime on a single core of an RPi.
| [deleted]
| mewse-hn wrote:
| I know this isn't a tech support forum but maybe someone here
| knows. I'm attempting the sample python code from the github and
| _almost_ get a transcription running on my work laptop without a
| GPU, but I run into this error message:
|
| >>> result = whisper.decode(model, mel, options)
|
| Traceback (most recent call last):
|
| [snip]
|
| RuntimeError: "slow_conv2d_cpu" not implemented for 'Half'
|
| It looks like a Torch error, is there some twiddling with
| "options" I can do to get it to run?
| mewse-hn wrote:
| I seem to have worked around it by tweaking the "options" line
| from the sample code to this:
|
| >>> options = whisper.DecodingOptions(fp16=False)
| ignite wrote:
| I am running on a work laptop without a GPU (in Docker). I
| just get:
|
|     warnings.warn("FP16 is not supported on CPU; using FP32 instead")
|
| And it works.
| arpankapoor wrote:
| I tried it out on a Hindi speech
| (https://www.youtube.com/watch?v=4EpfJxKyosE). The transcription
| starts off decent, but kind of gets stuck repeating the same
| thing at the 02:40 mark: [00:00.000 -->
| 00:10.000] pcaas taal meN hmne prgtii kiye, isse ko iNttaar
| nhiiN kr sktaa / [00:10.000 --> 00:20.000] chunaao ke
| dauraan vott maaNgte hue, srkaar kii niitiyoN pr ktthor se ktthor
| prhaar krte hue, [00:20.000 --> 00:28.000] aur puraanii
| srkaar kii niitiyoN nhiiN aalocnaa krne ke lie laik bhut saamgrii
| thii / [00:28.000 --> 00:35.000] hr jge maiNne ye khaa
| ki maiN un logoN meN se nhiiN huuN, jo pcaas vrc kii uplddyoN pr
| paanii phir de / [00:35.000 --> 00:43.000] aisaa krnaa
| desh ke purssaarth pr paanii phirnaa hogaa / aisaa krnaa desh ke
| kisaan ke saath anyaay krnaa hogaa / [00:43.000 -->
| 01:01.000] mlduur ke saath jaattii krnii hogaa / aam aadmii ke
| saath bhii vo acchaa vyohaar nhiiN hogaa / jo svaal aaj mn meN
| ucchaa hai aur ucchnaa caahii hai / aadaavii ko pcaas saath hone
| aaye, hm jaintii mnaane jaa rhe haiN / [01:01.000 -->
| 01:18.000] aaj desh kii stitii kyaa hai / hm pichr ke hoge haiN
| / prgtii kii dodd' meN, jo desh hmaare saath aajaad hue the, vo
| hm se aage bddh' ke / jo desh hmaare baac jn meN the, vo hmeN
| piice chodd' the / [01:18.000 --> 01:34.000] duniyaa ke
| grii tm deshoN meN hmaarii gdd'n aaye / viis phiij'ii se jaanaa
| lo griibii kii rekaa ke niice / raaktptii mhudaay ke vibhaashn
| meN gaauuN kaa ullek haiN naa piire kaa paanii nhiiN /
| [01:34.000 --> 01:50.000] hm praathmii shikssaa anivaare nhiiN
| kr skte haiN / lddkiyoN kii shikssaa kii upekssaa ho rhii haiN /
| lddki kaa jnm lenaa to is desh meN abhii tk ek abhishaap hai /
| [01:50.000 --> 02:07.000] kyaa srkaarii kdm utthaakr smaaj meN
| jaagdRtii paidaa krkeN / kyaa sb logoN ko juttaakr ye to aisaa
| kaam hai jis meN koii dlbNdii ke lie isthaan nhiiN / hm desh kaa
| nkssaa nhiiN bdl skte haiN / desh meN saadhnoN kii kmii nhiiN
| hai / [02:07.000 --> 02:07.000] aur saadhnoN kii agr
| kmii hai to usko tthiik dnt se praapt kiyaa jaa sktaa hai /
| saadhn bdd'aae bhii jaa skte hai / lekin jo saadhn haiN unkaa
| tthiik upyog nhiiN ho rhaa / jNtaa ke upr tteks lgaakr jo dnni
| kptaa kiyaa jaataa hai / uskaa laag jNtaa tk nhiiN phu
| [02:37.000 --> 02:37.000] rkhkm jaatii hai / videshii baiNko
| meN dn jaane kaa silsilaa abhii tk kyoN kaaeN hai / usko lokne
| ke lie kyaa kdm utthaaege / hm videshii puujii ke lie
| praitrshiil haiN videshii puujii aae aur agr videshii puujii
| aatii hai acche dnt kii ttek [03:07.000 --> 03:07.000]
| acche dnt kii puujii aatii hai acche dnt kii puujii aatii hai
| acche dnt kii puujii aatii hai acche dnt kii puujii aatii hai
| [03:37.000 --> 03:39.000] acche dnt kii puujii aatii hai acche
| dnt kii puujii aatii hai [04:07.000 --> 04:09.000] acche
| dnt kii puujii aatii hai acche dnt kii puujii aatii hai
| [04:37.000 --> 04:39.000] acche dnt kii puujii aatii hai acche
| dnt kii puujii aatii hai
|
| The translation does a much better job however:
| [00:00.000 --> 00:10.000] In the last 50 years, we have made
| progress, no one can deny this. [00:10.000 --> 00:20.000]
| During the elections, while asking for votes, while attacking the
| government's policies harshly, [00:20.000 --> 00:28.000]
| and to criticize the policies of the old government, a lot of
| material was needed. [00:28.000 --> 00:35.000]
| Everywhere, I have said that I am not one of those people who
| pour water on the fruits of 50 years. [00:35.000 -->
| 00:39.000] To do this, we will have to pour water on the efforts
| of the country. [00:39.000 --> 00:43.000] To do this, we
| will have to do injustice with the farmers of the country.
| [00:43.000 --> 00:45.000] We will have to do caste with the
| laborers. [00:45.000 --> 00:50.000] Even with the common
| man, that will not be a good behavior. [00:50.000 -->
| 00:55.000] The question that arises in the mind today and should
| arise, [00:55.000 --> 01:01.000] Freedom has come to be
| 50 years, we are going to celebrate. [01:01.000 -->
| 01:04.000] What is the situation of the country today?
| [01:04.000 --> 01:07.000] Why did we get separated?
| [01:07.000 --> 01:14.000] In the race of progress, the country
| that got freedom along with us, they went ahead of us.
| [01:14.000 --> 01:19.000] The country that was after us, they
| left us behind. [01:19.000 --> 01:25.000] In the poorest
| countries of the world, they counted us. [01:25.000 -->
| 01:29.000] 20% of the population is below the poverty line.
| [01:29.000 --> 01:35.000] In the speech of the President, there
| is no mention of villages or drinking water. [01:35.000
| --> 01:39.000] We cannot enforce primary education.
| [01:39.000 --> 01:43.000] The education of girls is being
| neglected. [01:43.000 --> 01:50.000] The birth of a girl
| is still a curse in this country. [01:50.000 -->
| 01:55.000] Is it by taking government steps, by creating
| awareness in the society? [01:55.000 --> 02:01.000] Is
| it by uniting all the people that there is no place for party?
| [02:01.000 --> 02:05.000] Can't we change the map of the
| country? [02:05.000 --> 02:08.000] There is no shortage
| of resources in the country. [02:08.000 --> 02:14.000]
| And if there is a shortage of resources, it can be obtained in
| the right way, resources can be increased. [02:14.000 -->
| 02:21.000] But the resources that are there, they are not being
| used properly. [02:21.000 --> 02:30.000] The wealth that
| is collected by taxing the public, its profit does not reach the
| public, it does not reach the common man. [02:30.000 -->
| 02:32.000] Where does it go? [02:32.000 --> 02:35.000]
| Whose pockets are filled? [02:35.000 --> 02:39.000]
| Whose treasury does that money go to? [02:39.000 -->
| 02:44.000] Why is the chain of money going to foreign banks
| still established? [02:44.000 --> 02:47.000] What steps
| have been taken to stop it? [02:47.000 --> 02:52.000] We
| are motivated for foreign worship, foreign worship has come.
| [02:52.000 --> 03:01.000] And if foreign worship comes for good
| technology, for infrastructure, [03:01.000 --> 03:06.000]
| for education, then no one will object. [03:06.000 -->
| 03:11.000] I believe that our communist friends will not object
| either. [03:11.000 --> 03:19.000] But is the maximum use
| of the resources in the country happening? [03:19.000 -->
| 03:26.000] Is it not true that corruption has become a national
| disease? [03:26.000 --> 03:31.000] I remember that
| Swargi Rajiv Gandhi had said in a speech that I send one rupee
| from Delhi, [03:31.000 --> 03:36.000] but where I send
| the rupee, as I reach there, 19 paise are left.
| [03:36.000 --> 03:41.000] I asked him how this miracle happens.
| [03:41.000 --> 03:47.000] Bhaskar said that when the rupee runs,
| it shrinks. [03:47.000 --> 03:54.000] The rupee shrinks,
| it gets into the hand, it goes into the pocket, it becomes small.
| [03:54.000 --> 03:58.000] It is difficult to recognize the
| rupee. [03:58.000 --> 04:02.000] The rupee can be
| hidden. [04:02.000 --> 04:06.000] The situation of the
| currency of the country is not good. [04:06.000 -->
| 04:10.000] First, the government expenditure has increased, it
| is increasing. [04:10.000 --> 04:17.000] It needs common
| consent to reduce without reducing. [04:17.000 -->
| 04:24.000] No one can work in the same way. [04:24.000
| --> 04:27.000] Yes, our old Prime Minister Narasimha Raoji,
| [04:27.000 --> 04:34.000] if he would have tried in this
| direction after stabilizing himself, then he would have
| succeeded. [04:34.000 --> 04:47.000] But he was stuck in
| some such things that he could not pay attention to these
| problems.
| O__________O wrote:
| Anyone know if it is possible to output IPA using this?
|
| International Phonetic Alphabet (IPA)
|
| - https://wikipedia.org/wiki/International_Phonetic_Alphabet
|
| _________
|
| EDIT: Based on the list of languages in the tokenizer code here,
| it doesn't appear IPA is supported:
|
| https://github.com/openai/whisper/blob/5f8d4bcc254d4f3e833d3...
| gck1 wrote:
| Got my hopes high that there's finally an open source solution
| that can deal with Georgian language, only to get my hopes
| brutally destroyed. It successfully detects a language and then
| produces garbage. Passing language manually produced similar
| results.
|
| Result of my own recording: Detected language:
| georgian yichiyannaaisw remnants ts founding hockey slee
| syi eling bhthwaaicularly qfiithoAPPLAUSEPS thDavPin Dao pDING
| Mozai pryadk uk aa orchestral uk aa arter uu BrettM
| hilarious l ryy ywaa vark pk * Poll statements lypson. ch`ch`r
| uesi[?]meislemveerrshueairelmirisasasssesserersiveesrrilmexre
| reimimep`emsese
|
| Results of clear Georgian audio [1].
|
| On tiny model: Detected language: georgian
| [00:00.000 --> 00:21.560] en [00:21.560 --> 00:23.240] Wo
| Lun Lun ... [00:23.280 --> 00:43.720] Wo Lun Lun Lun Lun
| Lun Lun Lun Lun Lun Lun Lun Lun Lun Lun Lun Lun Lun Lun Lun Lun
| Lun Lun Lun Lun Lun Lun Lun Lun Lun Lun Lun Lun Lun Lun Lun Lun
| Lun Lun Lun Lun Lun Lun Lun Lun Lun Lun Lun Lun Lun Lun Lun Lun
| Lun Lun Lun Lun Lun Lun Lun Lun Lun Lun Lun Lun Lun Lun Lun Lun
| Lun Lun Lun Lun Lun Lun Lun Lun Lun Lun Lun Lun Lun Lun Lun Lun
| Lun Lun Lun Lun Lun Lun Lun Lun Lun Lun Lun Lun Lun Lun Lun Yin
| Wei b forestry
|
| On medium model: Detected language: georgian
| sreiresrrrrrrrrrrrrrnsssrrrrree rrirrrrrrrrre
| rsrngnrrrrsrrrrrrrorrrrrrrrrrr kLbHMHMHMHMHMHMHMHMMHMMMMMMMMMMMMM
| MMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMM hMLhM hMM hMM
| HMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMM
| KklllM Hklll so lll lllll k kbsbk i K H H k l H k lI H m
| cizar wait h examined pwaadmHa`eM supervision ng
| ieeeeeeeeeeeeeeeee maee eaeeeeeeeeeeeeeeeeeeee daeeeeeeeeeeeee
| ueeeeeeeeeeeee ea [?] mii smeii mmiei Yk` siiei savie
| siiit`t`iimemi, raee siime siii g'iiiiceiri saeieii siiei si
| veep`veiiie k`leesheeroeeeeeeeeeeeee. egez
| eqaksheieeeeeeeeeeeeeeeeeeeeeeeeeeeeea, nrropiroo mmumin
| seeknp`ee see[?]igosh szhebimeleleekirpie semeime seeimmm
| seenemeei se t Famose mindeyo hqe bywall jaini threshold ji jani
| den poder vlogging bywall Take the text Ba tou yodamj je
| te shake ba te shake baou contour but whatever Baou cube baou cup
| Baou rope Baou people Qeful Qeful imiiimibt`mit`iiit`iiiiiiii
| raoeoooenpeeeieiiiiiiiiiomiiiiiiiii riiiiiiiiiiimii
| nseeeeeeeeeeeeeee
| sareeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeee
| mjwai[?]v eeeid[?]vv nabdadeb lmireeeep`eduiveveeeiieeeee
| rareieeeeveeeeevee sarreeeeeeeeeeeeeeeeeeeeeeeeeee
| xshiiiiiiiiiiiii liiiiiii liiiiiiiiii liii liiiiiii laiiiii
| eiiiiiiiiiiiiiii iiii m
|
| I've also tested it on few other audio inputs and it failed to
| produce meaningful results on all of them with all models.
|
| There was one case with another audio [2] and tiny model, where
| it got at least some words close to their phonetic values, but
| printed them in cyrillic instead of Georgian and tried to
| interpret some Georgian words as Russian: whisper
| audio.wav --language Georgian --task transcribe --model tiny
| [00:00.000 --> 00:02.000] <<Zurab Gercha Dzhaparzis Gants
| Khatevarom [00:02.000 --> 00:04.000] umeren tsupasu
| Khizgeblotu kashchepasta [00:04.000 --> 00:06.000] a
| opozatsionermii chlen shonakhlari [00:06.000 --> 00:07.000]
| s drorodisat Sakartolom [00:07.000 --> 00:09.000] s
| akutariteritoriia biunda dai bronos [00:09.000 -->
| 00:10.000] ta tasovyi torusi sam kadr [00:10.000 -->
| 00:12.000] Sakartolomshii rovno ukraienistu [00:12.000 -->
| 00:13.000] shchoigo eknebo [00:13.000 --> 00:14.000]
| amsiasakheb kirchi metitausu [00:14.000 --> 00:15.000]
| khlebisliderma [00:15.000 --> 00:17.000]
| utsnoktangadatsema shcheisiaa ugraduntsa ...
|
| [1] https://www.youtube.com/watch?v=rE_zx_6RhL0 [2]
| https://www.youtube.com/watch?v=elrXgO8hjtI
| jcims wrote:
| Did respectably with some mumble rap:
| https://controlc.com/d353dafb
|
| (some NSFW words in the lyrics obv)
| derangedHorse wrote:
| Whisper performed a lot better than I would've expected it to!
| mmh0000 wrote:
| Okay this is super impressive. I just downloaded Whisper and fed
| it a random flac file I had handy and it did a really good job.
| Also impressive that it works on my weak CPU:
|
| A 3m07s flac took 5m to transcribe: $ whisper
| --device cpu 'BLACKPINK - BORN PINK/01 Pink Venom.flac'
| Detecting language using up to the first 30 seconds. Use
| `--language` to specify the language Detected language:
| korean [00:00.000 --> 00:10.000] Blackpink
| [00:11.000 --> 00:14.000] Kick in the door, wave in the coco
| [00:14.000 --> 00:16.000] pabkonineun cinge ggyeodeul saenggag
| malgo [00:16.000 --> 00:19.000] I talk to talk, run ways I
| walk walk [00:19.000 --> 00:21.000] him gamgo pab pab an
| bwado ceog [00:21.000 --> 00:24.000] By one and two by two
| [00:24.000 --> 00:26.000] nae songgeut du hanae tamyeon ajieun
| jung [00:26.000 --> 00:30.000] gas jasyo jigeum hwaryeohae
| T makes no sense [00:30.000 --> 00:32.000] You couldn't
| get a dollar out of me [00:33.000 --> 00:38.000] ja oneul
| bamiya nuntobeul pumgo [00:38.000 --> 00:41.000] mihoneul
| bbaeseum down [00:41.000 --> 00:43.000] Look what you made
| us do [00:43.000 --> 00:47.000] ceonceonhi neol jamjaeul
| paieo [00:48.000 --> 00:52.000] jami nal mankeum
| areumdaweo [00:52.000 --> 00:53.000] I bring the pain like
| [00:53.000 --> 00:57.000] diseutab, paengpaeng, diseutab,
| paengpaeng, diseutab, paengpaeng, paengpaeng [00:57.000 -->
| 00:58.000] Get em, get em, get em [00:58.000 -->
| 01:00.000] Straight till you don't like [01:00.000 -->
| 01:01.000] Whoa, whoa, whoa [01:01.000 --> 01:03.000]
| Straight till you don't like [01:03.000 --> 01:04.000] Ah,
| ah, ah [01:04.000 --> 01:05.000] Taste that, pink venom
| [01:05.000 --> 01:06.000] Taste that, pink venom
| [01:06.000 --> 01:08.000] Taste that, pink venom
| [01:08.000 --> 01:09.000] Get em, get em, get em
| [01:09.000 --> 01:11.000] Straight till you don't like
| [01:11.000 --> 01:12.000] Whoa, whoa, whoa [01:12.000 -->
| 01:13.000] Straight till you don't like [01:13.000 -->
| 01:14.000] Ah, ah, ah [01:14.000 --> 01:15.000] Blackpink
| and Amo [01:15.000 --> 01:17.000] Got it by the smack ram
| [01:17.000 --> 01:18.000] But rest in peace [01:18.000 -->
| 01:19.000] Please light up a candle [01:19.000 -->
| 01:20.000] This the knife of a vando [01:20.000 -->
| 01:22.000] Messed up and I'm still in saline ...SNIP...
| lunixbochs wrote:
| Looks like it defaults to the model called "small".
|
| I just ran some benchmarks - M1 Max, pytorch, with a 1.29
| second flac (looks like the matrix math was running on a single
| thread):
|
|     tiny     146.522ms detect_lang     549.131ms decode_one   0.057ms tokenizer
|     base     354.885ms detect_lang    1046.679ms decode_one   0.011ms tokenizer
|     small    803.892ms detect_lang    3194.503ms decode_one   0.017ms tokenizer
|     medium  2279.689ms detect_lang   10128.255ms decode_one   0.023ms tokenizer
|     large   3656.478ms detect_lang   17249.024ms decode_one   0.016ms tokenizer
| adgjlsfhk1 wrote:
| For more benchmarks on an rtx 2060 (6gb), the "small" model for
| me is roughly 10x real-time and the tiny model is 30x real-
| time.
| lazylion2 wrote:
| I ran it on this clip
|
| https://clips.twitch.tv/ReliablePopularWerewolfOSkomodo-pcuw...
|
| because... hard accent.
|
| on the first run Whisper thought it was Welsh, so I had to run
| with --language en, and it did pretty well
|
| https://i.imgur.com/TQiYU9X.png
|
| took 36 seconds in Google colab
| manishsharan wrote:
| Oh, this is a relief, to have something open source in this
| field. I had been using Mozilla DeepSpeech for transcribing my
| voice notes, often with hilarious to incomprehensible results.
| DeepSpeech is dead, so I will be sure to check this out.
| pabs3 wrote:
| DeepSpeech got spun out of Mozilla to coqui.ai and they are
| continuing the open nature of the project.
| w10-1 wrote:
| Naively, training the same model on multiple languages has
| interesting implications.
|
| On one hand, it may capture something "deeper" about language.
|
| On the other hand, it's likely to do great in general, but miss
| particularities of some language.
|
| Understanding the coverage of the training model seems a
| perennial problem. Is there any (shorthand) way to compare
| language model training corpora?
|
| Clearly if they use common subsets we have a literal comparison.
| I'm more interested in whether there's progress in characterizing
| corpora by speech styles, fluency, vocabulary sets, (noise)
| environment, emotionality, proposition types, etc.
|
| (btw: 25 minutes for a 9-minute segment on a 12-thread x86. Lots
| of jargon spelled as it sounds. Sentences capitalized but no
| punctuation. Overall good.)
| dindindin wrote:
| I'm not in the Speech Recognition circles and am looking for open
| source speech recognition I can play around with - would this be
| the new state of the art?
| mercurywells wrote:
| For me as a deaf person the current state of art (in terms of
| speed & usability) is the Recorder app on a Google Pixel phone
| (4a/6 Pro is what I've used)
| StevenWaterman wrote:
| Yes
| visarga wrote:
| Most probably
| graderjs wrote:
| The big question is why is Google's speech recognition in Gboard
| voice typing still so shit?
|
| https://news.ycombinator.com/item?id=32862172
|
| MIT licensed model seems way better
| The5thElephant wrote:
| How is it Apple, Google, or Microsoft are not further ahead of
| the game on speech recognition like this? They have the resources
| to hire the best ML researchers and throw tons of computing hours
| at it, yet Siri, Google, and Cortana continue to struggle to get
| anywhere near this level of comprehension.
| wongarsu wrote:
| Siri and Cortana have to run at least in real time, with
| reasonable compute resources. Probably faster than real time
| when the audio gets shipped off to the cloud and transcribed
| there. This model can't do that (in the "large" version, which
| the examples use).
|
| Also, you are comparing Whisper's highlight reel with everyday
| performance of other models. Nobody shows their weaknesses in
| their highlight reel.
| alex_marchant wrote:
| Siri until iOS 15 was done in the cloud IIRC.
| coder543 wrote:
| Someone else in this thread[0] said Whisper was running at
| 17x real time for them. So, even a weak machine might be able
| to do an acceptable approximation of real time with Whisper.
|
| Also, I feel like shipping to the cloud and back has been
| shown to be just as fast as on device transcription in a lot
| of scenarios. Doing it on device is primarily a benefit for
| privacy and offline, not necessarily latency. (Although,
| increasingly powerful smartphone hardware is starting to give
| the latency edge to local processing.)
|
| Siri's dictation has had such terrible accuracy for me (an
| American English speaker without a particularly strong
| regional accent) and everyone else I know for so many years
| that it is just a joke in my family. Google and Microsoft
| have much higher accuracy in their models. The bar is so low
| for Siri that I automatically wonder how much Whisper is
| beating Siri in accuracy... because I assume it has to be
| better than that.
|
| I really wish there was an easy demo for Whisper that I could
| try out.
|
| [0]: https://news.ycombinator.com/item?id=32928207
| lunixbochs wrote:
| 17x realtime _on a 3090_
|
| I did some basic tests on CPU, the "small" Whisper model is
| in the ballpark of 0.5x realtime, which is probably not
| great for interactive use.
|
| My models in Talon run closer to 100x realtime on CPU.
| coder543 wrote:
| "CPU" isn't necessarily the benchmark, though. Most
| smartphones going back years have ML inference
| accelerators built in, and both Intel and AMD are
| starting to build in instructions to accelerate
| inference. Apple's M1 and M2 have the same inference
| accelerator hardware as their phones and tablets. The
| question is whether this model is a good fit for those
| inference accelerators, and how well it works there, or
| how well it works running on the integrated GPUs these
| devices all have.
|
| Brute forcing the model with just traditional CPU
| instructions is fine, but... obviously going to be pretty
| slow.
|
| I have no experience on the accuracy of Talon, but I've
| heard that most open source models are basically overfit
| to the test datasets... so their posted accuracy is often
| misleading. If Whisper is substantially better in the
| real world, that's the important thing, but I have no
| idea if that's the case.
| lunixbochs wrote:
| See https://news.ycombinator.com/item?id=32929029 re
| accuracy, I'm working on a wider comparison. My models
| are generally more robust than open-source models such as
| Vosk and Silero, but I'm definitely interested in how my
| stuff compares to Whisper on difficult held-out data.
|
| > Brute forcing the model with just traditional CPU
| instructions is fine, but... obviously going to be pretty
| slow.
|
| It's not that simple. Many of the mobile ML accelerators
| are more targeted for conv net image workloads, and
| current-gen Intel and Apple CPUs have dedicated hardware
| to accelerate matrix math (which helps quite a bit here,
| and these instructions were in use in my tests).
|
| Also, not sure which model they were using at 17x
| realtime on the 3090. (If it's one of the smaller models,
| that bodes even worse for non-3090 performance.) The 3090
| is one of the fastest ML inference chips in the world, so
| it doesn't necessarily set realistic expectations.
|
| There are also plenty of optimizations that aren't
| applied to the code we're testing, but I think it's
| fairly safe to say the Large model is likely to be slow
| on anything but a desktop-gpu-class accelerator just due
| to the sheer parameter size.
| lunixbochs wrote:
| Ok, my test harness is ready. My A40 box will be busy
| until later tonight, but on an NVIDIA A2 [1], this is the
| batchsize=1 throughput I'm seeing. Common Voice, default
| Whisper settings, card is staying at 97-100% utilization:
|
|     tiny.en:   ~18 sec/sec
|     base.en:   ~14 sec/sec
|     small.en:   ~6 sec/sec
|     medium.en: ~2.2 sec/sec
|     large:     ~1.0 sec/sec (fairly wide variance when ramping up
|                as this is slow to process individual clips)
|
| [1] https://www.nvidia.com/en-us/data-center/products/a2/
| coder543 wrote:
| Isn't the A2 much weaker than a 3090? So those results
| are promising.
|
| EDIT: for what it's worth, Nvidia rated the A2 at 18
| TFLOPS of FP16, and Apple rates the current A16 Neural
| Engine at 17 TFLOPS of FP16. I'm sure it's not an "apples
| to apples" comparison.
| lunixbochs wrote:
| If you count the GPU component and memory bandwidth, the
| Apple M2 is slightly weaker on paper for 16-bit inference
| than the NVIDIA A2, if you manage to use the whole chip
| efficiently. The A16 is then slightly weaker than the M2.
|
| Sure, the Whisper Tiny model is probably going to be fast
| enough, but from my preliminary results I'm not sure it
| will be any better than other models that are much much
| faster at this power class.
|
| Whisper Large looks pretty cool, but it seems much harder
| to run in any meaningful realtime fashion. It's likely
| pretty useful for batch transcription though.
|
| Even if you hit a realtime factor of 1x, the model can
| leverage up to 30 seconds of future audio context. So at
| 1x, if you speak for 10 seconds, you'll potentially need
| to wait another 10 seconds to use the result. This kind
| of latency is generally unsatisfying.
| coder543 wrote:
| EDIT: After writing and posting the original version of
| this comment, I did an experiment where I dictated it to
| Siri, and then saved that audio (which was recorded
| simultaneously), which I then fed to both Whisper's
| tiny.en and medium.en... Siri did terrible for me.
| Whisper tiny.en was 100% accurate, as far as I can tell,
| and the only thing Whisper medium.en did was add a few
| commas that tiny.en had missed. I actually ended up
| playing the audio file for Siri as well, and that did not
| end well either. YMMV, but even the tiny model seems very
| useful. tiny.en took 17.5 seconds to process the ~1
| minute audio file, and medium.en took 351 seconds, but I
| think there is a lot of room for performance optimization
| on this M2 MBA. The model evaluation was purely using the
| CPU, not GPU or neural engine, and it wasn't even using
| all of the CPU cores for whatever reason.
|
| ----
|
| With Siri dictation, I feel like I usually spend at least
| as much time correcting its mistakes as I do speaking the
| dictation itself. In some cases, that is still
| faster/easier than typing, but I would rather have a
| voice model that can work in about the same _total_
| amount of time without requiring constant corrections. If
| I speak for 30 seconds, then I can do other things for 30
| seconds while my phone processes it... that might
| actually be preferable if it gets it right. Otherwise,
| I'll be spending 30 seconds actively editing it anyways.
| Even an improvement on the number of edits required per
| dictation would be nice. Admittedly, I feel like Google
| and Microsoft _already_ do a much better job here.
|
| It could be interesting to use the tiny model to give a
| preview of the writing while the large model is taking
| its time, and then allow the user to tap on words that
| changed to see the predictions from the tiny model and
| correct back to them if they want. I was doing some
| experiments a few minutes ago, and on one audio clip, the
| tiny model wrote down a very literal interpretation of an
| uncommon sci-fi word, and that was more accurate than
| either the medium or the large models. The rest of the
| time, the larger models did better, as expected.
|
| But, I don't know. This is interesting to me, but I agree
| there could be issues with making it workable for real-time
| transcription.
| MacsHeadroom wrote:
| > I really wish there was an easy demo for Whisper that I
| could try out.
|
| Like the colab notebook linked on the official Whisper
| github project page?
| coder543 wrote:
| Sure, but I did see one linked in another thread here on
| HN after posting that comment.
| The5thElephant wrote:
| Good point about realtime or not, however with ML I have
| found the weaknesses get addressed pretty fast by someone.
| There is a big step between proof of concept and practical
| application though, so we shall see.
| Kuinox wrote:
| OpenAI is owned by Microsoft FYI.
| neongreen wrote:
| Is it? Googling suggests that Microsoft invested in OpenAI
| but doesn't actually own it.
| Kuinox wrote:
| Oh, my bad looks like they only bought an exclusive license
| to GPT3.
| fxtentacle wrote:
| This AI has a 30 second delay on the audio processing because
| it needs to be able to "look into the future" to get these good
| results. That 30s delay would be unacceptable for
| Siri/Google/Cortana.
| coder543 wrote:
| A lot of models we currently use seem to do the same thing.
| The model will transcribe a "best effort" interpretation in
| real time, then as you can continue speaking, you'll see it
| go back and make corrections. I'm sure you can feed the first
| X seconds you have into the model, followed by (30-X) seconds
| of silence, and it will do real time transcription just
| fine... it would be weird if this broke anything. Then, as
| you get more speech, you continue getting better
| transcription of the first 30 seconds, then you switch to a
| 30 second sliding window.
|
| Maybe I'm missing something, but I don't see the problem
| here.
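|
| For what it's worth, whisper.pad_or_trim already zero-pads audio
| out to the model's fixed 30-second window, so a rough sketch of
| that idea (a provisional decode of a partial utterance, redone as
| more audio arrives; the file name is just a placeholder) might
| look like:
|
|     import whisper
|
|     # Sketch: decode whatever audio has arrived so far by zero-padding
|     # it to the fixed 30-second window, then re-decode later with a
|     # fuller window.
|     model = whisper.load_model("base")
|     audio = whisper.load_audio("partial_utterance.wav")  # first X seconds
|     audio = whisper.pad_or_trim(audio)                    # zero-pad to 30 s
|     mel = whisper.log_mel_spectrogram(audio).to(model.device)
|     result = whisper.decode(model, mel, whisper.DecodingOptions(fp16=False))
|     print(result.text)  # provisional transcript for the partial window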
| fxtentacle wrote:
| Yes, that's because Whisper - like pretty much all of them
| - uses a Transformer encoder with Attention layers. And the
| Attention layers learn to look into the future.
|
| And yes, what you describe could be done. But no, it won't
| reduce latency that much, because the model itself learns
| to delay the prediction w.r.t. the audio stream. That's why
| ASR-generated subtitles usually need to be re-aligned after
| the speech recognition step. And that's why there is
| research such as the FastEmit paper to prevent that, but
| then it is a trade-off between latency and quality again.
|
| Also, running your "low-latency" model with 1s chunks means
| you now need to evaluate the AI 30x as often as if you'd be
| using 30s chunks.
| coder543 wrote:
| You just said the models pretty much all work the same
| way, then you said doing what I described won't help. I'm
| confused. Apple and Google both offer real time, on
| device transcription these days, so _something_ clearly
| works. And if you say the models already all do this,
| then running it 30x as often isn't a problem anyways,
| since again... people are used to that.
|
| I doubt people run online transcription for long periods
| of time on their phone very often, so the battery impact
| is irrelevant, and the model is ideally running (mostly)
| on a low power, high performance inference accelerator
| anyways, which is common to many SoCs these days.
| fxtentacle wrote:
| I meant that most research that has been released in
| papers or code recently uses the same architecture. But
| all of those research papers use something different than
| Apple and Google.
|
| As for running the AI 30x, on current hardware that'll
| make it slower than realtime. Plus all of those 1GB+
| models won't fit into a phone anyway.
| coder543 wrote:
| > Plus all of those 1GB+ models won't fit into a phone
| anyway.
|
| I don't think that's a requirement here. I've been
| playing with Whisper tonight, and even the tiny model
| drastically outperformed Siri dictation for me in my
| testing. YMMV, of course.
| beastman82 wrote:
| In my unmeasured empirical observation Google has amazing
| speech recognition
| jeffbee wrote:
| I tried feeding the four examples from this announcement into
| Google as dictation inputs and it just sits there blankly. On
| the JFK speech test file in the repo, Google understands
| perfectly. The samples in the announcement are clearly
| outside the capabilities of anything Google has launched
| publicly, but I don't know how that translates to overall
| utility in every day applications.
| The5thElephant wrote:
| I agree they have the best compared to Apple, Amazon,
| Microsoft. However I don't think it is as good as what is
| being shown here by OpenAI.
| Vetch wrote:
| My experience with the APIs is Google is excellent and
| Microsoft is slightly better. And the offline model I've
| been using that's nearly as good as both is facebook's
| wav2vec2-large-960h-lv60-self.
|
| Don't believe what's on marketing pages, they rarely
| transfer to the real world. Will have to make time to try
| it and see. In theory, given task diversity and sheer
| number of hours, it should be a lot more robust but will
| wait on evidence before believing any claims on SoTA.
| KingMob wrote:
| Weird. I started working on an ASR SaaS in my spare time,
| and at least on the test podcasts, Google was the _worst_
| : https://www.sammaspeech.com/blogs/post/speech-
| recognition-ac...
| RockRobotRock wrote:
| Dude, this is insane. This is so much better than other speech to
| text libraries I've tried.
| danso wrote:
| This is an astonishing package. Every AI voice-to-text model I've
| tried on "The Wire's" famous "fuck" scene [0] usually fails,
| because the youtube clip's audio quality is bad and it's a scene
| with virtually no dialogue except breathing and "Fuck". But
| Whisper returned impressive results [1]
|
| [0] https://www.youtube.com/watch?v=DS6pE88Xg3s
|
| [1] $ yt-dlp --extract-audio --audio-format mp3
| -o wire-fuck.mp3 https://www.youtube.com/watch?v=DS6pE88Xg3s
| $ whisper --language en wire-fuck.mp3 [00:00.000 -->
| 00:02.000] Oh [00:13.260 --> 00:15.260] Fuck
| [00:15.260 --> 00:31.260] Motherfucker [00:50.700 -->
| 00:52.700] Fuck [00:52.700 --> 00:58.700] Oh
| [00:58.700 --> 01:10.700] Fuck [01:28.700 --> 01:55.900]
| Fuck [02:02.340 --> 02:03.700] Motherfuck.
| [02:10.220 --> 02:11.220] Oh, fuck. [02:11.780 -->
| 02:12.780] Oh, fuck. [02:25.900 --> 02:27.900] Fuck,
| fuck, fuck, fuck, fuck, fuck. [02:27.900 --> 02:28.900]
| Motherfucker. [02:32.900 --> 02:33.900] Oh, fuck.
| [02:34.900 --> 02:35.900] Fuck. [02:35.900 -->
| 02:36.900] Oh, fuck. [02:36.900 --> 02:37.900] Oh,
| fuck. [02:37.900 --> 02:38.900] Oh, fuck.
| [02:48.900 --> 02:49.900] Motherfucker. [02:53.900 -->
| 02:54.900] Fucking A. [02:54.900 --> 02:56.900] Mm hmm.
| [02:56.900 --> 03:12.900] Fuck. [03:26.900 -->
| 03:28.900] Motherfucker. [03:28.900 --> 03:32.900] Fuck
| me. [03:58.900 --> 04:01.900] Oh. [04:28.900 -->
| 04:34.900] Fuck.
| owenpalmer wrote:
| nsfw
| andy_xor_andrew wrote:
| Hold on, it does not only speech recognition, but also language
| translation, in the same model?
|
| What an interesting approach. What benefits does this have over
| having two dedicated models, one for speech-to-text, and another
| for translation?
|
| It just seems so odd, given the problems of speech-to-text and
| Spanish-to-English seems so different from one another (in terms
| of the problem domain). Seems so unusual to have both handled by
| one model!
|
| Does knowledge of speech-to-text carry over into knowledge of
| translation? Does knowledge of translation carry over into
| knowledge of speech-to-text? So weird.
| ByThyGrace wrote:
| Judging from the chart in their github README, Whisper performs
| much better in parsing Spanish audio than any other language
| and that in particular blows my mind. I would have expected
| English to be at the top of any such model, it being such an IT
| lingua franca.
|
| Now I wonder if it works equally well with Spanish from Spain
| (and its different regions) and Spanish from the New World (and
| in its myriads of different flavours).
| newhaus1994 wrote:
| My understanding is that multi-modal models are the primary
| focus of OpenAI right now, due to their stated goal of
| achieving AGI. This product is probably better thought of as an
| offshoot of their work to create a fully generalizable model,
| rather than a specific attempt to provide
| translation/transcription services.
| beanlog wrote:
| It sounds useful to me because you can use tone information to
| help with the translation, which text-to-text translation can't
| do. But I'm not sure if that's how this model actually works.
| TaylorAlexander wrote:
| It seems these days that language-oriented models are commonly
| becoming multilingual by default. There are a lot of common
| threads when understanding sentence construction between
| different languages. French and English have different rules
| but they will still have things like nouns, adjectives,
| subjects, prepositions, etc. It seems that by training models
| on many languages you get both a more robust understanding of
| language, and it saves you the trouble of having to make many
| more localized models for every language. I also believe that
| the other languages help the models construct sentences in
| languages which have very small training sets. If it has a few
| examples in a rare language as well as good translations to a
| better-known language, then it can provide good support for the
| rare language.
|
| We also see in image generation models that multi-modal
| networks are more powerful than single purpose networks. As we
| move towards more advanced AI systems I suspect we will see
| more and more generalizable networks with distinct advantages
| over separate networks that get plugged together.
| magicalhippo wrote:
| Would a multilingual model perhaps also be better at
| understanding non-native speech?
| TaylorAlexander wrote:
| Good question but I don't know the answer.
| thuttinger wrote:
| I tried running it in realtime with live audio input (kind of).
|
| If you want to give it a shot, you can find the python script in
| this repo: https://github.com/tobiashuttinger/openai-whisper-
| realtime
|
| A bit more context on how it works: The system's default audio
| input is captured with Python, split into small chunks, and then
| fed to OpenAI's original transcription function. It tries
| (currently rather poorly) to detect word breaks and avoids
| splitting the audio buffer in those cases. Given how the model is
| designed, this isn't the most natural way to use it, but I
| figured it was worth trying. It works acceptably well.
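|
| A minimal sketch of the idea (assuming the sounddevice package
| and Whisper's Python API; the chunk length and model size are
| arbitrary choices, not anything the repo prescribes):
|
|       # Sketch: capture the default microphone, buffer a few seconds,
|       # and hand each chunk to Whisper.
|       import sounddevice as sd
|       import whisper
|
|       SAMPLE_RATE = 16000      # Whisper expects 16 kHz mono float32
|       CHUNK_SECONDS = 5        # arbitrary chunk length for this sketch
|
|       model = whisper.load_model("base")
|
|       while True:
|           # Record one chunk from the system default input (blocking).
|           audio = sd.rec(int(CHUNK_SECONDS * SAMPLE_RATE),
|                          samplerate=SAMPLE_RATE, channels=1, dtype="float32")
|           sd.wait()
|           # transcribe() accepts a float32 numpy array sampled at 16 kHz.
|           result = model.transcribe(audio.flatten(), fp16=False)
|           print(result["text"].strip())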
| catfan wrote:
| secret-noun wrote:
| impressive
| kkielhofner wrote:
| Haven't tried it yet but love the concept!
|
| Have you thought of using VAD (voice activity detection) for
| breaks? Back in my day (a long time ago) the webrtc VAD stuff
| was considered decent:
|
| https://github.com/wiseman/py-webrtcvad
|
| Model isn't optimized for this use but I like where you're
| headed!
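|
| For reference, a rough sketch of what using it looks like (the
| frame size and aggressiveness below are assumptions; webrtcvad
| wants 16-bit mono PCM in 10/20/30 ms frames):
|
|       # Sketch: flag 30 ms frames as speech / non-speech with py-webrtcvad,
|       # so chunk boundaries can be placed in silent stretches.
|       import webrtcvad
|
|       SAMPLE_RATE = 16000
|       FRAME_MS = 30
|       FRAME_BYTES = int(SAMPLE_RATE * FRAME_MS / 1000) * 2   # 16-bit samples
|
|       vad = webrtcvad.Vad(2)   # aggressiveness 0 (least) .. 3 (most)
|
|       def speech_flags(pcm16: bytes):
|           """Yield True/False per 30 ms frame of 16 kHz 16-bit mono PCM."""
|           for i in range(0, len(pcm16) - FRAME_BYTES + 1, FRAME_BYTES):
|               yield vad.is_speech(pcm16[i:i + FRAME_BYTES], SAMPLE_RATE)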
| thuttinger wrote:
| Interesting. I'll take a look at this, thanks!
| Curiositry wrote:
| Perhaps this could be adapted?
|
| https://github.com/mozilla/DeepSpeech-
| examples/blob/master/m...
| Kirkman14 wrote:
| I've been trying Whisper on my old setup (Mac Pro 2012 running
| Mojave, with Radeon RX 580), and it's a pretty amazing tool.
|
| Unfortunately my system is not ideal for today's AI tools.
| Whisper runs only on the CPU, and it's slow.
|
| I know PyTorch recently added Metal support, but only for M-based
| Macs. Has anyone found a way to make it work with Intel Macs?
| minimaxir wrote:
| The model output can be tweaked to produce audio embeddings (akin
| to BERT for text embeddings and CLIP for image embeddings), which
| can lead to some _interesting_ applications as the previous two
| examples have demonstrated.
| FerociousTimes wrote:
| What do you mean exactly by audio embeddings?
| minimaxir wrote:
| Represent a given set of audio inputs as a numeric vector,
| which can then for example be finetuned for other ML/AI
| problems or placed in an embeddings database for easy ANN
| search with similar audio clips. In the extreme case it could
| facilitate better AI audio generation similar to how CLIP can
| guide a VQGAN.
|
| Although the 30 second minimum input is a bit of a bummer
| since it may not allow much granularity in the resulting
| embeddings.
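|
| A rough sketch of what that could look like with the released
| code (treating the mean-pooled encoder output as the clip
| embedding is my assumption, not something OpenAI documents):
|
|       # Sketch: pull a fixed-size "audio embedding" out of Whisper's encoder
|       # by mean-pooling its output over time.
|       import torch
|       import whisper
|
|       model = whisper.load_model("base")
|
|       def embed(path):
|           audio = whisper.load_audio(path)
|           audio = whisper.pad_or_trim(audio)              # model works on 30 s windows
|           mel = whisper.log_mel_spectrogram(audio).to(model.device)
|           with torch.no_grad():
|               features = model.encoder(mel.unsqueeze(0))  # (1, n_frames, n_state)
|           return features.mean(dim=1).squeeze(0)          # one vector per clip
|
|       a, b = embed("clip_a.wav"), embed("clip_b.wav")
|       print(torch.cosine_similarity(a, b, dim=0).item())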
| lynguist wrote:
| How can I use this (or something similar) for live translation? I
| don't mind if there's a 30s delay.
|
| As in I don't want to input a file, I want to input the
| microphone sound.
| agnos wrote:
| Would also like to know this. It looks like they're processing
| the audio file in 30 second chunks, so a naive approach of
| keeping a buffer of 30-second input stream chunks and just
| continually writing to an output .mp3 could work...
| blueberrychpstx wrote:
| Was wondering the same.
|
| I really wish I would have been paying attention in Unix
| class...
|
| Something like `microphone | chunk 3s | whisper | stdout` would
| be SO COOL!!! I think that's possible but too lazy to look
| more.
| spywaregorilla wrote:
| Hmm are there any noteworthy open sourced speech to speech
| models? Like transform a spoken line to another voice, copying
| both the words spoken and the inflections?
| cercatrova wrote:
| Their Scottish accent example is pretty good, I'd like to see it
| work on some very strong English accents like this one:
| https://www.youtube.com/watch?v=nJ7QB3om-QY
| homarp wrote:
| Detected language: english
|
| [00:00.000 --> 00:05.400] Gordy and County Kerry are
| investigating the theft of up to 60 sheep on Mount Brandon.
|
| [00:05.400 --> 00:10.400] One of the farmers is offering a
| reward for information leading to the return of the use,
|
| [00:10.400 --> 00:12.200] which are worth thousands of euro.
|
| [00:12.200 --> 00:14.200] Well, I'm fine with that.
|
| [00:14.200 --> 00:15.200] That's right.
|
| [00:15.200 --> 00:16.200] Do you own them?
|
| [00:16.200 --> 00:17.200] Anyone can say it.
|
| [00:17.200 --> 00:18.200] Fine with that.
|
| [00:18.200 --> 00:22.720] Last Saturday, Mikey Joe O'Shea
| brought his flock of Scotch sheep down from the mountain
|
| [00:22.720 --> 00:25.320] commonage ahead of lambing.
|
| [00:25.320 --> 00:29.840] He discovered over 50 were missing,
| allowing for a number of deaths and
|
| [00:29.840 --> 00:30.840] strays.
|
| [00:30.840 --> 00:34.600] Mikey is convinced over 45 sheep have
| been stolen.
|
| [00:34.600 --> 00:35.600] It was a good night.
|
| [00:35.600 --> 00:36.600] It would be a full moon there.
|
| [00:36.600 --> 00:37.600] It would be a good night.
|
| [00:37.600 --> 00:38.600] It would be bright out.
|
| [00:38.600 --> 00:40.600] There could be anyone going up in the
| mountains.
|
| [00:40.600 --> 00:41.600] It would be a good night.
|
| [00:41.600 --> 00:43.600] Well, that was 45 sheep missing.
|
| [00:43.600 --> 00:49.600] Mikey and the lambs and everything in
| the sheep, they counted out a nice bit of money.
|
| [00:49.600 --> 00:52.200] They've been doing the boat in
| Nassan.
|
| [00:52.200 --> 00:53.200] It's a big one.
|
| [00:53.200 --> 00:54.200] It's a big one.
|
| [00:54.200 --> 00:55.200] It's a big one.
|
| [00:55.200 --> 00:59.000] Mikey's next door neighbor says some
| of his sheep have also been stolen.
|
| [00:59.000 --> 01:00.000] Come back.
|
| [01:00.000 --> 01:01.000] Come back.
|
| [01:01.000 --> 01:02.000] Come back.
|
| [01:02.000 --> 01:03.000] I've been missing about 10 years.
|
| [01:03.000 --> 01:04.000] It's not all that difficult.
|
| [01:04.000 --> 01:06.320] All they've got to do is have a good
| dog.
|
| [01:06.320 --> 01:10.560] Have a good dog and go at night, some
| moonshine night.
|
| [01:10.560 --> 01:11.560] Just put the dog around him.
|
| [01:11.560 --> 01:14.120] Put him on a trailer and walk him.
|
| [01:14.120 --> 01:18.360] And then probably somebody else to
| pick him up.
|
| [01:18.360 --> 01:29.960] Everybody's doing it north, but he's
| doing it.
| hegemon8 wrote:
| Wow!
| cercatrova wrote:
| Wow that is incredibly impressive. At 0:53 is it translating
| as well? Didn't sound like English to me.
| mod wrote:
| Those are Irish.
| angrais wrote:
| Are you sure? I just ran some of Kimmy's sketches through it
| and ... The results are garbage.
| biggerChris wrote:
| We have reached sentient mode.
| howon92 wrote:
| I just tested it on a few of my YouTube videos in Korean and it's
| surprisingly good at transcription.
| dom96 wrote:
| This really makes me want to build an Amazon Echo/Google Nest/etc
| replacement that's open hardware, open source and most
| importantly recognises voice completely offline. I find that I
| don't use these smart devices for much more than setting timers
| anyway so this seems like an easy project.
|
| I just wonder what system requirements Whisper has and whether
| there are open source voice recognition models that are
| specifically built for embedded devices.
| solarkraft wrote:
| Are you thinking about reimplementing Mycroft?
|
| The Mycroft project has done a lot of cool and important work in
| the field to ship an actual personal assistant product (stuff
| like wake-word detection).
| dom96 wrote:
| hah, of course someone had the idea already and executed on
| it. But yeah, basically that but without the screen (probably
| would go a long way to decrease the cost, $299 is pretty
| steep for such a device)
| MayeulC wrote:
| Well, you can always install Mycroft on a Pi, or on your
| computer.
|
| Almond is also interesting as a voice assistant, though I
| think it doesn't perform speech recognition itself.
| sheepybloke wrote:
| One thing they don't touch much on is the STT, as they use
| models from third parties. You could definitely do
| something that utilizes this model and then feeds the
| tokens to some of their parsing code. I've been working on
| something similar to this, but burned out around adding the
| STT portion [0].
|
| [0]: https://github.com/Sheepybloke2-0/trashbot - It was
| called trashbot because the final implementation was going
| to look like oscar the grouch in a trashcan displaying the
| reminders.
| suyash wrote:
| This is only one side of the coin, you still need really good
| models for Speech Synthesis and then be able to have it all
| working in almost real time, ideally locally on device.
| ricopags wrote:
| As far as TTS goes, Mycroft.ai[0] has released a decent
| offline one.
|
| [0]https://mycroft.ai/
| MacsHeadroom wrote:
| I really want all this too. The smallest model is ~80mb and the
| largest is 3gb. Not sure about system requirements yet; but
| models that small suggest this may be doable locally on a
| single board computer.
|
| Edit: According to this comment[0] the base model runs in real
| time on an M1 CPU. The tiny model apparently decodes an audio
| file twice as fast. These are promising results.
|
| [0] https://news.ycombinator.com/item?id=32927360#32929739
| dom96 wrote:
| I'd be interested to see how well it performs on something
| like an RPi. M1 is pretty beefy.
| olao99 wrote:
| To be more precise, the original comment said "M1 Max", which is
| significantly beefier than a bare "M1".
| lunixbochs wrote:
| For an offline (non-streaming) model, 1x realtime is actually
| kind of bad, because you need to wait for the audio to be
| available before you can start processing it. So if you wait
| 10 seconds for someone to finish speaking, you won't have the
| result until 10 seconds after that.
|
| You could use really small chunk sizes and process them in a
| streaming fashion, but that would impact accuracy, as you're
| significantly limiting available context.
| howon92 wrote:
| I just tried it in a few Korean YouTube videos and it's
| surprisingly accurate, to an extent where I would've thought it
| was done by a human.
| TOMDM wrote:
| Given how robust it seems to be with fast speech, I wonder if you
| could save cycles by speeding up the audio before feeding it in.
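|
| One quick way to test that (assuming ffmpeg's atempo filter and
| the standard whisper Python API; the 1.5x factor is arbitrary):
|
|       # Sketch: speed the audio up with ffmpeg's atempo filter, then
|       # transcribe the shorter file. Whether accuracy survives the
|       # speed-up is exactly the open question.
|       import subprocess
|       import whisper
|
|       subprocess.run(["ffmpeg", "-y", "-i", "talk.mp3",
|                       "-filter:a", "atempo=1.5", "talk_fast.mp3"], check=True)
|
|       model = whisper.load_model("base")
|       print(model.transcribe("talk_fast.mp3")["text"])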
| eatsyourtacos wrote:
| Can this be used as a real-time transcription or is it too slow
| for that?
|
| Curious what anyone is using these days for a real-time
| transcription. It doesn't have to be perfect, but just good
| enough.
|
| My kids watch some YouTube videos where people will make a mod
| where it converts them talking to text then look for keywords and
| spawn a boss in Terraria if you say the wrong keyword etc.
|
| I made a clone of that with the .NET System.Speech.Recognition
| library. It... works.. but my biggest problem is that #1 it waits
| until you are done speaking to translate to text on the callback,
| so there was too much of a delay for it to be fun.. the point is
| that it will be checking a stream of chatter. #2 is the
| recognition is pretty crap, I mean it's nearly good enough for my
| silly purpose but it's still pretty bad.
| blueberrychpstx wrote:
| If your family uses Apple devices, Apple offers free on-device
| speech recognition. Only caveat is that it needs to be
| restarted every minute due to whatever stupid limitation (or
| bug) they've introduced.
|
| https://developer.apple.com/documentation/speech/recognizing...
|
| Also, see `requiresOnDeviceRecognition`
| [deleted]
| [deleted]
| nshm wrote:
| Try https://github.com/alphacep/vosk-
| api/blob/master/csharp/demo...
| jayavanth wrote:
| thuttinger posted in this thread:
| https://github.com/tobiashuttinger/openai-whisper-realtime
| whimsicalism wrote:
| It might require too much work for what you are looking for,
| but the wav2letter library is the best real-time transcription
| OSS I have found by a considerable margin.
| davidzweig wrote:
| Out of interest, did you try Nemo?
| https://github.com/NVIDIA/NeMo
| whimsicalism wrote:
| No. I don't think it had streaming capabilities when I was
| doing this test two years ago, although I see it does now.
| NaturalPhallacy wrote:
| I tried it out and it's way too slow on my machine that is no
| slouch (Ryzen 9 5950/GTX 3080).
|
| It's doing seconds of translation per minute for me at least.
| TaylorAlexander wrote:
| The base model seems to run faster than real time on my
| machine. The "medium" model is larger and runs more slowly -
| roughly real time or maybe slightly slower.
| suyash wrote:
| Depends if you're trying to run it offline or over the cloud.
| dot1x wrote:
| That's all good and great, now please do OCR...
| tgtweak wrote:
| Good to see them releasing model weights - hopefully now that
| Stable Diffusion is out they will release Dall-E 2 source and
| weights as well.
| knaik94 wrote:
| I got super weird results with the 'medium' model and language
| Japanese (with --task translate). The song is False Sympathy by
| Mondo Grosso.
|
| "[01:17.000 --> 01:32.000] Translated by Releska" appears when
| using the translate-to-English task. That entire part of the song
| is instrumental. This line does not appear at all in the plain
| transcription, only in the opus-format rip.
|
| It shows up in the yt rip in format 251 (opus), but not in format
| 140 (aac from youtube), nor the flac rip. All three are giving
| different results.
|
| The translation quality is tied to bitrate. Same song converted
| to different words, the only difference being bitrates and
| formats. Converting my own rip with the same parameters as yt
| (opus @140 and then @130) didn't allow me to reproduce this
| error.
|
| The model hung for a solid extra minute at the end when
| translating to english, the last 90ish seconds of the song took
| real time 60 seconds, while the entire rest took about 90. The
| same behavior was not observed with the transcribe.
|
| Some of the english words are incorrect but that was expected.
| The first Japanese "mistake" I found was "Quan tehaEr Ren no"
| instead of "subeteha hutarino". With the left being what whisper
| wrote. A single random word "hey" was transcribed/translated to
| english even though it's the singer elongating the Yuan while
| singing the Le Yuan . "Luo chiteyuku Er Ren deXi garetaEr Ren
| noragu HEY" instead of "Luo chiteiku Suo detsunagareta Er Ren
| noLe Yuan " .
|
| I am using the official subtitles released on the youtube video.
|
| It's a complex Japanese song with both japanese and english, and
| the original transcribe took about 20 real time seconds to start
| with the first line, 130 seconds for the whole song. It seems to
| be showing results in 20 second window increments, but this seems
| to depend on what it considers audio and what it is throwing
| away.
|
| On my computer I wasn't able to use the large model because I ran
| out of VRAM, I have 8gb, not sure how much more it'd require. So
| I ran it with medium.
|
| The song is False Sympathy by Mondo Grosso. The mv is suggestive,
| in case that matters. I grabbed a fresh audio rip from Youtube
| because I didn't want to take it out of my cd case.
|
| https://www.youtube.com/watch?v=B6Y-WsgpzlQ
|
| It is translating this version differently from the director's
| cut version. I ripped both as opus.
|
| There is something weird about how it is handling the opus
| encoded version, as I find the same "Translated by Releska" in a
| wav version transcoded from the opus.
| adeptima wrote:
| Japanese output will contain a lot of tiny mistakes. However, the
| whole output is still good enough. Like 95%-plus good enough.
|
| I found a lot of mistakes in 3-4 character kanji words ... and I
| guess most native Japanese speakers make mistakes from time to
| time too, which is why they pop up lots of buzzwords on screen
| with all kinds of highlighting to avoid double-guessing.
| Gazoche wrote:
| Pretty cool, and it seems to work on AMD GPUs as well. I've just
| tried it on my RX6800 with the ROCm build of PyTorch.
| amrrs wrote:
| Here's a live demo on Hugging Face Spaces if you want to try -
| https://huggingface.co/spaces/Amrrs/openai-whisper-live-tran...
| coder543 wrote:
| I've tried speaking to that demo several times... I used the
| built in feature to record from microphone, and I played back
| the samples to make sure they were audible and clear.
|
| Sometimes it outputs the words "thank you" (which I did not
| say), sometimes it outputs a period. It never once output
| anything I said. It seems completely broken.
|
| EDIT: apparently something about the combination of
| Safari+HF+Whisper was not working. I tried another Whisper demo
| on HF and had the same results. Switching to Chrome made it
| work flawlessly... I have no idea what kind of codec
| incompatibility was happening.
| clemnt wrote:
| this is amazing! got it working in French too
| TaylorAlexander wrote:
| Hey this looks great! I like to record audio notes while driving
| in my car after work, to kind of decompress my thoughts from the
| day. But I never go back and listen as they can be long and
| meandering. Sometimes in the audio log I will sum up my thoughts,
| but this might be 20 minutes in and hard to find. I really wish I
| had transcriptions so I could easily scan the full contents. I
| have tried Mozilla Deepspeech (I don't want a cloud solution) and
| I was surprised to find that I could not get Deepspeech to
| reliably transcribe them. There is a bit of road noise, though I
| think for a human listener they are easy to understand. It looks
| like this one might actually do the trick!
|
| EDIT: Tried it and it worked great! It is very easy to use. I
| just did the pip install line in the readme and was ready to go.
| You literally just run the one pip install line, and then you run
| the program in the format "whisper my_audio.wav" and it goes.
| Really nice job OpenAI!
| zhynn wrote:
| I do this too! I have been doing it for about a year now, and
| haven't ever run into someone else that does this kind of
| audio-journaling. Would you be up for comparing notes sometime
| about how it is working out for you? I am finding that it is an
| extremely effective form of self-care, but with lots of
| personal caveats. I would be so interested to hear your
| experience.
| blueberrychpstx wrote:
| Count me in!! Working on tools actually to turn these
| transcriptions into something more social
| tekacs wrote:
| I do this too, and I've built some software for it just for
| myself.
|
| I'd love to chat and hear about how you use this! My email is
| in my profile, or I'm @tekacs on Twitter (and everywhere). :)
| TaylorAlexander wrote:
| Oh cool! Yeah I have stopped doing it lately as I was not
| really using them (I would like to use them for making rough
| notes for future youtube video scripts), though in general it
| does seem like good self care too even if I don't review
| them. That said I just tried the base model on one of my
| voice logs and it was pretty good! Trying the medium model
| now and it seems basically perfect. So I will have to start
| doing these logs more!
|
| Anyway I am pretty terrible with email but short exchanges
| can work for me, or maybe we can connect over signal. Send me
| a message to my email in my profile and I would be happy to
| sync up!
| Snitch-Thursday wrote:
| Google's recorder app for android will let you record audio
| files and make some transcriptions, right on the device.
| olao99 wrote:
| Google's recorder app is NOT available for most phones. Only
| Pixels and a couple of other selected handsets
| Tenoke wrote:
| I just tested it and it was pretty mediocre at least with my
| accent. I can definitely benefit from a decent app for quick
| note recording with a button press->transcribe->upload to
| gdrive/good UI app for later grepping.
| TaylorAlexander wrote:
| Was this with the default base model, or the medium or
| large model? This can be specified with the --model flag.
| Tenoke wrote:
| I meant the 'Google's recorder app' from the parent
| comment and not Whisper.
| TaylorAlexander wrote:
| Ah right, sorry got my comment threads mixed up! Someone
| else was asking about performance with accented English
| speakers in another comment.
| capableweb wrote:
| Is that application actually doing on-device transcription?
| Under "Data safety" on the Google Play page it says "This app
| may share these data types with third parties: Audio" which
| doesn't exactly instill confidence that my audio will 100%
| always stay on my device. It also says "Data is encrypted in
| transit" but if data stays on the device, why it has to be
| "encrypted in transit"? There should be no transit at all.
| bruckie wrote:
| Yes, it works completely offline, including transcription
| and recognition of music. There's an optional cloud sync
| feature, which I assume is the reason for the notice on
| Google Play.
|
| (Work for Google, don't speak for them.)
| capableweb wrote:
| Thanks. Who's the third party that might get access to
| the audio? First party would be me, second party would be
| Google, and then the third?
| zed1726 wrote:
| bruckie wrote:
| I think it's just Google for backup, or other apps via
| Android's standard sharing sheet. You can read the
| details here: https://support.google.com/pixelphone/answe
| r/9516618?hl=en
| petercooper wrote:
| I'll probably explore using this, but I've used an app called
| Just Press Record to do what you say. Runs on Apple Watch too,
| so you can tap a complication at any time in the day, speak,
| and you get a transcript on your phone, etc.
| anigbrowl wrote:
| Oh nice - I have an immediate use case for this. This looks
| accessible enough that the sci-fi dream of instantaneous audio
| translation is suddenly within reach.
| petercooper wrote:
| Just tested this on some developer podcasts which usually fail
| hard given they're full of technical jargon, brand names, etc.
| Whisper is a revolution! It's picking up terms like Heroku,
| DigitalOcean, GitHub, ECS, AWS, etc. and capitalizing properly -
| something nothing else did unless you provided a whole pile of
| guiding vocabulary.
| ma2rten wrote:
| Did these podcasts have transcripts? You might be inadvertently
| evaluating it on data that it was trained on, which is
| basically cheating. Even if not, it might be trained on similar
| podcasts. Judging how good these kinds of models are is really
| hard.
| petercooper wrote:
| No transcripts, no. And recent episodes, within the past
| couple of weeks, so probably not part of the training either.
| WiSaGaN wrote:
| True. The test should only be done on the material released
| _after_ the model.
| code51 wrote:
| First off, it seems that the model can easily run on M1/M2 with
| minor modification. However, the `aten::_index_put_impl_` operator
| is currently not supported, and the fallback always slows things
| down quite a lot.
|
| Second, is there a bug with how the script processes incoming
| audio segments? For a short 4 second clip, what I got was:
|
| > [00:00.000 --> 00:03.760] Okay, Eunice, travel plans. I need to
| be in New York on Monday, L.A. on Tuesday, New York on Wednesday,
| L.A. on Thursday. You're knocking Friday. Got it?
|
| > [00:03.760 --> 00:28.760] Got it.
|
| However the final segment should have been shy of 1 second. It
| mistakenly thinks the last segment was 25 seconds long and makes
| you wait for processing.
| Jnr wrote:
| Cool!
|
| I am one of the top contributors to the tiny Mozilla Common Voice
| dataset for my language. The dataset is very small compared to
| those for other popular languages, and none of the other datasets
| mentioned contribute anything in that language to Whisper's
| training.
|
| And even with so little data to train on, it still works
| surprisingly well.
| catfan wrote:
| [zalgo redacted]
| dang wrote:
| Hey - can you please not zalgo on HN? It messes up the
| threads. I've redacted it from your posts now.
| archon1410 wrote:
| Where do they mention what datasets they've used? I've tried
| looking at the paper but can't find it.
| archon1410 wrote:
| Nevermind: I found it. It's on page 19 and 20 of the paper,
| under Appendix A ("Evaluation Datasets").
| jdmoreira wrote:
| Looking forward to see if this works well with foreign accents
| mminer237 wrote:
| They have an example in the post with a very thick Scottish
| accent. You should listen to it. It's pretty impressive.
| localy wrote:
| Are there any published benchmarks available outlining how this
| compares to other open source ASR software, such as Coqui.ai?
| NaturalPhallacy wrote:
| This is pretty incredible! https://i.imgur.com/03UFGc8.gif
| jjwiseman wrote:
| I'm seeing some weird bugs. For example, in one 30 minute mp3,
| about 6 minutes in it decided that someone said "2200." And then
| exactly 5.000 seconds later, "2200". And every 5.000 seconds
| after that, for the next 24 minutes. (No one actually repeated
| "2200" for 24 minutes.)
|
| A second run gave better results, but in most runs I do see
| instances where phrases repeat from 2-20 times.
| bickett wrote:
| Hard to keep up with all the great things. The AI community is
| really moving quick right now.
| aidenn0 wrote:
| For those on NixOS, here's a quick and dirty flake.nix that will
| let you make a venv in which to "pip install"'
|
| Just put it in a flake.nix, and "nix develop" followed by
| "virtualenv ./venv; . ./venv/bin/activate; pip install
| git+https://github.com/openai/whisper.git"
|
|       {
|         description = "Python 3.9 development environment";
|         outputs = { self, nixpkgs }:
|           let
|             system = "x86_64-linux";
|             pkgs = import nixpkgs { inherit system; };
|           in {
|             devShells.${system}.default = pkgs.mkShell {
|               buildInputs = [
|                 pkgs.ffmpeg
|                 pkgs.python39
|                 pkgs.python39Packages.pip
|                 pkgs.python39Packages.numpy
|                 pkgs.python39Packages.pytorch
|                 pkgs.python39Packages.virtualenv
|               ];
|             };
|           };
|       }
| aidenn0 wrote:
| This should, in theory, work with CUDA; my GPU doesn't have
| enough RAM to do it (it runs out at 2.9GiB allocated, I have
| 4GiB, but am running a compositing desktop, which chews up
| about 600MiB; not sure where the other ~400MiB went)
|
| [edit]
|
| I confirmed CUDA worked with the "small" model, which used
| 3.3GB of GPU ram, and resulted in _much_ poorer recognition
| than the "medium" model on my CPU (but it ran at least two
| orders of magnitude faster).
|
|       {
|         description = "Python 3.9 development environment";
|         outputs = { self, nixpkgs }:
|           let
|             system = "x86_64-linux";
|             pkgs = import nixpkgs {
|               inherit system;
|               config.allowUnfree = true;
|               config.cudaSupport = true;
|             };
|           in {
|             devShells.${system}.default = pkgs.mkShell {
|               buildInputs = with pkgs; [
|                 cudatoolkit
|                 linuxPackages.nvidia_x11
|                 cudaPackages.cudnn
|                 libGLU libGL
|                 xorg.libXi xorg.libXmu freeglut
|                 xorg.libXext xorg.libX11 xorg.libXv xorg.libXrandr
|                 zlib ncurses5 stdenv.cc binutils
|                 ffmpeg
|                 python39
|                 python39Packages.pip
|                 python39Packages.numpy
|                 python39Packages.pytorch-bin
|                 python39Packages.virtualenv
|               ];
|               shellHook = ''
|                 export LD_LIBRARY_PATH="${pkgs.linuxPackages.nvidia_x11}/lib"
|               '';
|             };
|           };
|       }
| magicalhippo wrote:
| CUDA worked fine with large on my 2080Ti FWIW. The speedup is
| ridiculous, as expected. My Ryzen 3800X spent almost an hour
| transcribing a minute's worth of speech, while the 2080Ti does
| it in something like 10-20 seconds.
| aidenn0 wrote:
| How much GPU ram did it use?
| magicalhippo wrote:
| I'm on Windows, using Task Manager, the dedicated GPU
| memory went from 1GB before run to about 9.8GB for the
| most time during run, peaking at 10.2GB. So pretty close
| to the 11GB limit of my 2080Ti it seems.
| BasilPH wrote:
| Any opinions on what this means for speech-to-text companies like
| rev.ai and assembly.ai?
|
| We've tested open source solutions for S2T, like Kaldi, but the
| quality was not good enough. However, one of the main advantages
| of a service like assembly.ai to me was that they offer sentence
| splitting in the form of punctuation, and speaker detection, which
| Kaldi does not.
|
| So I guess I answered my own question to some degree: an S2T
| service is more than just S2T. We already see assembly.ai add
| more and more features (like summarisation, PII redaction etc.)
| that are a value-add to plain S2T.
|
| Still, curious to hear what your take on that is.
| nshm wrote:
| You can apply the public punctuation model from Vosk on top of
| Kaldi output, and you can also get speaker labels with existing
| open source software.
|
| On a quick video transcription test, this model is more accurate
| than AssemblyAI and Rev AI. It will be harder for them to sell
| pure ASR now. Some more business-oriented applications will
| still be important though, for example ASR as part of a
| call-center analytics solution or as part of a medical ERP
| system.
|
| The value of automatic summarization is small; without AI it is
| very hard to get right, since you need to be an expert in the
| field to understand what is important.
| phren0logy wrote:
| Rev AI will also create a transcription separated by multiple
| speakers, which it doesn't appear Whisper can do (yet). I
| expect that Whisper will overtake the alternatives soon,
| given that it's open source, but today it's not there yet.
| adeptima wrote:
| Japanese results looks pretty impressive!
|
| Took matsukoukuzira14Tou gaHai An niDa chiShang gerareru
| osutoraria(2022Nian 9Yue 21Ri )
| https://www.youtube.com/watch?v=bZkNIzeRBk4
|
| Extracted audio with youtube-dl -f bestaudio
| https://www.youtube.com/watch\?v\=bZkNIzeRBk4
|
| Converted into [00:00.000 --> 00:13.000] osutorariaNan Bu noDao
| de, Zhen tsuXiang kuzira14Dong gaHai An niDa chiShang gerareteSi
| ndeirunogaJian tsukari, Zhuan Men Jia gaDiao Cha notameYuan Di Ru
| rishimashita. [00:13.000 --> 00:25.000] Yuan Di
| medeianiyorimasuto, osutorariaNan Bu nokinguDong de, 19Ri , Shao
| nakutomo14Dong noZhen tsuXiang kuziragaHai An niDa chiShang
| gerareteSi ndeirunogaJian tsukarimashita. [00:25.000 -->
| 00:31.000] hotondogaRuo iosutowoJian rare, Zhuan Men Jia gaXian
| Chang niZhong mukiDiao Cha niDang tatsuteimasu. [00:31.000 -->
| 00:41.000] kuziranoSi Hai haDa kikuYun ndariMai
| metarisurukotogaNan shiitame, Zi Ran niFen Jie sarerunowoDai
| tsuFang Zhen gaJian Tao sareteimasu. [00:41.000 --> 00:52.000]
| mata, Si Hai woJu i, samegaHai niJi maruKe Neng Xing
| gaarutoshite, Yuan Di Dong Ju hasahuanadoniZhou Wei niJin
| dukanaiyouniHu bikaketeimasu. [00:52.000 --> 01:02.000] Yi Fang
| , 21Ri nihatasumaniaDong deoyoso230Dong nokuziragaBang Bian niDa
| chiShang geraretaZhuang Tai deJian tsukarimashita. [01:02.000
| --> 01:07.000] oyosoBan Shu gamadaSheng kiteiruMo Yang deJi Zhu
| Huo Dong gaJin merareteimasu. [01:07.000 --> 01:23.000] Jian
| tsukatsutanoha, gondokuziranoZhong Jian toJian rareteimasu.
| knaik94 wrote:
| Did you try translating them to english? I want to see if you
| get a similar error as me with a random phrase "Translated by
| Releska" showing up.
| lynguist wrote:
| It's called hallucination. As the model is trained on noisy,
| weakly supervised data, such errors do occasionally happen. The
| model picks up that such phrases occur in translations and
| inserts them even if they do not appear in the source. This is
| described in the paper.
| knaik94 wrote:
| I came across it during a silent/instrumental portion in
| the song I was testing. I asked only because I am curious
| how frequently the error might show up, I don't expect it
| to be very common. It's looking at phrase level instead of
| word level timestamps which is going to make it hard to
| tokenize music. I asked simply because the parent comment
| also tested on Japanese.
| gzer0 wrote:
| Shocked at how good the results are, and how easy of an
| installation it is.
|
| Here are the exact steps to follow to get it running on Ubuntu
| 22.04 via WSL and yt-dlp:
|
|       1. pip install git+https://github.com/openai/whisper.git
|       2. yt-dlp -f 'ba' -x --audio-format mp3 https://www.youtube.com/watch/?v\=bZkNIzeRBk4
|       3. renamed the file to test.mp3
|       4. whisper test.mp3 --language Japanese --task translate --model large
|
| Note: the large model will download a ~3Gb file
| NaturalPhallacy wrote:
| I did something similar (my ytdl is ytdlp too). You don't
| even have to grab just the audio, it'll take a webm:
| https://i.imgur.com/03UFGc8.gif
|
| Amazing work.
| adeptima wrote:
| Because ffmpeg is used under the hood, it should process most
| formats:
|
| https://github.com/openai/whisper/blob/main/requirements.txt
| adeptima wrote:
| "--model large" option produces much better results at higher
| resources consuming costs
| tullie wrote:
| Great to see OpenAI finally being open :)
| simmanian wrote:
| Could someone tell me whether it's possible to somehow feed data
| into this project to improve its translation and transcription
| capabilities on our own?
| nicholasjarnold wrote:
| This is so cool! I was just speaking to a non-technical family
| member about privacy concerns around using "OK Google" and the
| like. They responded inquiring about "private" alternatives, to
| which my answer was "I'm not aware of good ones that give you
| that level of accuracy and convenience."
|
| Perhaps this development along with continued optimization and
| device compute power increases will lead us into a near-future
| where things like Mycroft devices and cellphones could have
| local-only speech-to-text and translation capabilities which are
| accurate even with environmental background noise variations
| encountered IRL.
|
| Great work OpenAI team!
| runlevel1 wrote:
| I ran it on some fire department radio recordings from scanners
| on Broadcastify. It did remarkably well.
|
| For reference, GCP's Speech-to-Text didn't detect any speech from
| this clip -- even when using the enhanced phone model.
| mwlp wrote:
| Super impressive. I tested it on a Japanese streamer whose
| enunciation isn't exactly perfect and it did a decent job:
| https://www.youtube.com/watch?v=ROiOU1scaNA
|       [00:00.000 --> 00:06.500] Since the last one started, the number of times I've eaten has decreased.
|       [00:06.500 --> 00:11.000] If I get too carried away with the last one, I'll get hungry and do it.
|       [00:11.000 --> 00:14.500] I don't have time to eat.
|       [00:15.500 --> 00:18.000] I'm going to eat now.
|       [00:20.000 --> 00:23.000] It's going to take about 10 minutes from here.
|       [00:23.000 --> 00:31.000] It's been a while since I've had my last meal.
|       [00:31.000 --> 00:36.000] I feel like I'm losing my Nu Zi Li.
|       [00:36.000 --> 00:39.000] I have to go back to my original self.
|       [00:39.000 --> 00:44.000] I have to get ready and go to bed.
|       [00:44.000 --> 00:46.000] It's not good.
|       [00:46.000 --> 00:51.000] I've been drinking a lot lately, so I'm going home.
|       [00:51.000 --> 00:53.000] I have to get my nails done this fall.
|       [00:53.000 --> 00:54.000] Halloween nails.
|       [00:54.000 --> 00:57.000] Halloween, Halloween, Halloween.
|       [00:57.000 --> 00:59.000] I'm going to the beauty salon today.
|       [00:59.000 --> 01:02.000] I'm going to get my nails done the day after tomorrow.
|       [01:02.000 --> 01:10.000] I used to look at a lot of clothes, but I stopped looking at them.
|       [01:10.000 --> 01:12.000] I'm going crazy.
|       [01:12.000 --> 01:22.000] My stomach's stopped in the middle of summer.
| alach11 wrote:
| How long until this gets implemented in Twitch? Real-time
| subtitles for any stream in the language of your choice?! That
| would be huge.
| adeptima wrote:
| translation is not the strongest part. transcription looks very
| good.
| magicalhippo wrote:
| It's struggling with Norwegian. Which I guess isn't shocking.
| The large model performs a fair bit better than the small,
| though neither is "good".
|
| Though I assume the amount of Norwegian it has been exposed to
| is fairly limited, so in that light I'm actually impressed as
| well.
|
| I tried it on a news segment from the radio[1], this is the
| large model output:
|
|       [00:14.000 --> 00:17.200] En skamlos krenking av FN pakten.
|       [00:17.200 --> 00:24.000] USAs president og verdensledere svarer pa den russiske presidentens atomtrusler og krigsmobilisering.
|       [00:25.500 --> 00:29.400] Arbeidsklaer som er ment til a vaere til begge kjonn, har det med a vaere tilpasset.
|       [00:29.400 --> 00:33.400] Men hvordan ville det gatt, om det var motsatt?
|       [00:34.100 --> 00:38.900] Dyrevernsorganisasjon vil ha digital merking av regnstyr,
|       [00:38.900 --> 00:44.900] men naeringen selv insisterer pa den gamle tradisjonsrike maten med rissing av kniv.
|       [00:45.600 --> 00:51.400] Mange stromselskaper er positive til a tilby kundene fastpris pa strom, og det arevis.
|       [00:51.400 --> 00:59.900] Da risikerer de a matte betale mye i nettopp aretsvis, sier aktorer som aldri tilbyr fastpris.
|       [00:59.900 --> 01:21.900] Dette er onsdagens Dagsnytten. Jeg heter Espen As.
|
| For reference, here's what he actually said, from the source[1]
| itself:
|
|       * En skamlos krenking av FN-pakten. USAs president og verdensledere svarer pa den russiske presidentens atomtrusler og krigsmobilisering.
|       * Arbeidsklaer som er ment a vaere til begge kjonn, er som regel tilpasset ... menn. Hvordan hadde det gatt om det var motsatt?
|       * Dyrevernsoganisasjon vil ha digital merking av reinsdyr, men naeringen selv insisterer pa den gamle tradisjonsrike maten med rissing av kniv.
|       * Mange stromselskaper er positive til a tilby kundene fastpris pa strom - og det i arevis. - Da risikerer de a matte betale mye i nettopp; arevis, sier aktor som aldri tilbyr fastpris
|
|       Dette er onsdagens Dagsnytt 18 - jeg heter Espen Aas.
|
| The translation didn't fare that well though:
|       [00:14.000 --> 00:17.000] A shameless violation of the UN treaty.
|       [00:17.000 --> 00:24.000] The US president and world leaders respond to the Russian president's nuclear threats and war mobilization.
|       [00:24.000 --> 00:33.000] Work clothes that are meant to be for both genders have to be suitable, but how would it be if it was the other way around?
|       [00:34.000 --> 00:44.000] The animal welfare organization will have a digital marking of reindeer, but the industry itself insists on the old traditional way of tearing a knife.
|       [00:45.000 --> 00:51.000] Many electricity companies are positive in offering customers fixed electricity prices, and that is annual.
|       [00:51.000 --> 00:58.000] Then they risk having to pay a lot in just a year, says an actor who has never offered fixed prices.
|       [00:58.000 --> 01:20.000] This is Wednesday's Dagsnytt 18. My name is Espen As.
|
| For reference, here's Google Translate's attempt, which is
| pretty good:
|
|       * A shameless violation of the UN Charter. The US president and world leaders respond to the Russian president's nuclear threats and war mobilization.
|       * Work clothes intended for both sexes are usually adapted to ... men. How would it have gone if it had been the other way around?
|       * Animal welfare organizations want digital marking of reindeer, but the industry itself insists on the old, traditional way of marking with a knife.
|       * Many electricity companies are positive about offering customers a fixed price for electricity - and for years. - Then they risk having to pay a lot in precisely; for years, says a player who never offers a fixed price
|
|       This is Wednesday's Dagsnytt 18 - my name is Espen Aas.
|
| [1]:
| https://radio.nrk.no/podkast/dagsnytt_atten/l_5ce3e323-97a3-...
| (not sure if it's available outside of Norway)
| perlgeek wrote:
| Everything (and everyone, including myself :D ) seems to
| struggle with Norwegian; it seems the corpus size is simply
| too small. And/or maybe the market is.
|
| Deepl didn't do any Norwegian last I looked, even though it
| does most other Germanic languages (including Danish and
| Swedish).
|
| Duolingo doesn't have a Norwegian class for Germans either,
| though they do have one with English as the source language.
| olao99 wrote:
| How are you getting the transcription of the NRK episode? I
| am learning Norwegian and often struggle to find reliable
| transcriptions for audio where the text exactly matches the
| audio (often subtitles are heavily edited compared to what's
| actually being said)
| magicalhippo wrote:
| The stuff I quoted was listed as an abstract of sorts for
| the episode. I know NRK is very good at providing subtitles
| for their TV productions, but as you say they're
| abbreviated.
|
| I'm guessing maybe audio books along with the actual books
| would be the best source for such? I mean there's Mozilla
| Voice, but it's quite limited in the Norwegian department
| and perhaps not quite as interesting as an audio book would
| be.
| magicalhippo wrote:
| Re-reading the transcription, I guess I was a bit harsh by
| saying it's not "good". It gets most of it right, but it
| keeps messing up some key words. Like "regnstyr" (not a word)
| rather than "reinsdyr" (reindeer), or "Dagsnytten" rather
| than "Dagsnytt 18".
|
| It also didn't handle the hanging "... menn", instead
| thinking it was the start of the following sentence. Almost
| everyone would understand it was the end of the sentence
| based on the context.
|
| The double-A vs A is not an issue as it's the same letter,
| double-A is the older form.
|
| The small model was considerably worse than the large one
| though.
| kiwih wrote:
| Given this, are there good (and available/open source) models for
| text to speech? Last time I tried everything still sounded
| extremely robotic, and/or were a pain to set up and run. It would
| be fun to set up a pipeline where the two processes
| 'communicate'.
| obscur wrote:
| Measuring performance in rounds of successful Chinese whisper
|
| (irony)
| Bayko wrote:
| So I guess we can easily use this to generate subtitles?? Which
| would be nice! Cause ummm some of the movies that I download from
| the internet arrrrrr! don't have subtitles available
| pen2l wrote:
| Neat, https://github.com/openai/whisper - they have open-sourced
| it, even the model weights, so they are living up to their name
| in this instance.
|
| The 4 examples are stunningly good (the examples have speakers
| with heavy accents, speaking in foreign language, speaking with
| dynamic background noise, etc.), this is far and away better than
| anything else I've seen. Will be super curious to see other folks
| trying it out and seeing if it's as robust as it seems, including
| when confronted with audio speech with natural tics and uhhh's
| and uhmm's and everything in-between.
|
| I think it's fair to say that AI-transcription accuracy is now
| decidedly superior to the average human's, what the implications
| of this are I'm not sure.
| anigbrowl wrote:
| It was already better. I edit a podcast and have > a decade of
| pro audio editing experience in the film industry, and I was
| already using a commercial AI transcription service to render
| the content to text and sometimes edit it as such (outputting
| edited audio).
|
| Existing (and affordable) offerings are so good that they can
| cope with shitty recordings off a phone speaker and maintain
| ~97% accuracy over hour-long conversations. I'm sure it's been
| an absolute godsend for law enforcement and other people who need
| to gather poor-quality audio at scale, though much less great
| for the targets of repressive authority.
|
| Having this fully open is a big deal though - now that level of
| transcription ability can be wrapped as an audio plugin and
| just used wherever. Given the parallel advances in resynthesis
| and understanding idiomatic speech, in a year or two I probably
| won't need to cut out all those _uuh like um y 'know_ by hand
| ever again, and every recording can be given a noise reduction
| bath and come out sounding like it was recorded in a room full
| of soft furniture.
| adamgordonbell wrote:
| I've not found that to be the case.
|
| For technical content, I use Rev.com and provide a glossary
| and real humans do the transcript. Other AI transcription
| services get lots wrong because the context often matters.
| Words like "TCP/IP" or "FAT disk format" or "Big Endian" I've
| never found AI so far to handle well.
|
| I'm interested to test out whisper on this one.
|
| https://corecursive.com/063-apple-2001/
| deegles wrote:
| There's already software that can imitate a person's voice,
| so we have all the pieces already to do speech-to-text, clean
| up with GPT-3, and back to text-to-speech in the original
| person's voice. Maybe with a style transfer to keep the
| person's inflections etc the same?
| Karuma wrote:
| I think something similar already exists. See this, for
| example: https://koe.ai/recast/
|
| Although I don't know if they're using anything similar to
| what you suggest. Very cool idea, anyway!
| biomcgary wrote:
| Since you work on podcasts, do any open source transcription
| tools currently identify the speaker in the output? This
| would be particularly helpful for interviews.
| nico wrote:
| Not sure about open source, but in general, automated
| transcription systems need a separate track for each
| different speaker. So for example, for a phone call with
| one person on each end, you need two separate channels
| (recording systems usually split them left/right on one
| stereo file).
| solarmist wrote:
| Any recommendations for particular services?
| anigbrowl wrote:
| I use a service called sonix.ai. It's paid but I think they
| have a free tier or trial period, and it's not very
| expensive. I'm excited about this new OpenAI thing because
| I'd rather do it on my own hardware than send it to the
| cloud, but this company has earned its commercial success.
| nonoesp wrote:
| I'm not sure if you've tried Descript, but their ML-based
| "Studio Sound" filter makes bad audio sound like it was
| recorded and edited nicely.
| solarmist wrote:
| That is an exciting possibility. Being able to fix bad setups
| and missed takes automagically. It's always been possible,
| just expensive and time consuming for moderate improvements.
| thfuran wrote:
| >~97% accuracy over hour-long conversations. I'm sure it's
| been an absolute godsend for law enforcement
|
| 97% accuracy means roughly three or four errors per minute of
| speech. That seems potentially extremely problematic for
| something like law enforcement use where decisions with
| significant impact on people's day and/or life might be made
| on the basis of "evidence".
| gs17 wrote:
| Yeah, I tried to use automated transcription for a research
| project and we had to do it all manually because the few
| errors (I would say it did pretty well given our recording
| quality) were often dropping words like "not", which
| changed the whole meaning of a sentence! It was a useful
| assistance during transcription, but I really hope they
| would verify it was correct before arresting anyone based
| on it.
| anigbrowl wrote:
| No it isn't. That just means 2-3% of your content needs to
| be double-checked by a person at the audio level, saving
| huge amounts of time - equally true of human transcription,
| in which individual words are often [UNINTELLIGIBLE].
|
| Would you want to review this fully before going into
| court, absolutely - because you'd want to play the
| recording to a jury for emotional impact. Can you rely on
| it when you want to quickly read through hours of
| conversation and make decisions about whether to invest
| further resources (which might just mean another hour of
| listening back to the original audio)? Also absolutely.
| Bear in mind that a lot of these errors have little to no
| semantic impact, being on the same level as typos or
| misspellings in a written communication.
|
| Bear in mind too that if law enforcement (honest or not) is
| so interested in you that they're willing to record your
| conversations, your day is already ruined, you just don't
| know it yet. The change here is one of scale rather than
| quality.
| wging wrote:
| Doesn't it mean 100% of your content needs to be double-
| checked? You can't easily identify which 2-3% of your
| content has errors. I'm aware that errors are more likely
| when the model is less confident of its predictions, but
| that shouldn't be enough.
|
| (edit for clarification: errors are not always something
| like "[UNINTELLIGIBLE]", where the system knows it
| doesn't know; they can also be misrecognitions that the
| system believes in with high confidence.)
| u8 wrote:
| I had to do a lot of manual transcription in Journalism
| school. Using a tool like Descript saved HOURS of my
| life. Generally it was 80% accurate, but going over an
| two-hour-long recording again at 3x speed while reading
| over the transcript, fixing errors from memory or pausing
| took a five hour job down to 30-40 minutes. Either way,
| somebody is going to have to listen to the recording.
| This just removes a layer of grunt work.
| 6gvONxR4sf7o wrote:
| > I'm aware that errors are more likely when the model is
| less confident of its predictions, but that shouldn't be
| enough.
|
| Suppose 90% of the errors are in the 10% where the model
| is least confident. Then you can review just 10% of your
| content and take a 2% error rate down to 0.2% error rate.
| woah wrote:
| You double check things that you think are important, in
| this case, passages that will be used as evidence in
| court.
| guelo wrote:
| Maybe you could run the text through a grammar checker to
| identify the errors.
| thfuran wrote:
| That might work if people were required to speak
| grammatically.
| NaturalPhallacy wrote:
| For real. The way people normally speak, with
| backtracking, repetition, restarting sentences, or
| stopping mid sentence and starting a new one with
| entirely different nouns or entire subjects is perfectly
| normal in synchronous conversation and isn't jarring, but
| written down as is, it's like 40% noise.
| worthless-trash wrote:
| For a good example of this, read ANY of Trump's speeches
| transcribed.
| NaturalPhallacy wrote:
| I mean if you want to make it unnecessarily political,
| Biden's are worse:
| https://www.youtube.com/watch?v=3bWM1zsnTJc
| worthless-trash wrote:
| Oh no no, I wasn't trying to be political, it's just one
| that I read... and wow, you're right!
| gzer0 wrote:
| To be fair, you chose a video that displays an
| amalgamation of the biggest gaffes of 2021 for Biden.
|
| "During his term as President of the United States,
| Donald Trump made tens of thousands of false or
| misleading claims. The Washington Post's fact-checker had
| tallied the number as 30,573 by January 2021, an average
| of about 21 per day by the end of his presidency."
| [1][2][3][4]
|
| I think it's fair to say there would be a 100 hour long
| plus video / documentary if they were all compiled into
| one. lovely! - [1] Fact Checker (January
| 20, 2021). "In four years, President Trump made 30,573
| false or misleading claims". The Washington Post.
| Archived from the original on January 20, 2021.
| - [2] Kessler, Glenn (January 23, 2021). "Trump made
| 30,573 false or misleading claims as president. Nearly
| half came in his final year". The Washington Post.
| Archived from the original on January 24, 2021. Retrieved
| January 24, 2021. - [3] Elfrink, Tim (August
| 14, 2020). "'Do you regret at all, all the lying you've
| done?': A reporter's blunt question to Trump goes
| unanswered". The Washington Post. Retrieved August 14,
| 2020.
|
| [4] https://en.m.wikipedia.org/wiki/Veracity_of_statement
| s_by_Do...
| donkarma wrote:
| TheCapeGreek wrote:
| Having done audio transcription in college as a side gig,
| it takes a lot longer than it sounds. Even at a decent
| 100wpm you'll take about 5 minutes to type out 1 minute
| of audio.
|
| Not having to pause + rewind will save a ton of time for
| that 3%.
| vivegi wrote:
| You can also use multiple transcription engines and then
| use mismatches among the text streams to narrow down the
| % of content that needs to be reviewed. This is quite
| similar to multi-voting OCR for document images.
|
| The principle is that the engines have different failure
| modes (hopefully) and therefore the 2-3% error rate of
| each engine is in different areas of the audio. The key
| underlying assumption is that the events are mutually
| exclusive.
|
| With 3 engines, you can use something like 2-of-3 stream
| matches to override the stream that mismatches.
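|
| A toy sketch of the voting step (this assumes the three
| transcripts have already been aligned word-for-word, which a
| real system would need a separate alignment pass for):
|
|       # 2-of-3 vote across three word-aligned transcripts; positions
|       # where no two engines agree get flagged for human review.
|       from collections import Counter
|
|       def vote(words_a, words_b, words_c):
|           merged, review = [], []
|           for i, trio in enumerate(zip(words_a, words_b, words_c)):
|               (word, count), = Counter(trio).most_common(1)
|               merged.append(word)
|               if count < 2:
|                   review.append(i)
|           return merged, review
|
|       text, to_review = vote("the cat sat".split(),
|                              "the cat sat".split(),
|                              "the bat sat".split())
|       print(" ".join(text), to_review)   # "the cat sat" []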
| anigbrowl wrote:
| By the time you're prosecuting someone in court, yes of
| course you double, triple, quadruple check everything.
| That's why lawyers get paid the big bucks (for now...).
| But yes you can identify which content probably has
| errors and flag it as such.
|
| Look, I have decades of experience dealing with human
| speech, and not just as an editor - I can trace the human
| voice from neural impulses in Broca's region through the
| physiology of vocal production, mechanical transduction
| into electrical signals, discrete fourier transforms of
| the resultant waveforms into spectral information and
| back again, the reproduction of altered signals from
| time-aligned speakers to create a sense of
| spatialization, how those are processed in the human ear,
| and how the cilia are connected by nerves back to your
| brain. I'm a good enough editor that I can recognize many
| short words by sight of a waveform, or make 10 edits in a
| row by sight and know it will sound good on playback.
|
| So when I say that machine transcription is as good as
| human realtime transcription now, I say so with the clear
| expectation that those decades of craft are very close to
| being rendered obsolete. I absolutely expect to hand off
| the mechanical part of editing to a machine within 2
| years or so. It's already at the stage where I edit some
| interviews as text, like in a word processor, and then
| export the edited document as audio and it's Good Enough
| - not for every speaker, but more than half the time.
|
| NPR and a lot of commercial broadcasters cut their
| material this way already, because you can get the same
| result from 30 minutes of reading and text editing that
| would require 3 hours of pure audio editing with no
| transcription.
| yourapostasy wrote:
| _> So when I say that machine transcription is as good as
| human realtime transcription now..._
|
| Would you go as far as to assert machine transcription
| can be used as an objective benchmark of a speaker's
| verbal legibility?
|
| It is fraught with political and interpersonal dynamics
| to approach someone even privately one on one today and
| gently suggest their career would get a huge boost if
| they hired a voice coach to help improve their verbal
| communication delivery. So even when I don't directly
| mention their accent, it becomes a very sensitive subject
| with many.
|
| However, if audio professionals like you can point to a
| system and say the raw biomechanics and acoustic physics
| of the world dictate that this is as physically and
| psychometrically good as audio parsing of human speech
| gets regardless whether the system was biologically
| evolved or ML evolved, the conversation can be couched
| even more objectively.
|
| I enable recording and voice transcription in every
| meeting I can (ostensibly for DE&I but really for my own
| selfish purposes), and already observe in myself I have
| to work hard to overcome a tendency to gloss over
| speakers who don't transcribe well when I review meeting
| transcripts to jot down any key information I might have
| missed taking notes upon during the meeting.
|
| Note that I'm perfectly aware that my foreign language
| verbal skills are nowhere near the English skills of
| those I have tried to help. If the _lingua franca_ of the
| coding world switched to Urdu tomorrow, then I'd hire
| help to learn and polish my spoken Urdu, like I went to a
| speech coach when learning public speaking because I can
| always use help in the many skills I lack.
| frognumber wrote:
| What tools do you use to do this? I once hacked together
| an editor like this maybe a decade ago -- edit speech as
| text from speech recognition -- and sorely need one now.
|
| Alignment of video to text is a big problem for me too.
| boundlessdreamz wrote:
| This can be done via https://www.descript.com/ You can
| edit video/audio by editing the transcript.
|
| You can even add/modify words that weren't originally
| there https://www.descript.com/overdub
| etienne618 wrote:
| Presumably you can use the 97% that is correctly
| transcribed to rapidly narrow things down to the relevant
| content. This is likely to be only a small portion of the
| total content. Then you check 100% of that.
| datalopers wrote:
| If you know which 2-3% are the false positives, you have
| a very lucrative business model.
| MonkeyMalarky wrote:
| When doing validation, I find it will often be the same
| errors repeated again and again in a transcription. For
| example, it will fail on someone's or something's name
| (one that is rare or unique) and map it onto a known,
| similar-sounding word.
| gnramires wrote:
| I think an [UNINTELLIGIBLE] indication would be a great
| addition to automatic transcription systems.
| inanutshellus wrote:
| It'd [UNINTELLIGIBLE score="92%" alternatives="pro-
| rabble; pourable"]probably[/UNINTELLIGIBLE] be useful to
| make a markup-based output... though you'd probably find
| it gave you more info than you wanted.
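|
| A rough sketch of how that could look today: Whisper's
| transcribe() output already carries per-segment confidence
| fields (avg_logprob, no_speech_prob), so low-confidence
| segments can be marked for review. The thresholds and the
| file name below are made-up placeholders, not tuned values:
|
|     import whisper
|
|     model = whisper.load_model("medium")
|     result = model.transcribe("interview.mp3")  # placeholder file
|
|     # Flag segments whose confidence fields look weak; thresholds
|     # are illustrative guesses, not recommendations.
|     for seg in result["segments"]:
|         shaky = seg["no_speech_prob"] > 0.5 or seg["avg_logprob"] < -1.0
|         marker = "[UNINTELLIGIBLE?] " if shaky else ""
|         print(f"[{seg['start']:07.2f} --> {seg['end']:07.2f}] {marker}{seg['text'].strip()}")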
| anigbrowl wrote:
| It already exists. The commercial product I use most is
| called sonix.ai, and I think they have a free tier or
| trial period. It has shortcomings, but it's shockingly
| good.
| yencabulator wrote:
| Google Voice voicemail transcription _used_ to do this,
| with varying levels of gray. It seems that feature is
| gone, now.
| thfuran wrote:
| >equally true of human transcription, in which individual
| words are often [UNINTELLIGEBLE].
|
| ML systems somewhat notoriously do not necessarily make
| the same sorts of errors that a human would. And I'd
| expect a large portion of the errors to be transcribing
| the wrong words rather than indicating that a word
| couldn't be transcribed. That sort of error means that
| you can't really get away with manually reviewing just 3%
| of the audio.
| notahacker wrote:
| ML tending to make _weird_ mistakes rather than subtle
| ones that make sense in context like human transcribers
| is likely to make them easier to spot.
|
| And there are humans in the loop too, and an enormous
| amount of redundancy in the questions and answers, so even
| plausible false transcriptions will get picked up on if
| they matter. Nobody gets sent to jail simply because the
| transcription process - human or machine - accidentally
| substitutes "I did it" in place of "I didn't" midway
| through a two hour interview.
| BartjeD wrote:
| The thing is that 'Likely' is very far away from
| 'always'. There is no guarantee the mistake will be easy
| to spot.
|
| For entertainment purposes AI transcription is awesome.
|
| For serious business applications the ability to
| recognize mistakes will continue to be a field to which
| serious attention is given. It would be interesting to
| see an AI process double-check itself, and also run a
| logic check on whether the transcription makes sense. So
| that it can report sections flagged as incongruous or of
| dubious reliability.
| iroh2727 wrote:
| +1. There is a widespread "metric fallacy" or "task
| fallacy" going around. Models of course optimize for
| metrics, so they tend to perform well on those related
| metrics.
|
| Humans, however, are not simply metric optimizers. Though
| it's always in the interest of those corporations
| producing metric optimizers (i.e. models) to paint humans
| as such, so their models shine in comparison. They want
| humans to look like bad machines, so it looks like they
| should be automated. Not to say they shouldn't in many
| cases, just that there's a clear one-sidedness in all
| corporate PR (and funded research, especially that
| research which is also PR).
|
| All this to say that yes I agree with you. And if we
| humans don't want our unsustainable economic growth to
| turn us even more into machines (as our bureaucratic
| creep has done quite well thus far), we should fight such
| rhetoric that aims to paint humans simply as machines or
| task-doers.
| golem14 wrote:
| One would think that the few crucial bits of information
| gleaned are listened to manually, and the machine
| transcription is not the only thing the judge or a jury sees.
| thfuran wrote:
| You have absolutely ruined someone's day way before
| they're sitting in front of a jury.
| formerly_proven wrote:
| Stuff like that is a very good tell that someone has zero
| experience with law enforcement.
| CTDOCodebases wrote:
| I imagine a certain percentage of a given population is on
| a voice call at any one time.
|
| 1. Set up a computer with voice recognition software that
| flags certain patterns.
|
| 2. Connect computer to voice call communication network.
|
| 3. Configure computer to switch between calls every x
| number of seconds.
|
| Think of it like a system to generate leads for law
| enforcement that can be integrated with other systems to
| produce the best quality leads.
| NaturalPhallacy wrote:
| This is called "a fishing expedition" and is _wildly_
| unconstitutional in the US.
|
| > _The right of the people to be secure in their persons,
| houses, papers, and effects, against unreasonable
| searches and seizures, shall not be violated, and no
| Warrants shall issue, but upon probable cause, supported
| by Oath or affirmation, and particularly describing the
| place to be searched, and the persons or things to be
| seized._
| CTDOCodebases wrote:
| Are you sure about that? [0]
|
| Besides I wasn't talking about the USA when I said this.
| I was remembering a conversation I once had with a person
| who worked as a technician in a telephone exchange.
|
| [0] - https://en.wikipedia.org/wiki/Jewel_v._NSA
| jjoonathan wrote:
| Yes, it is wildly unconstitutional, but in practice don't
| the courts endorse the asinine "it's not a search unless
| we find something" argument from the NSA?
|
| Power always just finds a way to rationalize what it
| wants to do.
| kurisufag wrote:
| see: Operation PRISM
| Thorentis wrote:
| Not really. Imagine that they do simple keyword matching on
| the text. Anything the matching misses (the other 97%), the
| criminals get away with. Anything that matches (the 3%) is
| then checked by a human, by listening to the audio at that
| timestamp. So you only need to manually check the 3%, and
| even then only if something you're interested in is found.
| j-krieger wrote:
| I've worked with similar technology in the law enforcement
| space and the software is never used to make decisions. You
| can make out critical timestamps in conversations and a law
| enforcement officer will always manually confirm the
| software's assessments.
| JohnFen wrote:
| Given that law enforcement has made similar claims about
| technology use in the past that turned out to be false, I
| have no faith in this claim.
| hadlock wrote:
| Microsoft announced their voice transcription technology a
| couple of years ago and were also touting ~97-98% accuracy,
| which was actually _better_ than human transcription error
| rates. The errors usually stem in part from people garbling
| their own speech, or from moving their head while talking so
| the microphone misses a syllable. Anything in that error bar
| would probably fall under "reasonable doubt".
| kyriakos wrote:
| If it's anything like Microsoft Teams transcription, I
| doubt the 97%+ accuracy.
| knaik94 wrote:
| It seems far from good with mixed-language content, especially
| with English and Japanese together. The timestamps are far from
| perfect, and it's nowhere close to human level for the more
| ambiguous translations that depend on the context of a word.
| It's far below what anyone who spoke either language would
| consider acceptable. Maybe it's unfair to use music, but music
| is the most realistic test of whether it's superior to the
| average human.
| quickthrower2 wrote:
| Some music is hard for even people to make out the lyrics to.
| soheil wrote:
| Their name reminds me of the company McDonald's uses to supply
| their beef, called 100% Pure Beef Inc., so they can say 100%
| Pure Beef on their menu.
| space_fountain wrote:
| This seems to not be true for McDonald:
| https://www.snopes.com/fact-check/mcdonalds-100-beef/
| cutierust wrote:
| soheil wrote:
| This article seems very suspect to me. This is the main
| reason they give for asserting the claim is false:
|
| "While this is a fascinating premise, there's nothing to
| it: McDonald's hamburger patties in the U.S. are made with
| 100% USDA-inspected beef. They are cooked and prepared with
| salt, pepper and nothing else; no preservatives, no
| fillers.
|
| McDonald's of Australia's "Make Up Your Own Mind" web site
| said the following of the rumor in its Top FAQs section:
| Is it true that McDonald's created a company called "100%
| Australian Beef" just so they can say that in their
| advertising? No."
|
| So if I'm McDonald's and want to squash a negative story
| why not throw a few bucks at the pinnacle of journalism
| that is Snopes? (formerly Urban Legends Reference Pages)
| space_fountain wrote:
| This isn't exactly a hard story to fact check. There is zero
| evidence for this in the reddit thread or really anywhere.
| If they were willing to lie about the company name, why not
| just lie about the beef in their burgers? It would be equally
| scandalous.
| soheil wrote:
| The company name could be 100% legit; there is nothing
| stopping you from forming a company with that name and
| not even selling beef.
| sam_goody wrote:
| It definitely happens.
|
| There are at least two companies that have branded [..]
| Kosher Gelatin(tm). One of them makes gelatin that is
| considered non-kosher by all of the major kashrus
| agencies.
|
| "Kosher Gelatin(r)", when in the ingredients, just means
| the product contains pork.
| samatman wrote:
| I believe that you believe this, but you got had. Pretty
| funny though.
| mrtranscendence wrote:
| For what it's worth, I've spent a few minutes googling
| and can't find any story that corroborates this. The only
| US trademark I can find around "kosher gelatin" is by the
| brand Kolatin, which is apparently certified Kosher.
| jsight wrote:
| You are right, it could be. The problem is that it's the
| kind of thing that would be almost impossible to disprove
| if it were false. So you can always raise doubts about a
| supposed disproof.
|
| But it'd be really easy to prove if it were true and
| no one has offered proof. And there've been plenty of
| people who've looked for such proof, afaict.
|
| My default assumption in such cases is that it is likely
| false.
| jefftk wrote:
| If this was more than an urban legend someone would be
| able to dig up a company with this name and some
| indication that McD was working with them.
| pessimizer wrote:
| Something being possible to do isn't enough evidence for
| rational people to believe that it happened. From my
| perspective, it's possible that you're Iron Mike Tyson,
| or that you died after your last comment and this one was
| posted by the assassin who killed you.
| soheil wrote:
| What? I never said it's evidence that it did happen,
| please don't make things up. I just pointed out the
| evidence provided to refute the claim is possibly
| invalid.
| pessimizer wrote:
| You haven't offered any evidence is the point.
| soheil wrote:
| Because I'm not trying to prove whether it did or not, but
| rather to draw a parallel between that and OpenAI's name.
| For all I care it could be an urban legend, but that's
| not the point.
| [deleted]
| whichfawkes wrote:
| In the US, for a while I remember we had billboards
| advertising McDonald's burgers as being "1 <hamburger>
| <hamburger>% beef". Because the hamburgers were of course
| circular, it looked kind of like "100%".
|
| I remember thinking that surely an image of a hamburger
| does not legally constitute a zero.
| leobg wrote:
| Seems like this is an urban legend.
|
| https://www.reddit.com/r/IsItBullshit/comments/2rztov/isitbu.
| ..
| soheil wrote:
| This seems to be primarily based on the referenced Snopes
| article https://news.ycombinator.com/item?id=32929237
| amelius wrote:
| If consumer laws are so easily circumvented then I have
| little respect for those making these laws.
| [deleted]
| bambax wrote:
| The French version is a little contrived. The speaker is a
| native speaker, but the text is obviously the result of a
| translation from English to French, not idiomatic French.
|
| I will try to put the code to the test, see how it goes.
| octref wrote:
| I'm interested in building something with this to aid my own
| French learning. Would love to read your findings if you end
| up posting it somewhere like twitter/blog!
| bambax wrote:
| Tried again with Blaise Pascal -- the famous fragment of a
| letter where he says he's sorry he didn't have enough time
| to make it shorter.
|
| Original:
|
| > _Mes reverends peres, mes lettres n'avaient pas accoutume
| de se suivre de si pres, ni d'etre si etendues. Le peu de
| temps que j'ai eu a ete cause de l'un et de l'autre. Je
| n'ai fait celle-ci plus longue que parce que je n'ai pas eu
| le loisir de la faire plus courte. La raison qui m'a oblige
| de me hater vous est mieux connue qu'a moi. Vos reponses
| vous reussissaient mal. Vous avez bien fait de changer de
| methode ; mais je ne sais si vous avez bien choisi, et si
| le monde ne dira pas que vous avez eu peur des
| benedictins._
|
| Transcription:
|
| > Mes reves errent peres, mais l'detre navais pas accoutume
| de se suivre de si pres ni d'detre si etendu. Le peu de
| temps que j'sais eu a ete cause de l'de l'de l'de autre.
| J'sais n'detre plus longue que parce que j'sais pas eu le
| loisir de la faire plus courte. La raison qui m'sa obligee
| de me hater vous est mieux connue qu'moi. Vos reponses vous
| reussissaient mal. Vous avez bien fait de changer de
| methode, mais je ne sais pas si vous avez bien choisi et si
| le monde ne dira pas que vous avez eu peur des benedictes.
|
| Here there are many more mistakes, so many that the
| beginning of the text is unintelligible. The language from
| the 17th century is probably too different. Still on the
| "medium" model, as the large one crashes the Colab (not
| sure how to select a beefier machine.)
|
| Still fascinating and exciting though.
| wazoox wrote:
| Depends on the way you're pronouncing it maybe. To be
| intelligible IMO it must be read differently from a
| modern text, with well sounding liaisons, and all vowels
| very distinct: "un" sounds differently from "in", "a"
| clearly differs from "a", "ai" and "e" from "e" and for
| instance the "e" in "etendues" must be pronounced, though
| not loudly.
|
| My test gives that, much better than yours:
|
| _Mes *reverants* peres, mes lettres n 'avaient pas
| accoutume de se suivre de si pres ni d'etre si etendues.
| Le peu de temps que j'ai eu a ete cause de l'un et de
| l'autre. Je n'ai fait celle aussi plus longue que parce
| que je n'ai pas eu le loisir de *l'af*faire plus courte.
| La raison qui m'a oblige de me *ra*ter vous est mieux
| connue qu'a moi. Vos reponses vous reussiss*ez* mal. Vous
| avez bien fait de changer de methode. Mais je ne sais si
| vous avez bien choisi et si le monde ne dira pas que vous
| avez eu peur des benedict*eurs*._
| bambax wrote:
| I'm playing with a Colab posted in this thread
| (https://news.ycombinator.com/item?id=32931349), and it's
| incredibly fun and accurate!
|
| I tried the beginning of L'etranger (because you seem to be
| a fan of Camus ;-)
|
| Here's the original:
|
| > _Aujourd'hui, maman est morte. Ou peut-etre hier, je ne
| sais pas. J'ai recu un telegramme de l'asile : << Mere
| decedee. Enterrement demain. Sentiments distingues. >> Cela
| ne veut rien dire. C'etait peut-etre hier._
|
| > _L'asile de vieillards est a Marengo, a quatre-vingts
| kilometres d'Alger. Je prendrai l'autobus a deux heures et
| j'arriverai dans l'apres-midi. Ainsi, je pourrai veiller et
| je rentrerai demain soir. J'ai demande deux jours de conge
| a mon patron et il ne pouvait pas me les refuser avec une
| excuse pareille. Mais il n'avait pas l'air content. Je lui
| ai meme dit : << Ce n'est pas de ma faute. >> Il n'a pas
| repondu. J'ai pense alors que je n'aurais pas du lui dire
| cela. En somme, je n'avais pas a m'excuser. C'etait plutot
| a lui de me presenter ses condoleances._
|
| Here's the transcription:
|
| > Aujourdhui, maman est morte, peut etre hier, je ne sais
| pas. J''ai recu un telegramme de l''asile. Mere decedee,
| enterrement demain, sentiment distingue. Cela ne veut rien
| dire. C''etait peut etre hier.
|
| > L''asile de Vieillard est a Maringot, a 80 km d''Alger.
| Je prendrai l''autobus a deux heures et j''arriverai dans
| l''apres midi. Ainsi, je pourrai veiller et je rentrerai
| demain soir. J''ai demande deux jours de conge a mon patron
| et il ne pouvait pas me les refuser avec une excuse
| pareille. Mais il n''avait pas l''air content. Je lui ai
| meme dit, ce n''est pas de ma faute. Il n''a pas repondu.
| J''ai alors pense que je n''aurais pas du lui dire cela. En
| somme, je n''avais pas a m''excuser. C''etait plutot a lui
| de me presenter ses condoleances.
|
| Except for the weird double quotes instead of the single
| apostrophe ('), it's close to perfect, and it only uses the
| "medium" model.
|
| This is extremely exciting and fun! Happy to try other
| texts if you have something specific in mind!
| bambax wrote:
| Last try for tonight with Baudelaire.
|
| Original:
|
|     Trois mille six cents fois par heure, la Seconde
|     Chuchote Souviens-toi !- Rapide, avec sa voix
|     D'insecte, Maintenant dit Je suis Autrefois,
|     Et j'ai pompe ta vie avec ma trompe immonde !
|     Remember ! Souviens-toi ! prodigue ! Esto memor !
|     (Mon gosier de metal parle toutes les langues )
|     Les minutes, mortel folatre, sont des gangues
|     Qu'il ne faut pas lacher sans en extraire l'or !
|
| Transcription:
|
| > Trois mille six cents fois par heure, la seconde chuchote
| << Souviens toi >>, rapide, avec sa voix d''insecte,
| maintenant dit << Je suis autrefois >>, et j''ai pompe ta
| vie avec ma trompe immonde. << Remember, souviens toi,
| prodigue, est au memoire, mon gosier de metal, parle toutes
| les langues, les minutes, mortelles folatres, sont des
| gangs qu''il ne faut pas lacher sans en extraire l''or. >>
|
| Not bad! Far from perfect but it's a difficult text.
| Interesting that it works better with Baudelaire than
| Pascal.
| pen2l wrote:
| Interesting, I'm a non-native French speaker, the original
| French piece struck me as being entirely normal (but maybe it
| was just the perfect French accent that swayed me). Can you
| please point out what he said which wasn't idiomatic or
| naturally-worded French?
| bambax wrote:
| Little details. The second sentence is really bizarre:
|
| > _Nous etablissons que l 'utilisation de donnees d'un tel
| nombre et d'une telle diversite est la raison pour laquelle
| le systeme est a meme de comprendre de nombreux accents..._
|
| It doesn't sound natural at all. An idiomatic formulation
| would be more along the lines of:
|
| _Le recours a un corpus [de donnees] si riche et varie est
| ce qui permet au systeme de comprendre de nombreux accents_
| (With 'corpus', 'donnees' is implied.)
|
| Of course this is just an example, and I'm sure other
| French speakers could come up with a different wording, but
| "donnees d'un tel nombre et d'une telle diversite" sounds
| really wrong.
|
| This is also weird and convoluted:
|
| > _Nous distribuons en tant que logiciel libre le code
| source pour nos modeles et pour l 'inference, afin que
| ceux-ci puissent servir comme un point de depart pour
| construire des applications utiles_
|
| It should at least be "le code source DE nos modeles" and
| "servir DE point de depart", and "en tant que logiciel
| libre" should placed at the end of the proposition (after
| 'inference').
|
| Also, "construire" isn't used for code but for buildings,
| and "applications utiles" is unusual, because "utiles"
| (useful) is assumed. "...pour le developpement de nouvelles
| applications" would sound more French.
| [deleted]
| aGHz wrote:
| That's interesting, as a quebecois I don't agree with any
| of this. The only thing that raised an eyebrow was "est a
| meme de", but if turns out it's just another way of
| saying "capable de", I guess it's simply not a common
| idiom around here. Aside from that, I found the wording
| flowed well even if I personally would've phrased it
| differently.
| slim wrote:
| Mystery solved. It was a Quebecois.
| mazork wrote:
| Gonna have to agree with the other reply, as a french-
| canadian, except for "servir comme un point de depart"
| which should be "servir de point de depart", that all
| sounds perfectly fine.
| bambax wrote:
| If this is actually "good" or even acceptable French
| Canadian, then it's a different language from French (and
| the blog post should mention it).
|
| I kind of doubt it though -- the speaker doesn't have a
| Canadian accent (which is hard to miss), and in my
| (admittedly limited) experience, French Canadian isn't
| that different from French.
| OrangeMusic wrote:
| How funny to see that to French people, Quebec French
| sounds like machine-translated English :)
| _plg_ wrote:
| At the start, the "Nous etablissons" part, for example. You
| wouldn't write that if you were starting from scratch in
| French.
| otikik wrote:
| That's the first thing that I discovered when I visited
| Paris for the first time.
|
| No one says "Nous", there, ever. Perhaps the politicians,
| while giving a speech. Everyone else uses the more
| informal "On".
|
| I felt duped by my French classes.
| not_math wrote:
| You can see from the transcript where the model made some
| errors, for example:
|
| > We distribute as a free software the source code for our
| models and for the inference [...]
|
| Should be
|
| > We are open-sourcing models and inference code [...]
|
| Another example
|
| > We establish that the use of such a number of data is
| such a diversity and the reason why our system is able
| [...]
|
| Should be
|
| > We show that the use of such a large and diverse dataset
| leads to improved robustness [...]
| DLeychIC wrote:
| try it out here: https://huggingface.co/spaces/openai/whisper
| Workaccount2 wrote:
| Can't wait to see twelve new $49.99/mo speech parser services
| pop up in the next few weeks.
| quickthrower2 wrote:
| Make hay before Google gives away free hay.
|
| That said there is value in integration of this into other
| things.
| quickthrower2 wrote:
| This has been running on my laptop all day for a 15 min
| mp3! Definitely not cheap to run, then (I don't want to
| imagine how much AWS compute would be required).
| darepublic wrote:
| > Neat, https://github.com/openai/whisper - they have open-
| sourced it, even the model weights, so they are living up to
| their name in this instance.
|
| Perhaps it will encourage people to add voice commands to their
| apps, which can then be sent to GPT-3.
| pabs3 wrote:
| Is the training dataset and code open too?
| suyash wrote:
| More of this is welcome; they should live up to their name and
| original purpose and share their other models (code, weights,
| dataset) with the open source community as well.
| catfan wrote:
| Simorgh wrote:
| I've been experimenting with voice-interfaces where typing is
| replaced by talking, but I find it hard to transition users to
| voice - we 'seem' to prefer typing to talking.
|
| I wonder if this will change.
| ironlake wrote:
| Personally, I would rather type than talk when interacting with
| a computer. The only time I use voice interfaces is when
| physical interface is so poor it's just easier to use voice.
| Apple TV devices are an example of this.
| shpx wrote:
| We shouldn't call this open source. The model definition + the
| data is the source code. The model weights are a compilation
| artifact.
|
| > The source code must be the preferred form in which a
| programmer would modify the program. [...] Intermediate forms
| such as the output of a preprocessor or translator are not
| allowed.
|
| > https://opensource.org/osd
|
| If I asked a programmer from OpenAI to modify the model to better
| support Japanese speakers from Hokkaido, their "preferred form"
| of the model's source code would include the 680,000 hours of
| audio used to train the model.
|
| Yes that means that there are almost no open source models and
| yes it's awesome that they released this and made the weights
| available. Just don't call it open source.
| nl wrote:
| This isn't really true.
|
| You can do a lot with weights and no training data - for
| example you can pull the end layer off it and use it as a
| feature extractor.
|
| And to modify it for Japanese speakers you'd fine-tune the
| existing model on additional data. If you wanted to modify the
| model you can (sometimes, depending on what you want to do)
| modify an existing architecture by removing layers, adding
| replacements and fine tuning.
|
| I don't quite know what the right analogy for the trained
| weights is.
| In many ways it is more valuable than the training data because
| the compute needed to generate it is significant. In other ways
| it is nice to be able to inspect the data.
|
| > The source code must be the preferred form in which a
| programmer would modify the program.
|
| As a machine learning programmer I'd much prefer the weights
| to the raw data. It's not realistic for me to use that
| training data in any way with any compute I have access to.
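|
| To make the feature-extractor point concrete, here's a minimal
| sketch using only the released weights; the file name is a
| placeholder and this is just one way to pull the encoder's
| representation out of the model:
|
|     import torch
|     import whisper
|
|     model = whisper.load_model("base")
|
|     # Preprocess a clip the same way Whisper does internally.
|     audio = whisper.pad_or_trim(whisper.load_audio("clip.wav"))  # placeholder path
|     mel = whisper.log_mel_spectrogram(audio).to(model.device)
|     # Match the model's parameter dtype to avoid fp16/fp32 mismatches.
|     mel = mel.to(next(model.parameters()).dtype)
|
|     # Use the frozen audio encoder as a feature extractor; no text decoding.
|     with torch.no_grad():
|         features = model.encoder(mel.unsqueeze(0))
|
|     # features has shape (1, n_frames, n_state) and can feed a downstream head.
|     print(features.shape)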
| rvz wrote:
| Yes. It's just like calling the release of compiled, closed
| binary blobs 'open source' even when the source for reproducing
| the compiled output is unavailable.
|
| > If I asked a programmer from OpenAI to modify the model to
| better support Japanese speakers from Hokkaido, their
| "preferred form" of the model's source code would include the
| 680,000 hours of audio used to train the model.
|
| Precisely. These 'users' lifting the model can't do it
| themselves. You will still be contacting OpenAI for support or
| to add support for another language and they will be the ones
| able to modify the model.
|
| > Just don't call it open source.
|
| That is true, it is still closed source, and already we are
| seeing the hype squad apologising to OpenAI as they
| 'open sourced' a closed model that you can't modify yourself.
|
| OpenAI is still business as usual and nothing has changed.
| MacsHeadroom wrote:
| >You will still be contacting OpenAI for support or to add
| support for another language and they will be the ones able
| to modify the model.
|
| This isn't quite correct. The model weights are all you need
| to fine-tune the model on your own with your own audio.
|
| Without the original training set this still isn't open
| source. But you aren't powerless to modify the model without
| the original training set.
| pabs3 wrote:
| The Debian deep learning team's machine learning policy would
| call this a "toxic candy" model:
|
| https://salsa.debian.org/deeplearning-team/ml-policy
|
| BTW, wouldn't you take the existing model and do additional
| Hokkaido Japanese speaker training on top of it, rather than
| retraining the model from scratch?
| lfmunoz4 wrote:
| [deleted]
| sergiotapia wrote:
| Does this work with multiple speakers?
|
| I want to build a tool that takes a video and generates subtitles
| for it, then I want to index the subtitles and let people search
| for a specific quote to scrub to that part of the video using
| automatically generated urls.
|
| This is for a specific fandom of a ton of content, lots of dirty
| audio mostly recorded in a gym setting with multiple people
| speaking.
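|
| Multiple speakers aren't labelled (there's no diarization), but
| the timestamped segments are enough for the indexing and search
| part. A minimal sketch with placeholder file names, assuming
| ffmpeg is installed so Whisper can read the video's audio track:
|
|     import whisper
|
|     model = whisper.load_model("medium")
|     result = model.transcribe("episode01.mp4")  # placeholder file
|
|     # Build a simple quote -> timestamp index from the segments.
|     index = [(s["start"], s["end"], s["text"].strip()) for s in result["segments"]]
|
|     def search(query):
|         """Return (start, end, text) for segments containing the query."""
|         q = query.lower()
|         return [hit for hit in index if q in hit[2].lower()]
|
|     for start, end, text in search("some quote"):
|         print(f"{start:8.2f}s  {text}")  # start could become a ?t= URL parameter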
| 867-5309 wrote:
| pretty sure such a tool made HN front page a few months ago
| isoprophlex wrote:
| Really incredible to see that their multilingual audio-to-English
| approach is viable. I'm super excited about this, and it's great
| to see OpenAI actually opening something up, for once.
|
| Skimming the codebase I can't immediately see code to do
| additional training.
|
| Being able to fine-tune the model to a specific language or case
| (eg. teach it specifically about some technical topic that might
| not be so prevalent in the current train set) would be majorly
| disruptive to current SOTA in "callcenter analytics" tech.
| Especially when combining Whisper with GPT3.
| samstave wrote:
| AI speech recognition FN scares the heck out of me...
|
| for so many reasons.
|
| But one that really pisses me off is not being able to turn it
| off on the iphone, and the fact that aside from "hidden cameras
| in my airBnB" -- soon we will have to worry about secret
| listening machines EVERYWHERE
| jfoster wrote:
| Also, based on their demo, this model seems like it might have
| comprehension well above the level of a typical human.
|
| Anyway, it's out there now. No way to turn back.
| ma2rten wrote:
| We will see an explosion of AI capabilities in the next couple
| of years. This will have a huge impact on our lives, much of it
| good but some of it also bad.
| samstave wrote:
| "Good" for ensuring you're a compliant consumer - bad if
| you're an individual person
| wongarsu wrote:
| "Secret listening machines everywhere" was a pretty big thing
| in East Germany. It's also the central theme of the movie The
| Lives of Others.
|
| Of course, the ability to scale this more cheaply (throwing
| more compute at it, instead of more people) is somewhat scary,
| but it's not really introducing a new capability. Especially
| since you still have to do something with the transcript. An
| AirBnB landlord who reads the transcript of what you said could
| as well have listened to the recording.
| ALittleLight wrote:
| I think it's a new capability to add good speech to text,
| search, and models that can understand and process text. You
| have microphones recording speech everywhere, models turning
| that speech into easily searchable text, and something like
| GPT-3 reading all the speech and raising red flags for any
| transgressive idea you please.
| samstave wrote:
| Yes, and if you want AI that is searching for "dissenters"
| we shall soon have "speech police" or tickets or some
| format of authoritarian punitive actions powered by this
| zappy42 wrote:
| "John Spartan, you have been fined one credit for
| violation of the Verbal Morality Statute."
| jffry wrote:
| I'd argue that cheap, pervasive, always-on surveillance with
| a backlog of searchable transcriptions is a qualitatively
| different capability.
| samstave wrote:
| Exactly.
|
| We are entering the next era...
|
| The Kurzweil podcast appearance on Lex Fridman is nuts and
| while I love Kurzweil, holy crap even with my dystopian
| outlook he makes it even worse when you listen to even half
| of it...
| samstave wrote:
| Exactly - imagine when we get to the point where,
| regardless of your "crime", your punishment is 'augmented'
| by the " _thing that you said in the past_ " AND when it
| starts to be able to connect to APIs of your
| social/whatever accounts and AI-Auto-Cancel you....
|
| Basically digital assassination.
| gareth_untether wrote:
| I'm thinking of releasing a plugin for Unity that can be
| used to match a phrase to an action. Seeing Whisper is making me
| think I should include a way to use voice and not just text.
| nothrowaways wrote:
| Great project, not so great package name.
| aidenn0 wrote:
| I just threw a random rock MP3 at it, and a first readthrough
| shows no transcription errors; this is quite good.
|
| Now I just want OCR that's even 50% as good as this...
| aidenn0 wrote:
| Ran a few other songs through it and found one obvious
| mistranscription:
|
| "He's the bedroom cosmic rocker" (should be "He's the veteran
| cosmic rocker" in _Veteran Cosmic Rocker_ by The Moody Blues)
|
| I also noticed that it's a little on the conservative side for
| detecting speech; all songs were missing at least part of one
| line.
| aidenn0 wrote:
| Ran it on _Juicy_ by The Notorious B.I.G. and the results were
| considerably worse than with the mix of prog-rock and British
| invasion music I had tried before, though at least some of
| that is due to the number of proper nouns in that song.
|
| It took about 1000 CPU-minutes for this 5 minute song on my
| Ryzen 2700 with 12 OpenMP threads (about 100 minutes wall-
| clock).
| antegamisou wrote:
| Here's the output of `whisper never-gonna-give-you-up.mp3
| --language English --model small`:
|
|     [00:00.000 --> 00:27.000] We're no strangers to love You know the rules and so do I
|     [00:27.000 --> 00:35.000] I feel commitments while I'm thinking of You wouldn't get this from any other guy
|     [00:35.000 --> 00:43.000] I just wanna tell you how I'm feeling Gotta make you understand
|     [00:43.000 --> 00:47.000] Never gonna give you up Never gonna let you down
|     [00:47.000 --> 00:53.000] Never gonna run around and desert you Never gonna make you cry
|     [01:00.000 --> 01:09.000] We've known each other for so long Your heart's been aching but you're too shy to say
|     [01:09.000 --> 01:17.000] Inside we both know what's been going on We know the game and we're gonna play it
|
| It was running for quite a long time (20 minutes) on my
| admittedly low-budget specs.
|
| Note that I did not omit 00:53.000 -> 01:00.000.
|
| Shouldn't there be some type of unintelligible warning
| since it wasn't able to transcribe that part?
| aidenn0 wrote:
| Model small is about as good at recognizing lyrics as an
| untrained Newton was at recognizing handwriting.
|
| Here's a comparison of _Basket Case_ by Green Day:
|
| Small:
|
|     [00:00.000 --> 00:05.000] Do you have the time to listen to me whine
|     [00:05.000 --> 00:10.000] About nothing and everything I'll have once?
|     [00:11.000 --> 00:16.000] I am one of those melodramatic fools
|     [00:16.000 --> 00:20.000] Neurotic to the bone, no doubt about it
|     [00:23.000 --> 00:27.000] Sometimes I give myself the creeps
|     [00:27.000 --> 00:32.000] Sometimes my mind plays tricks on me
|     [00:32.000 --> 00:38.000] It all keeps headed up, I think I'm pregnant
|     [00:38.000 --> 00:43.000] And I'm just paranoid, I'm just stuck
|     [00:47.000 --> 00:52.000] I went to a shrink to have a life like my dreams
|     [00:52.000 --> 00:57.000] She says it's like a sex that's bringing me down
|     [00:57.000 --> 01:03.000] I went to a whore, he said my life's a bore
|     [01:03.000 --> 01:08.000] Choked with my widest buzz that's bringing her down
|     [01:10.000 --> 01:14.000] Sometimes I give myself the creeps
|     [01:15.000 --> 01:19.000] Sometimes my mind plays tricks on me
|     [01:19.000 --> 01:25.000] It all keeps headed up, I think I'm pregnant
|     [01:25.000 --> 01:30.000] And I'm just paranoid, I'm just stuck
|     [01:30.000 --> 01:48.000] Grasping to control, it's all I better hold on
|     [02:08.000 --> 02:12.000] Sometimes I give myself the creeps
|     [02:13.000 --> 02:17.000] Sometimes my mind plays tricks on me
|     [02:18.000 --> 02:23.000] It all keeps headed up, I think I'm pregnant
|     [02:23.000 --> 02:30.000] And I'm just paranoid, I'm just stuck
|     [02:53.000 --> 03:13.000] Thanks for watching!
|
| Medium:
|
|     [00:00.000 --> 00:05.000] Do you have the time to listen to me whine
|     [00:05.000 --> 00:10.000] About nothing and everything all at once?
|     [00:11.000 --> 00:16.000] I am one of those melodramatic fools
|     [00:16.000 --> 00:20.000] Neurotic to the bone, no doubt about it
|     [00:23.000 --> 00:27.000] Sometimes I give myself the creeps
|     [00:27.000 --> 00:32.000] Sometimes my mind plays tricks on me
|     [00:33.000 --> 00:36.000] It all keeps adding up
|     [00:36.000 --> 00:39.000] I think I'm cracking up
|     [00:39.000 --> 00:41.000] Am I just paranoid?
|     [00:41.000 --> 00:43.000] Am I just sad?
|     [00:47.000 --> 00:50.000] I went to a shrink
|     [00:50.000 --> 00:53.000] To analyze my dreams
|     [00:53.000 --> 00:58.000] She says it's lack of sex that's bringing me down
|     [00:58.000 --> 01:01.000] I went to a whore
|     [01:01.000 --> 01:04.000] He said my life's a bore
|     [01:04.000 --> 01:09.000] So quit my whining cause it's bringing her down
|     [01:10.000 --> 01:14.000] Sometimes I give myself the creeps
|     [01:16.000 --> 01:20.000] Sometimes my mind plays tricks on me
|     [01:20.000 --> 01:23.000] It all keeps adding up
|     [01:23.000 --> 01:26.000] I think I'm cracking up
|     [01:26.000 --> 01:28.000] Am I just paranoid?
|     [01:28.000 --> 01:30.000] Am I just sad?
|     [01:40.000 --> 01:44.000] Grasping to control
|     [01:44.000 --> 01:50.000] So I better hold on
|     [02:07.000 --> 02:11.000] Sometimes I give myself the creeps
|     [02:11.000 --> 02:16.000] Sometimes my mind plays tricks on me
|     [02:16.000 --> 02:19.000] It all keeps adding up
|     [02:19.000 --> 02:22.000] I think I'm cracking up
|     [02:22.000 --> 02:24.000] Am I just paranoid?
|     [02:24.000 --> 02:52.000] Am I just sad?
|     [02:54.000 --> 02:58.000] Thanks for watching!
| macrolocal wrote:
| For what it's worth, even the large model balks on Easy
| (Aesop Rock), eg.
|
| "Fountainheads spittle sniglets quicker than quidditch
| seekers snatch golden snitches."
|
| becomes
|
| "Stirred up out mids bittles, snicklets, cricket and
| quidditch seekers net golden snitches."
|
| ¯\_(ツ)_/¯
| aidenn0 wrote:
| Large was not obviously better than medium when I tried it.
| My impression was that it tended to fit more to a language
| model than the sounds heard, which corrected some errors
| and introduced some others, but I didn't try a lot of songs
| because large won't run on my GPU.
| sjsdaiuasgdia wrote:
| I was comparing a batch of transcriptions between these models
| and vosk, and noticed that the medium.en model produces some
| weird results compared to the others. I've seen a number of loops
| with one word or a small sequence of words repeating several
| times. It seems more prone to output that reads like nonsense
| than the others.
|
| More troubling is a short audio clip that got a few full
| sentences back, several times the text length that comes back
| from the other models or vosk. The content of the sentences is
| extremely far from the audio content. The best alignment I can
| find is the first word of medium.en's interpretation is somewhat
| phonetically similar to the audio.
|
| The small.en model doesn't show these behaviors, at least in this
| data set.
| nshm wrote:
| The whole value of this model is in the 680,000 hours of training
| data, and to reuse this value you need the large model, not the
| smaller ones. The smaller versions just don't have enough capacity
| to represent the training data properly.
| powera wrote:
| My first take: it is slow.
|
| The "base" model (supposedly 16x faster than the large one) takes
| more than the audiofile playback time on my machine to do
| transcriptions.
| fitznd wrote:
| I'm seeing even worse. On my M1 Max 2021 macbook pro, I tried
| transcribing a 30 minute video file and left it on overnight
| and it was only half way through. I feel like something could
| be wrong with my setup but I'm only using the defaults.
| archibaldJ wrote:
| Is this practical to be used on the "edge" (for voice-control)?
| Would love to know if anyone has a rough idea of how
| fast/slow this would be on an M1 Mac or V100.
| LoveMortuus wrote:
| This could be used to make some really cool RPG games!
| funhighway wrote:
| Would be nice to give more details about the provenance and
| construction of the training data.
| [deleted]
| catfan wrote:
| rlt wrote:
| As a casual observer I get the sense that OpenAI and others are
| very rapidly creating building blocks of something much bigger...
| StevenWaterman wrote:
| That example at the top of the page (speed talking) blew me away.
| He started talking, I was stunned for a minute, then realised
| yes, it really was English, and I just burst out laughing.
|
| That's so, so far beyond the previous state-of-the-art, it's
| absurd.
| NaturalPhallacy wrote:
| It's a micromachines ad from the '80s. He talked like that in
| all of them!
|
| As for speed, to a computer we don't talk very fast, not even
| that guy.
|
| I wonder if it could handle Rap God by Eminem....Let's find
| out!
| dreamer7 wrote:
| Did you find out :D?
| arpankapoor wrote:
| I did! There are a few places it transcribes incorrectly,
| but overall I'm very impressed. Here's the first ~30 seconds:
|
|     [00:00.000 --> 00:09.000] Look, I was going to go easy on you, not to hurt your feelings, but I'm only going to get this one chance.
|     [00:09.000 --> 00:11.000] Something's wrong, I can feel it.
|     [00:11.000 --> 00:17.000] It's just a feeling I've got, like something's about to happen, but I don't know what.
|     [00:17.000 --> 00:21.000] If that means what I think it means, we're in trouble, big trouble.
|     [00:21.000 --> 00:24.000] Had to be as bananas as you say, I'm not taking any chances.
|     [00:24.000 --> 00:26.000] You're just one to die for.
|     [00:26.000 --> 00:32.000] I'm beginning to feel like a rap god, rap god. All my people from the front to the back nod, back nod.
| madacol wrote:
| NaturalPhallacy wrote:
| It was doing it _slowly_, but hadn't got to the insane bit
| when I killed it to try and get it working with CUDA. I had
| to do some digging, and it turns out I need a version of
| pytorch with CUDA enabled, so I had to go and install
| Anaconda, and now conda is stuck trying to "solve" my
| environment to install pytorch with CUDA.
|
| So...probably?
|
| Pre-post edit: I can't get it to work.
|
| I've installed pytorch with cuda via pip3, installed the
| nVidia toolkit and it doesn't see it:
|
|     >>> import torch
|     >>> torch.cuda.is_available()
|     False
|
| I've wasted like an hour and a half on it now. I'm not a
| python dev, and don't have any ML experience so this was
| just for fun and now it's not anymore.
| mlboss wrote:
| Try running pytorch/pytorch docker. But you will need
| nvidia container runtime installed. I am sure somebody
| will soon release docker for this also.
| forgingahead wrote:
| Welcome to every single Python ML project - dependency
| hell will quickly kill any enthusiasm one may have for
| trying out projects. It really feels archaic to have
| these issues with such cutting edge technology.
| MayeulC wrote:
| You can blame CUDA quite a bit for that. Proprietary, you
| need to sort out which driver you need, plus an nvidia
| GPU...
|
| I tried compiling pytorch with vulkan support, but there
| are a few LDFLAGS that are wrong. I'll try to solve that
| some time later.
|
| One piece of advice: use distribution packages! Arch
| provides pytorch-cuda, and has PKGBUILDS as well.
|
| For reproducibility, I wish we were all on Nix/Guix, but
| that's not the case (and CUDA+HW dependency would make it
| complicated).
| forgingahead wrote:
| CUDA is not the problem, the problem is crappy code being
| released on Github where basic things like
| requirements.txt are missing, never mind an earnest
| attempt to provide details about the environment that the
| code was running on. This is on top of code that has lots
| of hard-coded references to files and directories, plus
| also many python libraries just breaking compatibility
| with each other on point releases.
|
| I can't find a source now, but I remember reading some
| code where the maintainer had to change a huge chunk of
| code because the point change for a dependency library
| literally flipped either how the library handled
| height/width or BGR channels (I can't remember which one
| but it was preposterous) from the 2.5.4 to the 2.5.5
| version. There is no reason for doing that - it breaks
| everything just for grins and giggles.
|
| Python itself is also a problem, but that's a rant for
| another day. Ah, how I wish Ruby had become the de facto
| language of choice for ML/Deep Learning!
| catfan wrote:
| londons_explore wrote:
| @dang Can we change the link to the github here[1]?
|
| It seems to describe the project better for a technical audience.
|
| [1]: https://github.com/openai/whisper
| toss1 wrote:
| Like every model I've seen there is something like this:
|
| >>A decoder is trained to predict the corresponding text...
|
| Prediction of expected text in the context of the previous text.
|
| While this is valuable in casual transcription, it can be
| extremely dangerous in serious contexts.
|
| From personal experience, having given a deposition with an "AI"
| transcription, it will literally reverse the meanings of
| sentences.
|
| This is because it produces the _EXPECTED_ output in a context,
| and _NOT THE ACTUAL OUTPUT_.
|
| Like a speaker that clips the output, these types of systems
| 'clip' the really valuable information out of a transcription.
| Worse yet, this is a completely silent failure, as the transcript
| _LOOKS_ really good.
|
| Basic info theory shows that there is more information contained
| in 'surprising' chunks of data than in expected ones. These
| systems actively work to substitute 'expected' speech to
| overwrite 'surprising' speech.
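|
| To put a number on that: the usual surprisal measure is
| I(x) = -log2 p(x), so the words a language model already
| expects carry the fewest bits, and they are exactly the ones
| a prediction-driven decoder loses least by guessing. A tiny
| illustration (the probabilities are made up):
|
|     import math
|
|     def surprisal_bits(p):
|         """Information content, in bits, of an event with probability p."""
|         return -math.log2(p)
|
|     print(surprisal_bits(0.9))    # expected word:   ~0.15 bits
|     print(surprisal_bits(0.001))  # surprising word: ~9.97 bits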
|
| The transcript I got was utter trash, multiple pages of errata I
| had to submit when the normal is a couple of lines. And as I
| said, some literally reversed the meaning in a consequential way,
| and yet completely silently.
|
| This kind of silent active failure mode is terrifying. Unless it
| is solved, and I see no way to solve it without removing ALL
| predictive algos from the system, these types of systems must not
| be used in any situation of serious consequence, at least not
| without real redundancy and backup.
| lunixbochs wrote:
| Do you have a demo audio clip for this? I'd be interested to
| see how it looks in practice.
| toss1 wrote:
| Sorry, I don't have anything available.
|
| One item I remember was that I said "Dr Kemeny" in relation
| to Dartmouth College (he was a famous mathematician, invented
| the BASIC programming language and was president of the
| college). It replaced those instances with "Jack Kennedy".
|
| In another instance, I said that "Evidently, you have a
| reading comprehension problem.". It replaced it with
| "Evidently, I have a ...", completely reversing the meaning.
|
| There were zero problems with the microphones or audio, and it
| was not rushed or mumbled talk. There were 80+ other examples
| over a few hours of talking, and some from other speakers.
| And those were just the obvious ones I could catch.
|
| Another massive problem with this technology is that a human
| stenographer can notice when s/he missed something and didn't
| hear and ask the speaker to repeat or clarify what was said,
| and will often during a pause request clarification on
| spelling of names, addresses, etc. In contrast, this "AI"
| technology just barges ahead ASSuming that it knows what it
| is doing and inserts literally whatever sounds good in the
| transcript, completely silent that it doesn't have a clue.
|
| Having seen this up close, I'm of the strong opinion that
| anyone foisting this software on the market without huge
| warnings that this is not usable for any critical functions
| is, basically, a fraud. They know or certainly should know
| that these failures not only exist but are common and
| systemic, yet they barge along like it is OK. It is not.
| Tomis02 wrote:
| I've been saying this for years. Current "AI" algorithms are
| fundamentally flawed because they rely on a statistical
| approach. This works moderately well for some use cases but it
| will rarely give you 100% confidence. Good luck with self-
| flying planes or self-running nuclear power plants.
| toss1 wrote:
| >>Current "AI" algorithms are fundamentally flawed because
| they rely on a statistical approach.
|
| YES! The old joke about "Artificial Stupidity" is actually
| more true than anyone realized.
|
| These statistical so-called-AI systems actually work to
| actively REMOVE or sanitize out any unexpected information,
| making it all conform with the EXPECTED results from the
| training set.
|
| This not only REMOVES the most high-information 'surprising'
| or unexpected nuggets, it actively HIDES them. When something
| unexpected comes up, it gets force fit into the expected
| prediction algorithms and output as if it were good.
|
| I'm not saying that there are no useful things that can be
| done with this technology -- there is a LOT of mundane work
| out there to be done.
|
| But, we will never get this type of "AI" saying "Huh, that's
| odd, I wonder why that is?", which is exactly the kind of
| observation that leads a prepared and fertile mind to great
| discoveries.
| sowbug wrote:
| I knew there was a reason why I kept my MP3 library even after
| subscribing to Spotify. Now piping everything through whisper. So
| far the generated lyrics are reasonable, though it thinks the REM
| song says "Linnie Bruce is not afraid."
|
| No surprise that it appears to have successfully transcribed all
| the recordings of Harvard Sentences I could find.
| https://en.wikipedia.org/wiki/Harvard_sentences
| Havoc wrote:
| This could be really cool for Mycroft/Rhasspy etc.
| sva_ wrote:
| It seems like Stability AI's release has led to some real disruption
| in the ML field regarding open source, and this doesn't seem to
| be limited to image generation. Excited to see what comes next.
| jasan_s wrote:
| jasan_s wrote:
| hijp wrote:
| Anyone get it running on m1 mac?
|
| I keep getting `ModuleNotFoundError: No module named
| 'setuptools.command.build'`
| simmanian wrote:
| I got it working inside a docker container on my M1 MBP. FWIW,
| I'm having my $180 tinyminimicro PC run a translation task
| while my M1 MBP runs a transcription task with the same audio
| input. So far, the PC is actually outputting results a lot
| faster than the MBP. Interesting results.
| kif wrote:
| I got requirements installed, but then when running the Python
| example, I get:
|
| RuntimeError: "slow_conv2d_cpu" not implemented for 'Half'
| kif wrote:
| Probably need to pass some kind of options when initializing.
| The command itself works fine, just shows a warning:
| warnings.warn("FP16 is not supported on CPU; using FP32
| instead")
| mewse-hn wrote:
| using this in the sample code worked for me:
|
| >>> options = whisper.DecodingOptions(fp16=False)
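|
| For context, that option slots into the lower-level Python
| example kif mentioned; a sketch of the whole flow on CPU (the
| file name is a placeholder):
|
|     import whisper
|
|     model = whisper.load_model("base")
|
|     # Load ~30 seconds of audio and compute the log-Mel spectrogram.
|     audio = whisper.pad_or_trim(whisper.load_audio("audio.mp3"))
|     mel = whisper.log_mel_spectrogram(audio).to(model.device)
|
|     # fp16=False avoids the 'slow_conv2d_cpu not implemented for Half' error on CPU.
|     options = whisper.DecodingOptions(fp16=False)
|     result = whisper.decode(model, mel, options)
|     print(result.text)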
| dceddia wrote:
| Yep, I had this too. `pip3 install -U pip setuptools` took care
| of it. (If you get an error about pip3, try `pip` instead)
| hijp wrote:
| I'm really new to pip, but does this look ok?
|
| (after running the command for setuptools)
|
|     Defaulting to user installation because normal site-packages is not writeable
|     Requirement already satisfied: pip in /Users/xxx/Library/Python/3.9/lib/python/site-packages (22.2.2)
|     Requirement already satisfied: setuptools in /Users/xxx/Library/Python/3.9/lib/python/site-packages (65.3.0)
|
| ---- after trying whisper installation:
|
|     x Getting requirements to build wheel did not run successfully.
|     | exit code: 1
|     +-> [20 lines of output]
|         Traceback (most recent call last):
|           File "/Users/xxx/Library/Python/3.9/lib/python/site-packages/pip/_vendor/pep517/in_process/_in_process.py", line 363, in <module>
|             main()
|           File "/Users/xxx/Library/Python/3.9/lib/python/site-packages/pip/_vendor/pep517/in_process/_in_process.py", line 345, in main
|             json_out['return_val'] = hook(*hook_input['kwargs'])
|           File "/Users/xxx/Library/Python/3.9/lib/python/site-packages/pip/_vendor/pep517/in_process/_in_process.py", line 130, in get_requires_for_build_wheel
|             return hook(config_settings)
|           File "/Library/Developer/CommandLineTools/Library/Frameworks/Python3.framework/Versions/3.9/lib/python3.9/site-packages/setuptools/build_meta.py", line 154, in get_requires_for_build_wheel
|             return self._get_build_requires(
|           File "/Library/Developer/CommandLineTools/Library/Frameworks/Python3.framework/Versions/3.9/lib/python3.9/site-packages/setuptools/build_meta.py", line 135, in _get_build_requires
|             self.run_setup()
|           File "/Library/Developer/CommandLineTools/Library/Frameworks/Python3.framework/Versions/3.9/lib/python3.9/site-packages/setuptools/build_meta.py", line 150, in run_setup
|             exec(compile(code, __file__, 'exec'), locals())
|           File "setup.py", line 2, in <module>
|             from setuptools_rust import Binding, RustExtension
|           File "/private/var/folders/lj/7x6d3dxd3cbdtt484k6xsmyh0000gn/T/pip-build-env-ieaydl8r/overlay/lib/python3.9/site-packages/setuptools_rust/__init__.py", line 1, in <module>
|             from .build import build_rust
|           File "/private/var/folders/lj/7x6d3dxd3cbdtt484k6xsmyh0000gn/T/pip-build-env-ieaydl8r/overlay/lib/python3.9/site-packages/setuptools_rust/build.py", line 23, in <module>
|             from setuptools.command.build import build as CommandBuild # type: ignore[import]
|         ModuleNotFoundError: No module named 'setuptools.command.build'
|         [end of output]
|
| note: This error originates from a subprocess, and is likely not a problem with pip.
|
| error: subprocess-exited-with-error
| dceddia wrote:
| Nope, that doesn't look good! I honestly just googled the
| error and installing setuptools fixed it for me, but I
| barely know anything about the Python ecosystem so I'm
| really just fumbling around here.
| hijp wrote:
| haha same, thanks
| mvexel wrote:
| Not quite sure if this is related, but since there's a
| bunch of statements in there referencing rust: I had to
| install the rust compiler on my Mac (`brew install rust` if
| you use homebrew). This is not mentioned in the
| installation instructions.
| Smaug123 wrote:
| I'm still not successfully using the GPU, but it's working
| decently quickly (with the base model - it's incredibly slow to
| use the Large model) using just the CPU. I'm going to have to
| check what magic stable-diffusion is doing to enable the GPU :(
| dceddia wrote:
| There's a --device flag you can pass. I've been trying to get
| `--device cuda` to work on my Windows machine and it's saying
| that torch wasn't compiled with CUDA. Trying to figure out
| what's going on there.
|
| And on the M1, supposedly PyTorch has support for hardware
| acceleration using MPS (Metal Performance Shaders, announced
| here https://pytorch.org/blog/introducing-accelerated-
| pytorch-tra...) but when I tried `--device mps` it blew up
| with an error "input types 'tensor<1x1280x3000xf16>' and
| 'tensor<1xf32>' are not broadcast compatible".
| magicalhippo wrote:
| > I've been trying to get `--device cuda` to work on my
| Windows machine and it's saying that torch wasn't compiled
| with CUDA.
|
| I struggled with the same. Here's what worked for me:
|
| Use pip to uninstall pytorch first, should be "pip
| uninstall torch" or similar.
|
| Find the CUDA version you got installed[1]. Go to PyTorch
| get started page[2] and use their guide/wizard to generate
| the pip string, and run that. I had to change pip3 to pip
| FWIW, and with Cuda 11.6 installed I ended up with "pip
| install torch torchvision torchaudio --extra-index-url
| https://download.pytorch.org/whl/cu116".
|
| After that I could use --device cuda, and the difference
| was immense. On my 2080Ti it went from roughly an hour for
| a minute with large model, to 10-20 seconds.
|
| [1]: https://stackoverflow.com/a/55717476
|
| [2]: https://pytorch.org/get-started/locally/
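|
| After installing the CUDA-enabled wheel, a quick way to confirm
| PyTorch actually sees the GPU before pointing the finger at
| Whisper (the model size and file name below are just placeholders):
|
|     import torch
|     import whisper
|
|     # Should print True once the CUDA build of torch is installed.
|     print(torch.cuda.is_available())
|
|     device = "cuda" if torch.cuda.is_available() else "cpu"
|     model = whisper.load_model("medium", device=device)
|     print(model.transcribe("audio.mp3")["text"])  # placeholder file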
| Smaug123 wrote:
| Yep, same for me, on M1 after enabling MPS (with
| `model.to("mps")`) it just either SIGSEGV or SIGABRTs every
| time with that line. The extremely unclean nature of the
| abort is making it hard to debug :(
| dceddia wrote:
| I noticed the size seems to correspond to the model. With
| a large model, the error is tensor<1x1280x3000xf16>. With
| tiny, it's tensor<1x384x3000xf16>, and with medium it's
| tensor<1x1024x3000xf16>. It also seems like a bad thing
| that those are f16's but the "expected" data is f32.
| Smaug123 wrote:
| I'm giving up for the night, but
| https://github.com/Smaug123/whisper/pull/1/files at least
| contains the setup instructions that may help others get
| to this point. Got it working on the GPU, but it's...
| much much slower than the CPU? Presumably due to the
| 'aten::repeat_interleave.self_int' CPU fallback.
|
| Also hitting a nice little PyTorch bug:
|
| > File "/Users/patrick/Documents/GitHub/whisper/whisper/d
| ecoding.py", line 388, in apply logits[:,
| self.tokenizer.encode(" ") + [self.tokenizer.eot]] =
| -np.inf
|
| > RuntimeError: dst_.nbytes() >= dst_byte_offset INTERNAL
| ASSERT FAILED at "/Users/runner/work/pytorch/pytorch/pyto
| rch/aten/src/ATen/native/mps/operations/Copy.mm":200,
| please report a bug to PyTorch.
| faizsn wrote:
| Faizan
| nik_s wrote:
| I just tested the model [1] using an RTX3090, trying to translate
| a french text I found here [2].
|
| Some observations:
|
| - The full translation of the 6:22 minute video takes about 22
| seconds (17x real time)
|
| - It recognizes the language by default (and correctly
| identified the audio as French)
|
| - MIT License [3]!
|
| - The quality of the transcription is good, but not perfect.
|
| - The quality of the translation (if you don't consider
| transcription errors as a translation error) is generally very
| good.
|
| ---
|
| The transcription:
|
| > Bonjour a tous, <error>j'suis</error> espere que vous allez
| bien, c''est ENTI. Et aujourd', <error>aujourd',</error> on se
| retrouve <error>un peu physique</error> pour parler de la termo
| dynamique. Vous ne vous inquietez pas, ca va bien se passer. On
| va y aller ensemble, <error>etre a par exemple,</error> je vous
| accompagne a travers une serie de videos pour vous expliquer les
| principes de base en termo dynamique. Et bah, c''est parti, on va
| y aller tranquillement. Lidee, c''est vous puissiez comprendre la
| termo dynamique dans son ensemble. Donc, je vais vraiment prendre
| mon temps pour <error>couplisser</error> bien comprendre les
| notions,
|
| The translation:
|
| > Hello everyone, I hope you're doing well, it's NT and today we
| find ourselves a little physical to talk about the thermo
| dynamic. Don't worry, it's going well, we're going to go together
| and be the same. I'm going to accompany you through a series of
| videos to explain the basic principles in thermo dynamic. Well,
| let's go, <error>we're going to go quietly</error>. The idea is
| that you can understand the thermo dynamic <error>in sound
| together</error>. So I'm really going to take my time to
| understand the notions,
|
| ---
|
| All in all very happy that OpenAI is publishing their models. If
| Stable Diffusion is any guide, people will hack some crazy things
| with this.
|
| [1] https://github.com/openai/whisper [2]
| https://www.youtube.com/watch?v=OFLt-KL0K7Y [3]
| https://github.com/openai/whisper/blob/main/LICENSE
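|
| For anyone who wants to reproduce this, the Python API makes
| the transcribe-vs-translate choice a single option (a rough
| sketch; the file name and model size are just examples):
|
|       import whisper
|
|       model = whisper.load_model("large", device="cuda")
|       # task="transcribe" keeps the source language;
|       # task="translate" produces English output instead.
|       result = model.transcribe("lecture_fr.mp3", task="translate")
|       print(result["language"])  # detected language, e.g. "fr"
|       print(result["text"])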
| seszett wrote:
| > _dans son ensemble_
|
| > _in sound together_
|
| That's hilarious and honestly, incredibly bad. "Dans son
| ensemble" is a very common idiom (meaning "as a whole") while
| "in sound together" has to be pretty rare. "Son" means
| "his/hers/its" as well as "sound", and the former meaning is
| probably more common in general so I have no idea how this
| result could arise.
|
| "Termo" also doesn't exist in French, it's "thermo", so the
| transcript even makes orthographic errors.
|
| And I forgot about "couplisser" which is also a hilarious made-
| up word that sounds like it could mean something, but doesn't!
| _Edit_ Google finds exactly one reference of this, in a patent
| with a typo on the word "coulisser".
|
| I'm still impressed by the transcript quality since it covers
| many languages, but the translation part is quite poor.
| StevenWaterman wrote:
| Was this with the `base` model? `large` is running ok on a P100
| in colab, but is about 4% the speed of `base.en`. Certainly
| seems like some of these models will be fast enough for real-
| time.
| NaturalPhallacy wrote:
| How did you get it to use the GPU?
|
| I have it running right now and it's not touching the GPU.
| ramblerman wrote:
| --device "cuda"
| NaturalPhallacy wrote:
| My version of pytorch didn't have CUDA. I had to install
| conda to get it, and now it's currently installing.
|
| The default version that `pip install
| git+https://github.com/openai/whisper.git` grabbed didn't
| include CUDA support.
| joshcryer wrote:
| It also runs well on a CPU and seems to have proper memory
| management. Wonderful timing because I was using DeepSpeech for
| some audio recordings and it required me to script up a
| splitter to make the files into .wav and then do snippets of 10
| seconds each. Everything about this just works out of the box.
| On a Core i5 I'm getting through about 30 seconds of audio
| every minute.
| Transcriptionist jobs just turned into editor jobs. I love how
| it drops the inflections in the audio as well, because it was
| trained on transcription work, and that is one of the first
| things you learn to do (drop the uhs and ums and huhs etc,
| unless it is a strict verbatim transcription).
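|
| For contrast, this is roughly the pre-processing step Whisper
| makes unnecessary (a sketch of the splitter idea using ffmpeg;
| the function and file names are made up):
|
|       import subprocess
|
|       def split_for_deepspeech(src, out_prefix="chunk"):
|           # Convert to 16 kHz mono WAV, cut into 10 s segments.
|           subprocess.run([
|               "ffmpeg", "-i", src,
|               "-ac", "1", "-ar", "16000",
|               "-f", "segment", "-segment_time", "10",
|               f"{out_prefix}_%03d.wav",
|           ], check=True)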
| solarmist wrote:
| Is it translation or transcription? Or both?
|
| Both, wow. This is really interesting.
| StevenWaterman wrote:
| Both, the blog covers it in detail. Pass in audio in any
| language, and get an English transcription out.
| nik_s wrote:
| It can do both - I've edited my original post to show the
| translation task.
| gok wrote:
| Comparing this model's word error rates to the state of the art
| [1] on a few common test sets:
|                             Whisper    SoTA
|   LibriSpeech test-clean    2.7%       1.8%
|   LibriSpeech test-other    5.6%       2.9%
|   Switchboard               13.1%      4.9%
|   CallHome                  15.8%      9.5%
|
| The authors do explicitly state that they're trying to do a lot
| of fancy new stuff here, like being multilingual, rather than
| pursuing accuracy alone.
|
| [1] https://github.com/syhw/wer_are_we
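|
| For anyone unfamiliar with the metric: word error rate is the
| word-level edit distance divided by the length of the reference.
| A minimal sketch:
|
|       def wer(reference: str, hypothesis: str) -> float:
|           ref, hyp = reference.split(), hypothesis.split()
|           # dp[i][j] = edit distance between ref[:i] and hyp[:j]
|           dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
|           for i in range(len(ref) + 1):
|               dp[i][0] = i
|           for j in range(len(hyp) + 1):
|               dp[0][j] = j
|           for i in range(1, len(ref) + 1):
|               for j in range(1, len(hyp) + 1):
|                   cost = 0 if ref[i - 1] == hyp[j - 1] else 1
|                   dp[i][j] = min(dp[i - 1][j] + 1,     # deletion
|                                  dp[i][j - 1] + 1,     # insertion
|                                  dp[i - 1][j - 1] + cost)
|           return dp[-1][-1] / len(ref)
|
|       print(wer("the cat sat on the mat", "the cat sat on mat"))
|
| (One deletion out of six reference words ~= 17% WER.)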
| lunixbochs wrote:
| I suspect Whisper is more robust than other "SOTA" models, but
| this release is likely leaving a fair bit of accuracy on the
| table considering the amount of resources OpenAI is capable of
| throwing at training it.
|
| Comparing the readily available test sets from the paper to
| some of my personal robust models (for the Talon models, this
| is greedy decoding, no language model):
|                       Talon    Talon    Talon    Whisper  wav2vec 2.0
|                       28M      300M     1B       Large    960h
|   librispeech clean   3.21     2.52     2.40     2.7      2.7
|   librispeech other   8.21     6.56     5.63     5.6      6.2
|   common voice        13.88    11.65    8.86     9.5      29.9
|   tedlium             7.51     6.55     5.47     4.0      10.5
|
| I have a battery of more difficult tests on hand (including
| adversarial tests, and diverse accent-specific metrics). I'll
| look at running these tests on each of the Whisper model sizes
| and following up with a larger comparison.
| allanrbo wrote:
| Talon was the first thing that came to my mind when I saw
| this news. Would be nice if it could benefit from Whisper.
| (Big fan of your work on Talon!)
| ma2rten wrote:
| I'm looking forward to your comparison. It's really hard to
| make sense of how good this model actually is without being
| an expert in the area.
| nshm wrote:
| It is interesting that they compare with wav2vec2 instead of
| NeMo Conformer (which is more accurate) in Table 2.
| StevenWaterman wrote:
| One of the things they point out is that the SoTA on e.g.
| LibriSpeech is _only_ good at LibriSpeech, and doesn't
| generalise as well.
|
| > Because Whisper was trained on a large and diverse dataset
| and was not fine-tuned to any specific one, it does not beat
| models that specialize in LibriSpeech performance, a famously
| competitive benchmark in speech recognition. However, when we
| measure Whisper's zero-shot performance across many diverse
| datasets we find it is much more robust and makes 50% fewer
| errors than those models.
| lunixbochs wrote:
| My own experience agrees: the generally available "SOTA"
| models are not especially robust, and can be _extremely_ bad
| (>50% absolute error rate) at some tasks. I'll post some
| preliminary numbers in a sibling comment and look into
| running my full set of tests on Whisper.
|
| It looks like Whisper is probably leaving a lot of accuracy
| on the table, but initially it does seem to be a lot more
| robust than general "SOTA" models.
|
| For a quick comparison, Silero's accuracy charts are kind of
| nice because they post results for a large variety of
| datasets. Scroll down to the EN V6 xlarge EE model (not the
| xlarge CE) [1]
|
| [1] https://github.com/snakers4/silero-models/wiki/Quality-
| Bench...
| wodenokoto wrote:
| Is it also a translation model? All the example transcripts are
| in English, regardless of the language of the purportedly
| transcribed audio.
|
| The description makes it sound like it is a model for
| transcribing English audio.
|
| > We've trained and are open-sourcing a neural net called Whisper
| that approaches human level robustness and accuracy on English
| speech recognition.
| michelb wrote:
| Quite a high error rate on very clearly spoken Dutch audio, but
| way better than anything else I have tried.
| LanternLight83 wrote:
| Hoping to see this put to use in open source voice assistants,
| e.g. Mycroft
| sn41 wrote:
| Most of the comments here are about law enforcement. I would like
| to point out that it might be a boon for dictation software. This
| may make it easier to dictate text/code etc. in any environment.
| liminalsunset wrote:
| I really wish I had this about half a year ago when I was
| building a tool to automatically turn online school lectures into
| searchable, clickable transcripts (kind of like YouTube or EdX
| transcripts).
|
| I was originally using Adobe Premiere Pro's speech to text to do
| it, and wrote Python to convert its output to the Hyperaudio
| format on GitHub. With this, I can totally skip that whole step,
| and it's fully open source, too.
|
| App idea:
|
| Build an app that takes a video and uses Hyperaudio or a similar
| project to add a clickable and searchable transcript (clicking in
| transcript seeks video)
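|
| Whisper's output already includes per-segment timestamps, so the
| core of that app is pretty small. A rough sketch (the file names
| and HTML are just illustrative):
|
|       import whisper
|
|       model = whisper.load_model("base")
|       result = model.transcribe("lecture.mp4")
|
|       rows = []
|       for seg in result["segments"]:
|           mins, secs = divmod(seg["start"], 60)
|           # Clicking a line seeks the page's <video> element to
|           # the start of that segment.
|           rows.append(
|               '<p onclick="document.querySelector(\'video\')'
|               '.currentTime=%.2f">[%02d:%05.2f] %s</p>'
|               % (seg["start"], mins, secs, seg["text"].strip())
|           )
|       with open("transcript.html", "w") as f:
|           f.write("\n".join(rows))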
| resoluteteeth wrote:
| You could already do fully open source speech recognition
| easily with Vosk, although Whisper may be more accurate.
| txtai wrote:
| Check out this notebook for an example on how to run Whisper as a
| txtai pipeline in Python or as an API service:
| https://colab.research.google.com/github/neuml/txtai/blob/ma...
| z3t4 wrote:
| Why not make a demo that you can try out via
| navigator.mediaDevices.getUserMedia? Of course you will get good
| results if you demo using the training set.
| synergy20 wrote:
| Is there a high-quality text-to-speech equivalent project like
| this?
| bergenty wrote:
| Seriously, when I first landed on the page without reading
| anything else I thought it was text to speech with the "Micro
| Machines" example and I was floored. The speech to text is
| obviously mind-blowing too.
| throwamon wrote:
| Is it feasible to use this for Talon-like voice-driven computer
| usage?
| lunixbochs wrote:
| If the Whisper models provide any benefits over the existing
| Talon models, and if it's possible to achieve any kind of
| reasonable interactive performance, I will likely integrate
| Whisper models into Talon.
|
| Talon's speech engine backend is modular, with Dragon, Vosk,
| the WebSpeech API, and Talon's own engine all used in different
| ways by users.
| FloatArtifact wrote:
| Maybe. A number of speech recognition engines have been
| integrated into https://github.com/dictation-toolbox/dragonfly
| dubeye wrote:
| I know a manual transcription company, which is still seeing
| modest growth from existing clients who also use ASR, so it's not
| quite there yet
| londons_explore wrote:
| I wonder how much the 30 second window is impacting performance?
|
| Anecdotally, I feel like there are plenty of times that I need
| context from more than 30 seconds ago to understand some
| technical jargon that's being discussed.
| pmarreck wrote:
| So it's 100% better than Siri's speech dictation, I see
| eugenhotaj wrote:
| Now someone just needs to pipe the output into Stable Diffusion.
| chrisstanchak wrote:
| Hold on to your papers
| smusamashah wrote:
| How well does it do for technical and domain-oriented speech? For
| example I have audio recordings of a senior explaining some very
| technical aspects of our software. Will it understand the
| technical terms in that speech?
|
| I guess I will need to download it and run it to see how correct
| it is.
| emcq wrote:
| Be wary of using this model - its licensing seems sketchy.
| Several of the datasets used for training, like WSJ and
| TED-LIUM, have clear non-commercial clauses. I'm not a lawyer,
| but releasing a model as "MIT" seems dubious, and hopefully
| OpenAI paid for the appropriate licenses during training, as
| they are no longer a research-only non-profit.
| nshm wrote:
| I think they didn't use WSJ for training, only for evaluation.
| The paper includes WSJ under "Evaluation datasets".
| jefftk wrote:
| This is a big dispute right now: OpenAI and other AI companies
| generally take the position that models learning from data does
| not make the output of the models a derivative work of that
| data. For example, GitHub Co-pilot uses all publicly available
| GitHub code regardless of license, and
| DALLE-2/StableDiffusion/etc use lots of non-free images. I
| don't think this has been challenged in court yet, and I'm very
| curious to see what happens when it is.
| petercooper wrote:
| I think it might be even less problematic with something like
| Whisper than with DALLE/SD? Merely consuming data to train a
| system or create an index is not usually contrary to the law
| (otherwise Google wouldn't exist) - it's the _publication_ of
| copyrighted content that's thorny (and is something you can
| begin to achieve with results from visual models that include
| the Getty Images logo, etc.)
|
| I think it'd be a lot harder to make a case for an accurate
| audio to text transcription being seen to violate the
| copyright of any of the training material in the way a visual
| could.
| jefftk wrote:
| They're not just training a system but publishing the
| trained system.
| emcq wrote:
| This is even slightly more direct: access to WSJ data
| requires paying LDC for the download, and the pricing varies
| depending on what institution / license you're from. The cost
| may be a drop in the bucket compared to compute, but I don't
| know that these licenses are transferable to the end product.
| We might be a couple court cases away from finding out but I
| wouldn't want to be inviting one of those cases :)
| bscphil wrote:
| > models learning from data does not make the output of the
| models a derivative work of that data
|
| Most of the debate seems to be happening on the question of
| whether _everything_ produced by models trained on
| copyrighted work represents a derivative work. I argue that
| at the very least _some_ of it does; so the claim said to be
| made by the AI companies (see quote above) is clearly a false
| one.
|
| We're in a weird place now where AI is able to generate "near
| verbatim" work in a lot of cases, but I don't see an obvious
| case for treating this any differently than a human
| reproducing IP with slight modifications. (I am not a
| lawyer.)
|
| For example, copyright law currently prevents you from
| selling a T-shirt with the character Spider-Man on it. But
| plenty of AI models can give you excellent depictions of
| Spider-Man that you could put on a T-shirt and try to sell.
| It's quite silly to think that any judge is going to take you
| seriously when you argue that your model, which was trained
| on a dataset that included pictures of Spider-Man, and was
| then asked to output images using "Spider-Man" as a search
| term, has magically circumvented copyright law.
|
| (I think there's a valid question about whether models
| represent "derivative work" in the GPL sense specifically,
| but I'm using the idea more generally here.)
| jefftk wrote:
| That's right: the model is definitely capable of creating
| things that are clearly a derivative work of what they were
| trained on. But this still leaves a few questions:
|
| * Does the model require a copyright license? Personally I
| think it's very likely a derivative work, but that doesn't
| necessarily mean you need a license. The standard way this
| works in the US is the four factors of fair use
| (https://copyright.columbia.edu/basics/fair-use.html) where
| Factor 1 is strongly in favor of the model being
| unrestricted while 2-4 are somewhat against (and in some
| cases 4 is strongly against).
|
| * Is all output from the model a derivative work of all of
| the input? I think this is pretty likely no, but unclear.
|
| * Does the model reliably only emit derivative works of
| specific inputs when the user is trying to get it to do
| that? Probably no, which makes using one of these models
| risky.
|
| (Not a lawyer)
| zeagle wrote:
| It would be exceptional to get a healthy competitor to
| Microsoft/Nuance's Dragon monopoly on voice recognition in
| healthcare. At a couple thousand bucks a license, and with the
| more recent SaaS subscription trend, there is a lot of money to
| be made in that space.
| darkpicnic wrote:
| I just wrote a script with Hazel to automatically transcribe my
| voice notes to txt. It handles punctuation extremely well. What a
| wonderful contribution!
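|
| A minimal version of that kind of script (model size and
| argument handling are just one way to do it; Hazel passes the
| new file's path as the first argument):
|
|       import pathlib
|       import sys
|
|       import whisper
|
|       model = whisper.load_model("small")
|       audio = pathlib.Path(sys.argv[1])
|       result = model.transcribe(str(audio))
|       audio.with_suffix(".txt").write_text(result["text"].strip())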
| pbassham wrote:
| Exactly what I was planning to do! Want to share yours with me?
| braindead_in wrote:
| Why build a separate model when you can integrate it right into
| GPT?
| harry8 wrote:
| Can you plug this into a computer on your premises to get speech
| recognition without Amazon, Apple or Google's cloud (or any other
| cloud) involvement?
|
| Right now I decline all speech recognition because I don't want
| Orwellian listening devices in my house or pocket and haven't
| seen an answer. (Also, I haven't cared enough about speech
| command interfaces to bother with a load of research - lazy me.)
| fragmede wrote:
| Yes, after downloading the model weights (from
| https://openaipublic.azureedge.net/) it's an entirely offline
| process.
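|
| Once the weights are cached (under ~/.cache/whisper by default,
| as far as I can tell), a run like this needs no network at all:
|
|       import whisper
|
|       model = whisper.load_model("base")  # read from local cache
|       print(model.transcribe("meeting.wav")["text"])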
| abidlabs wrote:
| Here [1] is a video tutorial on building a web UI that accepts
| microphone input and runs it through Whisper for speech
| transcription.
|
| [1]
| https://www.youtube.com/watch?v=ywIyc8l1K1Q&ab_channel=1litt...
| amrrs wrote:
| Thank you for sharing!
___________________________________________________________________
(page generated 2022-09-22 23:02 UTC)