[HN Gopher] StyleTTS2 - open-source Eleven-Labs-quality Text To ...
___________________________________________________________________
StyleTTS2 - open-source Eleven-Labs-quality Text To Speech
Author : sandslides
Score : 354 points
Date : 2023-11-19 17:40 UTC (5 hours ago)
(HTM) web link (github.com)
(TXT) w3m dump (github.com)
| sandslides wrote:
| Just tried the Colab notebooks. Seems to be very good quality.
| It also supports voice cloning.
| fullstackchris wrote:
| Great stuff, took a look through the README but... what are the
| minimum hardware requirements to run this? Is this gonna blow
| up my CPU / hard drive?
| sandslides wrote:
| Not sure. The only inference demos are Colab notebooks. The
| models are approx. 700 MB each, so I imagine it will run on a
| modest GPU.
| bbbruno222 wrote:
| Would it run in a cheap non-GPU server?
| dmw_ng wrote:
| Seems to run at about "2x realtime" on a 2015 4-core i7-6700HQ
| laptop, that is, 5 seconds to generate 10 seconds of
| output. I can imagine that being 4x or greater on a real
| machine.
| thot_experiment wrote:
| I skimmed the github but didn't see any info on this, how long
| does it take to finetune to a particular voice?
| progbits wrote:
| > MIT license
|
| > Before using these models, you agree to [...]
|
| No, this is not MIT. If you don't like the MIT license then feel
| free to use something else, but you can't pretend this is open
| source and then slap additional restrictions on how the code
| can be used.
| sandslides wrote:
| Yes, I noticed that. Doesn't seem right, does it?
| weego wrote:
| I think you mis-parsed the disclaimer. It's just warning people
| that cloned voices come with a different set of rights than the
| software (because the person whose voice is cloned has rights
| to their voice).
| chrismorgan wrote:
| (Don't let's derail the conversation, please, but
| "disclaimer" is completely the wrong word here. This is a
| condition of use. A disclaimer is "this isn't mine" or "I'm
| not responsible for this". Disclaimers and disclosures are
| quite different things and commonly confused, but this isn't
| even either of them.)
| gosub100 wrote:
| This always annoys me when people put "disclaimers" on
| their posts. IANAL, so tired of hearing that one. It's
| pointless because even if you _were_ a lawyer, you cannot
| meaningfully comment on a case without the details,
| jurisdiction, circumstances, etc. Next, it's meaningless
| because is anyone going to blindly bow down and obey if you
| state the opposite? "Yes, I AM a lawyer, you do not need to
| pay taxes, they are unconstitutional." Thirdly, when they
| "disclaim" themselves as working at Google, that's not a
| _dis_-claimer, that's a "claimer", asserting the
| affirmative. I know their companies require them not to
| speak for the company without permission, but I hardly ever
| hear that one; usually it's just some useless self-
| disclosure that they might be biased because they work
| there. Ok, who isn't biased?
|
| What bugs me overall is that it's usually vapid mimicry of
| a phrase they don't even understand.
| nielsole wrote:
| IANAL, but giving legal advice without being a lawyer may
| be illegal in some jurisdictions. Not sure if the
| disclaimer is effective or was ever tested in court. The
| disclaimer/disclosure mix-up is super annoying, but
| disclosing obvious biases even if not legally required
| seems like good practice to me.
| gpm wrote:
| As I understand it the source code is licensed MIT, the weights
| are licensed "weird proprietary license that doesn't explicitly
| grant you any rights and implicitly probably grants you some
| usage rights so long as you tell the listeners or have
| permission from the voice you cloned".
|
| Which, if you think the weights are copyrightable in the first
| place, makes them practically unusable for anything
| commercial/that you might get sued over, because relying on a
| vague implicit license is definitely not a good idea.
| ronsor wrote:
| And if you don't think weights are copyrightable, it means
| nothing at all.
| IshKebab wrote:
| I think that's referring to the pre-trained models, not the
| source code.
| ericra wrote:
| This bothered me as well. I opened an issue on the repo asking
| them to consider updating the license file to reflect these
| additional requirements.
|
| The wording they currently use suggests that this additional
| license requirement applies not only to their pre-trained
| models but to the source code as well.
| pdntspa wrote:
| As if anyone outside of corporate legal actually cares
| mlsu wrote:
| We're now at "free, local, AI friend that you can have
| conversations with on consumer hardware" territory.
|
| - synthesize an avatar using stablediffusion
|
| - synthesize conversation with llama
|
| - synthesize the voice with this TTS
|
| soon
|
| - VR
|
| - Video
|
| wild times!
| jpeter wrote:
| Which consumer GPU runs Llama 70B?
| sroussey wrote:
| Prosumer gear.
|
| MacBook Pro M3 Max.
| mlsu wrote:
| A Mac with a lot of unified RAM can do it, or a dual
| 3090/4090 setup gets you 48 GB of VRAM.
| jadbox wrote:
| Does this actually work? I had thought that you can't use
| SLI to increase your net memory for the model?
| speedgoose wrote:
| It works. I use ollama these days, with litellm for the
| API compatibility, and it seems to use both 24 GB GPUs on
| the server.
| benjaminwootton wrote:
| I've got a 64 GB Mac M2. All of the openllm models seem to
| hang on startup or on API calls. I got them working through
| GCP colab. Not sure if it's a configuration issue or if the
| hardware just isn't up to it?
| benreesman wrote:
| Valiant et al work great on my 64 GB Studio at Q4_K_M.
| Happy to answer questions.
| wahnfrieden wrote:
| Try llama.cpp with Metal (critical) and GGUF models from
| TheBloke
|
| Or wait another month or so for https://ChatOnMac.com
| brucethemoose2 wrote:
| A single 3090, or any 24 GB GPU. Just barely.
|
| Yi 34B is a much better fit. I can cram 75K context onto 24 GB
| without brutalizing the model with <3bpw quantization, like
| you have to do with 70B for 4K context.
| speedgoose wrote:
| Can it produce any meaningful outputs with such an extreme
| quantisation?
| brucethemoose2 wrote:
| Yeah, quite good actually, especially if you quantize it
| on text close to what you are trying to output.
|
| Llama 70B is a huge compromise at 2.65bpw... This does
| make the model much "dumber." Yi 34B is much better, as
| you can quantize it at ~4bpw and still have a huge
| context.
| lossolo wrote:
| How would you compare mistral-7b-instruct at fp16 (or a
| similar 7b/13b model like llama2 etc) to Yi-34b
| quantized?
| Hamcha wrote:
| Yup, and you can already mix and match both local and cloud AIs
| with stuff like SillyTavern/RealmPlay if you wanna try what the
| experience is like, people have been using it to roleplay for a
| while.
| cloudking wrote:
| Would be great to have a local home assistant voice interface
| with this + llama + whisper.
| trafficante wrote:
| Seems like a fun afternoon project to get this hooked into one
| of the Skyrim TTS mods. I previously messed around with
| elevenlabs, but it had too much latency and would be somewhat
| expensive long term so I'm excited to try local and free.
|
| I'm sure I have a lot of reading up to do first, but is it a
| safe assumption that I'd be better served running this on an
| M2 MBP rather than taxing my desktop's poor 3070 running it
| on top of Skyrim VR?
| godelski wrote:
| Why name it Style<anything> if it isn't a StyleGAN? Looks like
| the first one wasn't either. Interesting to see moves away from
| flows, especially when none of the flows were modern.
|
| Also, is no one clicking on the audio links? There are some...
| questionable ones... and I'm pretty sure lots of mistakes.
| gwern wrote:
| > Looks like the first one wasn't either.
|
| The first one says it uses AdaIN layers to help control style?
| https://arxiv.org/pdf/2205.15439.pdf#page=2 Seems as
| justifiable as the original StyleGAN calling itself StyleX...
| godelski wrote:
| See my other comment. StyleGAN isn't about AdaIN. StyleGAN2
| even modified it.
| lhl wrote:
| It's not called a GAN TTS, right? StyleGAN is called what it is
| because of a "style-based" approach, and StyleTTS/2 seems to be
| doing the same (applying style transfer) through a different
| method (and disentangling style from the rest of the voice
| synthesis).
|
| (Actually, I looked at the original StyleTTS paper, and it
| even partially uses AdaIN in the decoder, which is the same
| way that StyleGAN injected style information? Still, I think
| that is beside the point for the naming.)
| godelski wrote:
| Yeah no I get this, but the naming convention has become so
| prolific that anyone working in the generative space hears
| "Style<thing>" and thinks "GAN". (I work in generative
| vision btw)
|
| My point is not that it is technically right, it is that the
| name is strongly associated with the concept now. Such that
| if you use a style-based network that isn't a GAN and name it
| StyleX, it's odd and might look like you're trying to claim
| you've done more. Not that there aren't plenty of GANs that
| are using Karras's code and called something else.
|
| > AdaIN
|
| Yes, StyleGAN (version 1) uses AdaIN but StyleGAN2 (and
| beyond) doesn't. AdaIN stands for Adaptive Instance
| Normalization. While they use it in that network, to be
| clear, they did not invent AdaIN, and the technique isn't
| specific to style; it's a normalization technique. One that
| StyleGAN2 modifies because the standard one creates strong
| and localized spikes in the statistics, which results in
| image artifacts.
| lhl wrote:
| So what I'm hearing is... no one should use "style" in its
| name anymore to describe style transfers because it's too
| closely associated with a set of models in a sub-field that
| uses a different concept to apply style that used "style"
| in its name, unless it also uses that unrelated concept in
| its implementation? Is that the gist of it? Because that
| sounds a bit mental.
|
| (I'm half kidding, I get what you mean, but also, think
| about it. The alternative is worse.)
| api wrote:
| It should be pretty easy to make training data for TTS. The
| Whisper STT models are open so just chop up a ton of audio and
| use Whisper to annotate it, then train the other direction to
| produce audio from text. So you're basically inverting Whisper.
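| A rough sketch of that annotation pass (assuming the
| open-source openai-whisper package; the paths and the
| LJSpeech-style metadata format are just illustrative choices):
|
|     import whisper  # pip install openai-whisper
|     from pathlib import Path
|
|     model = whisper.load_model("base")
|
|     # Transcribe each clip to build (audio, text) pairs, i.e.
|     # run Whisper "backwards" as a labeler for TTS training data.
|     with open("metadata.csv", "w", encoding="utf-8") as f:
|         for clip in sorted(Path("clips").glob("*.wav")):
|             text = model.transcribe(str(clip))["text"].strip()
|             f.write(f"{clip.stem}|{text}\n")  # filename|transcript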
| eginhard wrote:
| STT training data includes all kinds of "noisy" speech so that
| the model learns to recognise speech in any conditions. TTS
| training data needs to be as clean as possible so that you
| don't introduce artefacts in the output and this high-quality
| data is much harder to get. A simple inversion is not really
| feasible or at least requires filtering out much of the data.
| satvikpendem wrote:
| Funnily enough, the TTS2 examples sound _better_ than the ground
| truth [0]. For example, the "Then leaving the corpse within the
| house [...]" example has the ground truth pronounce "house"
| weirdly, with some change in the tonality that sounds higher, but
| the TTS2 version sounds more natural.
|
| I'm excited to use this for all my ePub files, many of which
| don't have corresponding audiobooks, such as a lot of Japanese
| light novels. I am currently using Moon+ Reader on Android which
| has TTS but it is very robotic.
|
| [0] https://styletts2.github.io/
| risho wrote:
| how are you planning on using this with epubs? i'm in a similar
| boat. would really like to leverage something like this for
| ebooks.
| satvikpendem wrote:
| I wonder if you can add a TTS engine to Android as an app or
| plugin, then make Moon+ Reader or another reader use that
| custom engine. That's probably the easiest approach, but if
| that doesn't work, I might just have to make my own app.
| a_wild_dandan wrote:
| I'm planning on making a self-hosted solution where you can
| upload files and the host sends back the audio to play, as
| a first pass on this tech. I'll open source the repo after
| fiddling and prototyping. I've needed this kind of thing
| for a long time!
| risho wrote:
| Please make sure to link it back to HN so that we can
| check it out!
| jrpear wrote:
| You can! [rhvoice](https://rhvoice.org/) is an open source
| example.
| KolmogorovComp wrote:
| The pace is better, but imho there is still a very
| noticeable "metallic" tone which makes it inferior to the
| real thing.
|
| Impressive results nonetheless, and superior to all other TTS.
| lhl wrote:
| I tested StyleTTS2 last month; here are my step-by-step
| notes, which might be useful for people doing local setup
| (not too hard):
| https://llm-tracker.info/books/howto-guides/page/styletts-2
|
| Also I did a little speed/quality shootout with the LJSpeech
| model (vs VITS and XTTS). StyleTTS2 was pretty good and very
| fast: https://fediverse.randomfoo.net/notice/AaOgprU715gcT5GrZ2
| kelseyfrog wrote:
| > inferences at up to 15-95X (!) RT on my 4090
|
| That's incredible!
|
| Are infill and outpainting equivalents possible? Super-RT TTS
| at this level of quality opens up a diverse array of uses esp
| for indie/experimental gamedev that I'm excited for.
| refulgentis wrote:
| Not sure what you mean: If you mean could inpainting and
| outpainting with image models be faster, it's a "not even
| wrong" question, similar to asking if the United Airlines app
| could get faster because American Airlines did. (Yes, getting
| faster is an option available to ~all code)
|
| If you mean could you inpaint and outpaint text... yes, by
| inserting and deleting characters.
|
| If you mean could you use an existing voice clip to generate
| speech by the same speaker in the clip, yes, part of the
| article is demonstrating generating speech by speakers not
| seen at training time
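| Concretely, that zero-shot flow looks roughly like the sketch
| below. compute_style and inference mirror helper functions
| defined in the repo's LibriTTS Colab demo, so treat the exact
| names and arguments as illustrative, not a stable API:
|
|     import soundfile as sf
|
|     # compute_style() embeds a short reference clip into a
|     # style vector; inference() then synthesizes arbitrary
|     # text in that style (speaker unseen at training time).
|     ref_style = compute_style("reference_speaker.wav")
|     wav = inference(
|         "Speech for a speaker we never trained on.",
|         ref_style,
|         diffusion_steps=10,    # more steps = more varied prosody
|         embedding_scale=1.0,   # classifier-free guidance scale
|     )
|     sf.write("out.wav", wav, 24000)  # LibriTTS model is 24 kHz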
| pedrovhb wrote:
| I'm not sure I understand what you mean to say. To me it's
| a reasonable question asking whether text to speech models
| can complete a missing part of some existing speech audio,
| or make it go on for longer, rather than only generating
| speech from scratch. I don't see a connection to your
| faster apps analogy.
|
| Fwiw, I imagine this is possible, at least to some extent.
| I was recently playing with xtts and it can generate
| speaker embeddings from short periods of speech, so you
| could use those to provide a logical continuation to
| existing audio. However, I'm not sure it's yet possible to
| manage the "seams" between what is generated and what is
| preexisting very easily.
|
| It's certainly not a misguided question to me. Perhaps you
| could be less curt and offer your domain knowledge to
| contribute to the discussion?
|
| Edit: I see you've edited your post to be more informative,
| thanks for sharing more of your thoughts.
| refulgentis wrote:
| It imposes a cost on others when you make false claims,
| like that I said or felt the question was unreasonable.
|
| I didn't and don't.
|
| It is a hard question to understand and an interesting
| mind-bender to answer.
|
| Less policing of the metacontext and more focusing on the
| discussion at hand will help ensure there are interlocutors
| around to, at the very least, continue policing.
| kelseyfrog wrote:
| Ignore the speed comment; it is unrelated to my question.
|
| What I mean is, can output be conditioned on antecedent
| audio as well as text, analogous to how image diffusion
| models can condition inpainting and outpainting on static
| parts of an image and CLIP embeddings?
| refulgentis wrote:
| Yes, the paper and Eleven Labs have a major feature of
| "given $AUDIO_SET, generate speech for $TEXT in the same
| style of $AUDIO_SET"
|
| No, in that, you can't cut it at an arbitrary midword
| point, say at "what tim" in "what time is it beijing", and
| give it the string "what time is it in beijing", and have
| it recover seamlessly.
|
| Yes, in that, you can cut it at an arbitrary phoneme
| boundary, say 'this, I.S. a; good: test! ok?' in IPA is
| 'd'Is, ,aI,es'eI; g'Ud: t'est! ,oUk'eI?', and I can cut
| it 'between' phonemes, give it the rest, and have it
| complete.
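| (For the curious: StyleTTS2's frontend produces exactly that
| kind of IPA string with the phonemizer package and espeak-ng,
| which is what makes phoneme-boundary splicing natural. A quick
| way to inspect the phoneme string, assuming both are
| installed:)
|
|     from phonemizer import phonemize
|
|     # Map text to the IPA phoneme string the model consumes;
|     # completions can be spliced at any phoneme boundary.
|     ipa = phonemize(
|         "this, I.S. a; good: test! ok?",
|         language="en-us",
|         backend="espeak",
|         preserve_punctuation=True,
|         with_stress=True,
|     )
|     print(ipa)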
| kelseyfrog wrote:
| Perfect! Thank you
| huac wrote:
| It is theoretically possible to train a model that, given
| some speech, attempts to continue the speech, e.g. Spectron:
| https://michelleramanovich.github.io/spectron/spectron/.
| Similarly, it is possible to train a model to edit the
| content, a la Voicebox:
| https://voicebox.metademolab.com/edit.html.
| jasonjmcghee wrote:
| I've been playing with XTTSv2 on my 3080 Ti, and it's slightly
| faster than the length of the final audio. It's also good
| quality, but these samples sound better.
|
| Excited to try it out!
|
| Excited to try it out!
| gjm11 wrote:
| HN title at present is "StyleTTS2 - open-source Eleven Labs
| quality Text To Speech". Actual title at the far end doesn't name
| any particular other product; arXiv paper linked from there
| doesn't mention Eleven Labs either. I thought this sort of
| editorializing was frowned on.
| stevenhuang wrote:
| Eleven Labs is the gold standard for voice synthesis. There is
| nothing better out there.
|
| So it is extremely notable for an open source system to be able
| to approach this level of quality, which is why I'd imagine
| most would appreciate the comparison. I know it caught my
| attention.
| lucubratory wrote:
| OpenAI's TTS is better than Eleven Labs, but they don't let
| you train it to have a particular voice out of fear of the
| consequences.
| huac wrote:
| I concur that, for the use cases that OpenAI's voices
| cover, it is significantly better than Eleven.
| GaggiX wrote:
| Yes, it's against the guidelines. In fact, when I read the
| title, I didn't think it was a new research paper but a random
| GitHub project.
| modeless wrote:
| It is editorializing and it is an exaggeration. However, I've
| been using StyleTTS2 myself, and IMO it is the best open
| source TTS by far; it definitely deserves a spot at the top
| of HN for a while.
| stevenhuang wrote:
| I really want to try this but making a fresh venv to install
| all the torch dependencies is starting to get old lol.
|
| How are other people dealing with this? Is there an easy way
| to get multiple venvs to share something like a common torch
| install? I can do this manually but I'm wondering if there's
| a tool out there that does this.
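| (One built-in half-measure: a venv created with
| system-site-packages falls back to the parent interpreter's
| packages, so a single big torch install can be shared across
| projects. The sketch below is the stdlib equivalent of
| python3 -m venv --system-site-packages venv; the caveat is
| that every project then has to tolerate the same torch/CUDA
| build.)
|
|     import venv
|
|     # Create a project venv that can see the parent
|     # interpreter's site-packages (e.g. one shared torch
|     # install). Anything pip-installed inside the venv still
|     # shadows the shared copy.
|     venv.create("venv", system_site_packages=True, with_pip=True)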
| wczekalski wrote:
| I use nix to set up the python env (python version + poetry +
| sometimes python packages that are difficult to install with
| poetry) and use poetry for the rest.
|
| The workflow is:
|
|     > nix flake init -t github:dialohq/flake-templates#python
|     > nix develop -c $SHELL
|     > # I'm in the shell with the poetry env; I have a shell
|     > # hook in the nix devenv that does poetry install and
|     > # poetry activate.
| lukasga wrote:
| Can relate to this problem a lot. I have considered starting
| to use a Docker dev container and making a base image for
| shared dependencies, which I can then customize in a
| dockerfile for each new project; not sure if there's a better
| alternative though.
| eurekin wrote:
| Same here. I'm using conda and eyeing simply installing
| pytorch into the base conda env.
| lhl wrote:
| I don't think "base" works like that (while it can be a
| fallback for some dependencies, afaik, Python packages are
| isolated/not in path). But even if you could, don't do it.
| Different packages usually have different pytorch
| dependencies (often CUDA as well) and it will definitely bite
| you.
|
| The biggest optimization I've found is to use mamba for
| everything. It's ridiculously faster than conda for package
| resolution. With everything cached, you're mostly just
| waiting for your SSD at that point.
|
| (I suppose you _could_ add the base env's lib path to the
| end of your PYTHONPATH, but that sounds like a sure way to
| get bitten by weird dependency/reproducibility issues down
| the line.)
| stavros wrote:
| I generally try to use Docker for this stuff, but yeah, it's
| the main reason why I pass on these, even though I've been
| looking for something like this. It's just too hard to figure
| out the dependencies.
| victorbjorklund wrote:
| This only works for English voices, right?
| e12e wrote:
| No? From the readme:
|
|     In Utils folder, there are three pre-trained models:
|
|     ASR folder: It contains the pre-trained text aligner,
|     which was pre-trained on English (LibriTTS), Japanese
|     (JVS), and Chinese (AiShell) corpus. It works well for
|     most other languages without fine-tuning, but you can
|     always train your own text aligner with the code here:
|     yl4579/AuxiliaryASR.
|
|     JDC folder: It contains the pre-trained pitch extractor,
|     which was pre-trained on English (LibriTTS) corpus only.
|     However, it works well for other languages too because F0
|     is independent of language. If you want to train on
|     singing corpus, it is recommended to train a new pitch
|     extractor with the code here: yl4579/PitchExtractor.
|
|     PLBERT folder: It contains the pre-trained PL-BERT model,
|     which was pre-trained on English (Wikipedia) corpus only.
|     It probably does not work very well on other languages,
|     so you will need to train a different PL-BERT for
|     different languages using the repo here: yl4579/PL-BERT.
|     You can also replace this module with other phoneme BERT
|     models like XPhoneBERT which is pre-trained on more than
|     100 languages.
| modeless wrote:
| Those are just parts of the system and don't make a complete
| TTS. In theory you could train a complete StyleTTS2 for other
| languages but currently the pretrained models are English
| only.
| svapnil wrote:
| How fast is inference with this model?
|
| For reference, I'm using 11Labs to synthesize short messages -
| maybe a sentence or something, using voice cloning, and I'm
| getting it at around 400 - 500ms response times.
|
| Is there any open-source solution that gets me to around the
| same inference time?
| wczekalski wrote:
| It depends on hardware but IIRC on V100s it took 0.01-0.03s for
| 1s of audio.
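| That works out to a real-time factor (RTF) of 0.01-0.03. It's
| easy to measure on your own hardware with any backend;
| synthesize() below is a stand-in for whatever inference call
| your stack exposes:
|
|     import time
|
|     def realtime_factor(synthesize, text, sample_rate=24000):
|         # Seconds of compute per second of generated audio;
|         # 0.03 means ~33x faster than realtime.
|         start = time.perf_counter()
|         wav = synthesize(text)
|         elapsed = time.perf_counter() - start
|         return elapsed / (len(wav) / sample_rate)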
| eigenvalue wrote:
| Was somewhat annoying to get everything to work as the
| documentation is a bit spotty, but after ~20 minutes it's all
| working well for me on WSL Ubuntu 22.04. Sound quality is very
| good, much better than other open source TTS projects I've seen.
| It's also SUPER fast (at least using a 4090 GPU).
|
| Not sure it's quite up to Eleven Labs quality. But to me, what
| makes Eleven so cool is that they have a large library of high
| quality voices that are easy to choose from. I don't yet see any
| way with this library to get a different voice from the default
| female voice.
|
| Also, the real special sauce for Eleven is the near-instant
| voice cloning with just a single 5-minute sample, which works
| shockingly (even spookily) well. Can't wait to have that all
| available in a fully open source project! The services that
| provide this as an API are just too expensive for many use
| cases. Even the OpenAI one, which is on the cheaper side,
| costs ~10 cents for a couple-thousand-word generation.
| wczekalski wrote:
| Have you tested longer utterances with both ElevenLabs and
| StyleTTS? Short audio synthesis is a ~solved problem in the
| TTS world, but things start falling apart once you want to do
| something like create an audiobook with text to speech.
| wingworks wrote:
| I can say that the paid service from ElevenLabs can do long
| form TTS very well. I used it for a while to convert long
| articles to voice to listen to later instead of reading. It
| works very well. I only stopped because it gets a little
| pricey.
| wczekalski wrote:
| One thing I've seen done for style cloning is a high-quality
| fine-tuned TTS -> RVC pipeline to "enhance" the output. TTS
| for intonation + pronunciation, RVC for voice texture. With
| StyleTTS and this pipeline you should get close to ElevenLabs.
| eigenvalue wrote:
| I suspect they are doing many more things to make it sound
| better. I certainly hope open source solutions can approach
| that level of quality, but so far I've been very
| disappointed.
| sandslides wrote:
| The LibriTTS demo clones unseen speakers from a clip of five
| seconds or so.
| eigenvalue wrote:
| Ah ok, thanks. I tried the other demo.
| eigenvalue wrote:
| I tried it. Sounds absolutely nothing like my voice or my
| wife's voice. I used the same sample files as I used 2 days
| ago on the Eleven Labs website, and they worked flawlessly
| there. So this is very, very far from being close to
| "Eleven Labs quality" when it comes to voice cloning.
| sandslides wrote:
| The speech generated is the best I've heard from an open
| source model. The one test I made didn't make an exact
| clone either but this is still early days. There's likely
| something not quite right. The cloned voice does speak
| without any artifacts or other weirdness that most TTS
| systems suffer from.
| thot_experiment wrote:
| Ah that's disappointing, have you tried
| https://git.ecker.tech/mrq/ai-voice-cloning ? I've had
| decent results with that, but inference is quite slow.
| jsjmch wrote:
| ElevenLabs is based on Tortoise-TTS, which was already
| pre-trained on millions of hours of data, but this one
| was only trained on LibriTTS, which is 500 hours at best.
| If you have seen millions of voices, there are definitely
| going to be some that sound like you. It is just a matter
| of training data, but it is very difficult to have someone
| collect these large amounts of data and train on them.
| eigenvalue wrote:
| To save people some time, this is tested on Ubuntu 22.04
| (google is being annoying about the download link, saying too
| many people have downloaded it in the past 24 hours, but if
| you wait a bit it should work again):
|
|     git clone https://github.com/yl4579/StyleTTS2.git
|     cd StyleTTS2
|     python3 -m venv venv
|     source venv/bin/activate
|     python3 -m pip install --upgrade pip
|     python3 -m pip install wheel
|     pip install -r requirements.txt
|     pip install phonemizer
|     sudo apt-get install -y espeak-ng
|     pip install gdown
|     gdown https://drive.google.com/uc?id=1K3jt1JEbtohBLUA0X75KLw36TW7U1yxq
|     7z x Models.zip
|     rm Models.zip
|     gdown https://drive.google.com/uc?id=1jK_VV3TnGM9dkrIMsdQ_upov8FrIymr7
|     7z x Models.zip
|     rm Models.zip
|     pip install ipykernel pickleshare nltk SoundFile
|     python -c "import nltk; nltk.download('punkt')"
|     pip install --upgrade jupyter ipywidgets librosa
|     python -m ipykernel install --user --name=venv --display-name="Python (venv)"
|     jupyter notebook
|
| Then navigate to /Demo and open either
| `Inference_LJSpeech.ipynb` or `Inference_LibriTTS.ipynb` and
| they should work.
| Evidlo wrote:
| What's a ballpark estimate for inference time on a modern CPU?
| beltsazar wrote:
| If AI renders some jobs obsolete, I suppose the first will be
| audiobook narrators and voice actors.
| washadjeffmad wrote:
| Hardly. Imagine licensing your voice to Amazon so that any
| customer could stream any book narrated in your likeness
| without you having to commit the time to record. You could
| still work as a custom voice artist, all with a "no clone"
| clause if you chose. You could profit from your performance and
| craft in a fraction of the time, focusing as your own agent on
| the management of your assets. Or, you could just keep and
| commit to your day job.
|
| Just imagine hearing the final novel of ASoIaF narrated by Roy
| Dotrice and knowing that a royalty went to his family and
| estate, or if David Attenborough willed the digital likeness of
| his voice and its performance to the BBC for use in nature
| documentaries after his death.
|
| The advent of recorded audio didn't put artists out of
| business, it expanded the industries that relied on them by
| allowing more of them to work. Film and tape didn't put artists
| out of business, it expanded the industries that relied on them
| by allowing more of them to work. Audio digitization and the
| internet didn't put artists out of business; it expanded the
| industries that relied on them by allowing more of them to
| work.
|
| And TTS won't put artists out of business, but it will create
| yet another new market with another niche that people will have
| to figure out how to monetize, even though 98% of the revenues
| will still somehow end up with the distributors.
| nikkwong wrote:
| What you're not considering here is that a large majority of
| this industry is made up of no-name voice actors who have a
| pleasant (but perfectly substitutable) voice, which is now
| something that AI can do perfectly and at a fraction of the
| price.
|
| Sure, celebrities and other well-known figures will have more
| to gain here as they can license out their voice; but the
| majority of voice actors won't be able to capitalize on this.
| So this is actually even more perverse because it again
| creates a system where all assets will accumulate at the top
| and there won't be any distributions for everyone else.
| bongodongobob wrote:
| The point is no one will pay for any of that if you can just
| clone someone's voice locally. Or just tell the AI how you
| want it to sound. Your argument literally ignores the entire
| elephant in the room.
| riquito wrote:
| I can see a future where the label "100% narrated by a human"
| (and similar in other industries) will be a thing
| tomcam wrote:
| Very impressive. It would take me a long time to even guess that
| some of these are text to speech.
| carbocation wrote:
| Curious if we'll see a Civitai-style LoRA[1] marketplace for
| text-to-speech models.
|
| 1 = https://github.com/microsoft/LoRA
| swyx wrote:
| silicon valley is very leaky, eleven labs is widely rumored to
| have raised a huge round recently. great timing because with
| OpenAI's TTS and now this thing the options in the market have
| just expanded greatly.
| readyplayernull wrote:
| Someone please create a TTS with marked-down
| emotions/intonations.
| wg0 wrote:
| The quality is really, really INSANE and pretty much
| unimaginable in the early 2000s.
|
| Could have interesting prospects for games where you have an
| LLM assuming a character and TTS like this giving those NPCs
| a voice.
| abraae wrote:
| This is a big thing for one area I'm interested in - golf
| simulation.
|
| Currently playing in a golf simulator has a bit of a post-
| apocalyptic vibe. The birds are cheeping, the grass is
| rustling, the game play is realistic, but there's not a human
| to be seen. Just so different from the smack talk of a real
| round, or the crowd noise at a big game.
|
| It's begging for some LLM-fuelled banter to be added.
| billylo wrote:
| Or the occasional "Fore!!"s. :-)
| wahnfrieden wrote:
| Is there a way to port this to iOS? Apple doesn't provide an API
| for their version of this.
| ddmma wrote:
| Well done, been waiting for a moment like this. Will give it a
| try!
| zsoltkacsandi wrote:
| Is it possible to somehow optimize the model to run on a
| Raspberry Pi with 4 GB of RAM?
| modeless wrote:
| I made a 100% local voice chatbot using StyleTTS2 and other open
| source pieces (Whisper and OpenHermes2-Mistral-7B). It responds
| _so_ much faster than ChatGPT. You can have a real conversation
| with it instead of the stilted Siri-style interaction you have
| with other voice assistants. Fun to play with!
|
| Anyone who has a Windows gaming PC with a 12 GB Nvidia GPU
| (tested on 3060 12GB) can install and converse with StyleTTS2
| with one click, no fiddling with Python or CUDA needed:
| https://apps.microsoft.com/detail/9NC624PBFGB7
|
| The demo is janky in various ways (requires headphones, has no UI
| to speak of, voice recognition sometimes fails), but it's a sneak
| peek at what will soon be possible to run on a normal gaming PC
| just by putting together open source pieces. The models are
| improving rapidly; there are already several improved models
| I haven't yet incorporated.
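| Stripped of streaming and interruption handling, the core
| loop is small. A sketch (whisper via the openai-whisper
| package; llm_reply and tts_speak are stand-ins for whichever
| local LLM and TTS backends you wire up):
|
|     import whisper  # pip install openai-whisper
|
|     asr = whisper.load_model("base.en")
|
|     def chat_turn(wav_path, history, llm_reply, tts_speak):
|         # One turn: speech -> text -> LLM -> speech.
|         user_text = asr.transcribe(wav_path)["text"].strip()
|         history.append({"role": "user", "content": user_text})
|         reply = llm_reply(history)
|         history.append({"role": "assistant", "content": reply})
|         tts_speak(reply)  # synthesize (e.g. StyleTTS2) and play
|         return history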
| lucubratory wrote:
| How hard does the task of making the chatbot converse
| naturally look on your end? Specifically, I'm thinking about
| interruptions: if it's talking too long, I would like to be
| able to start talking and interrupt it like in a normal
| conversation, or if I'm saying something, it could quickly
| interject. Once you've got the extremely high speed,
| theoretically faster than real time, you can start doing that
| stuff, right?
|
| There is another thing remaining after that for fully natural
| conversation, which is making the AI context aware like a human
| would be. Basically giving it eyes so it can see your face and
| judge body language to know if it's talking too long and needs
| to be more brief, the same way a human talks.
| modeless wrote:
| Yes, I implemented the ability to interrupt the chatbot while
| it is talking. It wasn't too hard, although it does require
| you to wear headphones so the bot doesn't hear itself and get
| interrupted.
|
| The other way around (bot interrupting the user) is hard.
| Currently the bot starts processing a response after every
| word that the voice recognition outputs, to reduce latency.
| When new words come in before the response is ready it starts
| over. If it finishes its response before any more words
| arrive (~1 second usually) it starts speaking. This is not
| ideal because the user might not be done speaking, of course.
| If the user continues speaking the bot will stop and listen.
| But deciding when the user is done speaking (or if the bot
| should interrupt before the user is done) is a hard problem.
| It could possibly be done zero-shot using prompting of a LLM
| but you'd probably need a GPT-4 level LLM to do a good job
| and GPT-4 is too slow for instant response right now. A
| better idea would be to train a turn-taking model that
| predicts who should speak next in conversations. I haven't
| thought much about how to source a dataset and train a model
| for that yet.
|
| Ultimately the end state of this type of system is a complete
| end-to-end audio-to-audio language model. There should be
| only one model, it should take audio directly as input and
| produce audio directly as output. I believe that having TTS
| and voice recognition and language modeling all as separate
| systems will not get us to 100% natural human conversation. I
| think that such a system would be within reach of today's
| hardware too, all you need is the right training
| dataset/procedure and some architecture bits to make it
| efficient.
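| The speculative part above, as a sketch (synchronous for
| brevity; the real thing generates in the background and
| cancels on new input; next_word, generate_reply, and speak
| are stand-ins):
|
|     import time
|
|     SILENCE = 1.0  # seconds with no new words before speaking
|
|     def converse(next_word, generate_reply, speak):
|         words, reply, last = [], None, time.monotonic()
|         while True:
|             w = next_word()  # None if recognizer has nothing new
|             if w is not None:
|                 words.append(w)
|                 last = time.monotonic()
|                 # New input invalidates the speculative reply;
|                 # regenerate from the updated transcript.
|                 reply = generate_reply(" ".join(words))
|             elif reply and time.monotonic() - last > SILENCE:
|                 speak(reply)  # user seems done; say the reply
|                 words, reply = [], None
|             else:
|                 time.sleep(0.05)  # nothing to do yet; poll again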
| causality0 wrote:
| What are the chances this gets packaged into something a little
| more streamlined to use? I have a lot of ebooks I'd love to
| generate audio versions of.
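| The core of that is short once you have an inference call;
| the fiddly parts are epub text extraction and chapter
| handling. A sketch with a stand-in synthesize() (sentence
| splitting via nltk, stitching via numpy/soundfile):
|
|     import nltk
|     import numpy as np
|     import soundfile as sf
|
|     nltk.download("punkt")
|
|     def book_to_wav(text, synthesize, out_path, sr=24000):
|         # Synthesize sentence by sentence and stitch the
|         # pieces together with a short pause between sentences.
|         pause = np.zeros(int(0.25 * sr), dtype=np.float32)
|         pieces = []
|         for sentence in nltk.sent_tokenize(text):
|             pieces.append(np.asarray(synthesize(sentence)))
|             pieces.append(pause)
|         sf.write(out_path, np.concatenate(pieces), sr)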
| carbocation wrote:
| Having now tried it (the linked repo links to pre-built colab
| notebooks):
|
| 1) It does a fantastic job of text-to-speech.
|
| 2) I have had no success in getting any meaningful zero-shot
| voice cloning working. It technically runs and produces a voice,
| but it sounds nothing like the target voice. (This includes
| trying their microphone-based self-voice-cloning option.)
|
| Presumably fine-tuning is needed - but I am curious if anyone had
| better luck with the zero-shot approach.
___________________________________________________________________
(page generated 2023-11-19 23:00 UTC)