[HN Gopher] Deep-learning text-to-speech tool for generating voi...
___________________________________________________________________
Deep-learning text-to-speech tool for generating voices of various
characters
Author : clxxx
Score : 263 points
Date : 2021-01-06 02:36 UTC (20 hours ago)
(HTM) web link (15.ai)
(TXT) w3m dump (15.ai)
| nmfisher wrote:
| From the about section:
|
| > How much does maintaining the servers cost?
|
| > It depends on the amount of traffic, but the minimum baseline is
| around several thousands of US dollars every month. This is
| expected as inference is very GPU intensive and a sufficient
| number of instances need to be spun up to handle thousands of
| requests coming in every minute. Everything is paid out of pocket.
|
| Wow, impressive commitment for something that's free.
| mickof wrote:
| You just sort of assume that this is correct? The person[1]
| running this comes across as a severely unstable character,
| that number is probably hyperbole.
|
| [1] https://twitter.com/fifteenai
| nmfisher wrote:
| I've worked with deep learning models enough to know the cost
| of running GPU inference, and if the live queue stats
| published on the website are accurate, then thousands of
| dollars per month is certainly plausible.
|
| I have no reason to disbelieve it.
| 15ai wrote:
| Not hyperbole - I can provide proof if you'd like.
| nmfisher wrote:
| Separate question - is this English only? It looks like you
| can feed in phonemes but I assume this has been trained
| with English audio.
| hooloovoo_zoo wrote:
| It seems like one could get to those numbers pretty easily
| given the prices for GPU instances on AWS. Even just one
| decent-sized instance would be thousands of dollars per
| month.
| vsupalov wrote:
| Yeah, running anything related to AI involves GPU instances. An
| alternative is to point people to using Google Colab where you
| can get access to a GPU for free, but that's not a smooth end
| user experience for most folks.
| aisofteng wrote:
| > running anything related to AI involves GPU instances
|
| This is not true. A _lot_ of AI applications use algorithms
| such as logistic regression or random forests and don't need
| GPUs - partly, of course, because GPUs are so expensive and
| these approaches are good enough (or more than good enough)
| for many applications.
| vsupalov wrote:
| Whoops, sloppy generalization on my part. You're completely
| right of course, thanks! I've been focusing on deep
| learning a lot lately, to the point where AI has become an
| alias for those exciting new GPU-heavy techniques.
| calebkaiser wrote:
| The price of GPU inference can be brutal, but there's a lot you
| can do on the infra side to improve it:
|
| - Spot instances
|
| - Aggressive autoscaling
|
| - Micro batching
|
| Can reduce inference compute spend by huge amounts (90% is not
| uncommon). ML, especially anything involving realtime
| inference, is an area where effective platform engineering
| makes a ridiculous difference even in the earliest days.
|
| Source: I help maintain open source ML infra for GPU inference
| and think about compute spend way too much
| https://github.com/cortexlabs/cortex
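|
| As a rough illustration of the micro batching idea above, here is a
| minimal Python sketch. It is not taken from Cortex or any other
| library: the function name micro_batch and the parameters max_batch
| and max_wait_ms are hypothetical, the arrival times are made up, and
| they are assumed to be sorted.

```python
from typing import List, Sequence


def micro_batch(
    arrivals_ms: Sequence[float], max_batch: int, max_wait_ms: float
) -> List[List[int]]:
    """Group request indices into batches (hypothetical helper).

    A batch is closed when it already holds max_batch requests, or when
    the next request arrives more than max_wait_ms after the first
    request in the batch. Assumes arrivals_ms is sorted ascending.
    """
    batches: List[List[int]] = []
    current: List[int] = []
    opened_at = 0.0
    for index, arrival in enumerate(arrivals_ms):
        batch_full = len(current) >= max_batch
        waited_too_long = arrival - opened_at > max_wait_ms
        if current and (batch_full or waited_too_long):
            batches.append(current)
            current = []
        if not current:
            opened_at = arrival  # first request opens a new batch window
        current.append(index)
    if current:
        batches.append(current)
    return batches


# Made-up arrival times in milliseconds, for illustration only.
arrivals = [0, 2, 3, 4, 30, 31, 90]
batches = micro_batch(arrivals, max_batch=3, max_wait_ms=10)
print(batches)
```

| With these made-up arrival times, seven requests collapse into four
| batches, so the model would run four forward passes instead of seven.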
| Nican wrote:
| Out of curiosity, as I have no visibility about the infra
| actually required- but at that cost, would it not be easier to
| just have a machine under a desk somewhere?
| calebkaiser wrote:
| Not for the kind of inference running here, I'd imagine.
|
| There are a few key reasons why most realtime inference is done
| on the cloud:
|
| - Scale. Deep learning models especially tend to have poor
| latency, especially as they grow in size. As a result, you
| need to scale up replicas to meet demand at a way lower level
| of traffic than you do for a normal web app. At one point, AI
| Dungeon needed over 700 servers to support just thousands of
| concurrent players.
|
| - Cost. Related to the above, GPUs are really expensive to
| buy. A g4dn.xlarge instance (the most popular AWS EC2
| instance for GPU inference) is $0.526/hour on demand. To hit
| $3,000 per month in spend, you'd need to be running ~8 of
| them 24/7. Prices vary with purchasing GPUs, but you could
| expect 8 NVIDIA T4's to run around $20,000 at minimum, plus
| the cost of other components and maintenance. To be clear,
| that's very conservative--it's unlikely you'll get consistent
| traffic. What's more likely is you'll have some periods of
| very little traffic where you need one or two GPUs, and other
| high load periods where you'll need 10+.
|
| - Less universal of an issue, but the cloud gives you much
| better access to chips at lower switching costs. If NVIDIA
| releases a new GPU that's even better for inference,
| switching to it (once it's available on your cloud) will be a
| tweak in your YAML. If you ever switch to ASICs like AWS's
| Inferentia or GCP's TPUs, which in many cases give way better
| performance and economics than GPUs, you'll also naturally
| have to be on their cloud.
|
| However, there is a lot that can be done to lower the cost of
| inference even in the cloud. I listed some things in a
| comment higher up, but basically, there are some assumptions
| you can make with inference that allow you to optimize pretty
| hard on instance price and autoscaling behavior.
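|
| The cost arithmetic above can be checked with a short Python sketch.
| The hourly rate and the $20,000 hardware figure come from the
| comment; the 30-day month and the helper names monthly_cost and
| instances_for_budget are assumptions made for illustration. The
| break-even figure ignores other components, power, and maintenance,
| so treat it as an optimistic lower bound.

```python
HOURS_PER_MONTH = 24 * 30  # assumption: 30-day month, running 24/7
G4DN_XLARGE_HOURLY = 0.526  # on-demand USD/hour, as quoted in the comment
T4_HARDWARE_COST = 20_000  # USD for 8 NVIDIA T4s, as quoted in the comment


def monthly_cost(hourly_rate: float, instances: int) -> float:
    """Cost in USD of running `instances` machines around the clock."""
    return hourly_rate * instances * HOURS_PER_MONTH


def instances_for_budget(budget: float, hourly_rate: float) -> float:
    """How many always-on instances a monthly budget pays for."""
    return budget / (hourly_rate * HOURS_PER_MONTH)


per_instance = monthly_cost(G4DN_XLARGE_HOURLY, 1)  # about 378.72
instances_needed = instances_for_budget(3000, G4DN_XLARGE_HOURLY)  # about 7.9
eight_instances = monthly_cost(G4DN_XLARGE_HOURLY, 8)  # about 3029.76
breakeven_months = T4_HARDWARE_COST / eight_instances  # about 6.6

print(f"one instance:   ${per_instance:,.2f}/month")
print(f"$3,000 buys:    {instances_needed:.1f} instances")
print(f"eight of them:  ${eight_instances:,.2f}/month")
print(f"GPU break-even: {breakeven_months:.1f} months (hardware only)")
```

| Under these assumptions the "~8 instances" figure holds, and buying
| the GPUs outright only pays off if all eight stay busy for more than
| half a year, which is the case the comment calls unlikely.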
| code51 wrote:
| I'm fearing this will end up with a massive debt on their part.
| Meph504 wrote:
| seriously fuck anyone that is putting in forced time delays on
| their terms, how about you let me read what it is you are doing
| before requiring shit like this.
| duckmysick wrote:
| If you don't agree with the terms, including how they are
| presented to you, you can always reject them and leave the
| site.
| atum47 wrote:
| While you're typing a word, the text box doesn't show it; it only
| appears once the word is complete. Brave, Android.
|
| Besides that, amazing results. Congratulations.
| bravura wrote:
| 15ai, do you mind talking a bit about the methods you are using?
| uberman wrote:
| This was amazing!
| mvts wrote:
| Nice work on the Gordon Freeman Voice :D
| danShumway wrote:
| I don't usually expect much from demos like this, but I'm kind of
| surprised how impressive the results currently are. They're
| definitely not perfect, you're definitely getting some odd
| clipping and noise, but this shows a large amount of promise.
|
| Being able to generate voices for games would enable a lot of
| interesting indie projects. IMO people should be paying more
| attention to the market implications of products like this than to
| the social implications. There are a lot of projects that just
| aren't really feasible right now that could be if this kind of
| technology was more polished and generally available for
| commercial/self-hosted use. And in those cases, you don't even
| need to do inference, makers will likely be willing to mark up
| their scripts themselves.
|
| Anyway I digress. Congrats, this is really cool!
| Pfhreak wrote:
| > people should be paying more attention to the market
| implications of products like this than to the social
| implications.
|
| People will absolutely suffer harm from this tech, but hey,
| think about the dollars that could be made! No, we should
| absolutely be paying more attention to the social implications.
| C19is20 wrote:
| Musicians Union?
| danShumway wrote:
| Eh, this technology currently falls very squarely into the
| category of "almost good enough that I could use it for a
| creative project, but not _nearly_ good enough that you're
| going to be able to convince me that the results aren't
| generated."
|
| I'm not primarily interested about the dollars, I'm
| interested in allowing communities to do creative things. I
| think people are looking at this tech like it's only going to
| be used for deepfakes, and they're underestimating the extent
| it's going to be used to create voice-acted game mods,
| animations, anonymization tools, and other creative/helpful
| projects.
|
| If you're really worried about this stuff though, you can
| take some comfort in the fact that by far the worst examples
| on the site are of real-world voices. This is currently
| technology that as far as I can see is far more suited for
| generating new voices or voicing cartoon characters with
| well-defined patterns/inflections than it is for imitating
| the president.
| Pfhreak wrote:
| You are looking at the current implementation and not
| thinking about the implication.
|
| One, this tech absolutely could be used to fool someone.
| Not everyone will be listening with a critical ear. Played
| back over a phone or injecting a phrase or two in otherwise
| spoken samples will fool many people.
|
| I guarantee you someone will be using this to make their
| own MLP episodes on YouTube specifically designed to scare
| children or get them to do awful things.
|
| Models presumably get better over time. It really won't be
| too much longer until people will be able to fake
| celebrities, politicians, exes, authority figures, etc. As
| a fairly benign example, if I had this in high school you
| better believe I could have called to excuse some of my
| absences.
|
| I agree, I love the idea of generating some decent voice
| lines for my own games projects, but this also introduces
| issues of the rights of the original voice actors.
|
| If you train a model to mimic a performance given by an
| actor, then use that model and fire the actor, isn't that
| potentially really problematic? (Also, it draws parallels
| to the Luddites who were not anti technology, but wanted to
| ensure that technology wasn't used in a way that reduced
| worker quality of life.)
|
| And yes, I think there are helpful ways this could be
| deployed. I'm gender fluid, and I'd love to be able to
| adjust my voice digitally, but we need to be thinking about
| how this could cause harm first.
| visarga wrote:
| I am thinking it could be used to impersonate someone in
| a phone call to a family member for conning.
| danShumway wrote:
| > One, this tech absolutely could be used to fool
| someone.
|
| The problem I have here is that it's already not hard to
| fool people. I don't think it's feasible for us to say
| that we're going to put something that could be highly
| beneficial on hold just because we don't want to deal
| with social education efforts that we kind of already
| need to tackle anyway. Per your example, if we get rid of
| deepfakes, it's not clear to me that Youtube is going to
| be any more safe. I already would not allow a child to
| browse Youtube unattended, people already generate the
| videos you're talking about.
|
| And I know that people are putting this in a different
| category than general CGI, voice modulation, or consumer-
| grade apps like Photoshop. I'm not going to argue that
| it's necessarily wrong for people to be worried, but no
| matter how many times people tell me that this is
| fundamentally different, I still have not seen any
| serious evidence that this technology is going to be more
| dangerous than Photoshop, and I think it's going to be
| way easier to detect than a decent Photoshop job is.
| Photoshop's content-aware paste/fill tools are better
| than this example, and they arguably require less work to
| use.
|
| And again... I'm sympathetic to concerns about moving too
| fast, but I just don't think there's any world, even if
| you could get rid of deepfakes entirely, where we don't
| need to be worried about media literacy and general
| skepticism. If people today don't realize that voices can
| already be convincingly faked, then that's a really
| serious problem, and if democratizing that ability causes
| society in general to become more aware of the potential
| of disinformation, then honestly that might even be a
| good thing that we should be encouraging.
|
| So sure, concerns, but in my mind people are focusing on
| one particular implication that I don't think is
| particularly likely, and ignoring that responding to that
| concern is probably going to look the same no matter what
| our position on deepfakes is.
|
| > If you train a model to mimic a performance given by an
| actor, then use that model and fire the actor, isn't that
| potentially really problematic?
|
| I think that's a very complicated question. I would not
| assume that the loss of work for voice actors, who can
| shift into voice generation roles, is going to be a big
| enough downside that it overrules the upside of allowing
| ordinary people to start generating their own vtube
| avatars or commenting on and building on top of existing
| culture.
| Ajedi32 wrote:
| > If people today don't realize that voices can already
| be convincingly faked, then that's a really serious
| problem, and if democratizing that ability causes society
| in general to become more aware of the potential of
| disinformation, then honestly that might even be a good
| thing that we should be encouraging.
|
| I've wondered about that angle as well. You can't put the
| genie back in the bottle, so maybe the best way to combat
| the threat of deepfaked misinformation is actually to
| take the opposite approach and make it as easy as
| possible for normal people to generate their own
| deepfakes; that way it becomes common knowledge that such
| things are possible (similar to how photoshop is common
| knowledge today).
| Erlich_Bachman wrote:
| > If you train a model to mimic a performance given by an
| actor, then use that model and fire the actor, isn't that
| potentially really problematic?
|
| And if you have to keep getting a person paid for
| something that a machine could do with (assuming, as per
| your post) 100% equal performance, that is not
| problematic? When the voice becomes as good as real
| actors, then yes of course they should become out of a
| job. Just like progress has been going on for thousands
| of years.
| bawolff wrote:
| It really doesn't have to be perfect to trick someone.
| You're expecting this site to be fake so you're listening
| carefully. If you weren't expecting anything and you were
| in the middle of a busy day at work, you are much much less
| likely to notice any discrepancies.
|
| We already have stories like
| https://www.forbes.com/sites/jessedamiani/2019/09/03/a-voice...
|
| That said, as far as harms go, I don't think this is all
| that bad that it should preclude creative uses of this
| technology.
| significant5 wrote:
| I might be misunderstanding you, but there are no real-
| world voices on the site? All of them are of characters.
| danShumway wrote:
| I see a pretty linear drop in quality from Glados to
| Spongebob to Twilight Sparkle to the narrator from
| Stanley Parable to the 10th Doctor.
|
| It seems to struggle more and more as the voices get less
| cartoony/exaggerated.
| significant5 wrote:
| I'm not too sure about that. From my testing, Fluttershy,
| Applejack, Twilight, Chrysalis, Rise, and Kyu (and a
| bunch of other characters that I'm surely forgetting)
| seem to perform phenomenally well. Especially Chrysalis,
| her emotions are extremely believable, and
| Fluttershy/Applejack/Rise/Kyu have almost zero noise for
| every generation. This might be the most impressive site
| I've ever seen.
|
| Oh, I somehow forgot all of the TF2 characters. Some of
| them do struggle (Medic the most, I think) but everyone
| else seems incredibly good.
|
| And the Daria characters, too. Honestly, the vast
| majority of characters are already near-perfect.
| danShumway wrote:
| Hrm. Well, I can't really argue with that beyond that my
| standards on perfect might be different.
|
| I think some of the best voices they have are characters
| like Twilight, she shows a ton of promise. But as it
| stands right now, I would still at least hesitate to use
| Twilight's voice in a project unless I didn't have other
| options. Chrysalis's voice is good, but again, is an
| exaggerated cartoon character with a large amount of
| inflection. I would not use her voice in her current
| state without a lot of post-processing. Someone like the
| Spy I would consider to be unusable, it sounds to me like
| the character needs to clear their throat or something,
| it's got a lot of strange artifacts. I definitely would
| consider the 10th Doctor unusable, even for just a hobby
| project or a voice assistant.
|
| But... I don't know, maybe this is subjective. I can't
| just tell you that what you're hearing is wrong, if you
| like the results then you like the results :)
|
| And again, I don't want to detract from how impressive
| they are. They are incredibly impressive, particularly
| because of how characters like Chrysalis emote. Extremely
| promising. But I still think there's a difference between
| 'impressive' and 'believable deepfake'.
| significant5 wrote:
| Yeah, that's fair. I dunno, I can't really hear anything
| wrong with Fluttershy or Applejack no matter how hard I
| try, but your ears are probably much better than mine :p
|
| I've been seeing quite a few skits being posted on /r/tf2
| (https://www.reddit.com/r/tf2/comments/kr374q/honestly_idk_i_...)
| and all of the voices sound pretty much perfect
| to me. But as you said, it's subjective.
| Ajedi32 wrote:
| I wonder if there are any legal concerns with using the voices
| of well known characters/actors like this in a commercial
| context.
| danShumway wrote:
| I don't _think_ a voice can be copyrighted, but IANAL so you
| shouldn't bank on that.
|
| If a voice could be copyrighted, or if this was a trademark
| issue or something, I strongly suspect that this site would
| _not_ fall under fair use regardless of whether or not it was
| commercial. But again, IANAL, so I don't feel confident
| making any kind of strong claim about that either.
| dragonwriter wrote:
| > I don't think a voice can be copyrighted, but IANAL so
| you shouldn't bank on that.
|
| The audio content (which includes voices) of the source
| work is copyrighted, and a mechanical transform of that
| work (which deep learning to mimic the voices clearly is)
| would seem to be a derivative in at least the literal
| sense.
| thrill wrote:
| IANAL and I would say no. Anyone is free to imitate anyone
| else. A machine doesn't make that different. It would be
| a violation to claim you were someone else while doing
| the imitation.
| Baeocystin wrote:
| The fact that you included Chell as a voice choice (and
| 'generated' a null audio clip to boot) earns a chuckle. The
| quality of the voices across the board earns wide eyes and an
| eyebrow raise. Thanks for sharing this, it's remarkable work.
| high_byte wrote:
| _GLaDOS_ hahahaha this is just... perfect. _Stanley Parable
| Narrator_ funny you should mention this.
| demonictoaster wrote:
| The security implications of this kind of tech are scary. Going
| forward it will become really easy to reproduce the voice of
| anyone! It seems not a lot of training data is required to
| achieve reasonable results (e.g. SpongeBob is just 27min of
| voice, Half Life Black Mesa Announcer is just 1.9min!!). This
| stuff could be easily leveraged for scams and deep fakes (along
| with deep learning models that could also tweak lip movements to
| match the voice for example). Thankfully, there is also a very
| active area of research that leverages similar tech to detect
| deep fakes.
| dschooh wrote:
| These kinds of discussions are common with articles about deep
| fake video and audio. While I do not disagree with your point,
| here are two quick thoughts:
|
| - We have had perfect image manipulation capabilities for quite
| some time now. We have had written text manipulation
| capabilities for hundreds of years.
|
| - People will continue to believe what they believe, whether
| there is deep fake video and audio or not.
| demonictoaster wrote:
| Agree with you. Hopefully people are more and more aware that
| they cannot trust anything out there. We are soon reaching a
| point where we can make anyone say anything we want,
| including in audio and video format.
| spyder wrote:
| It's already happening:
|
| _A Voice Deepfake Was Used To Scam A CEO Out Of $243,000_ :
|
| https://www.forbes.com/sites/jessedamiani/2019/09/03/a-voice...
| vsupalov wrote:
| The results are really impressive. At the moment I'm considering
| spending a low 3-figure amount for a professionally spoken intro
| for a new podcast. Some of the lines I generated are in my top 5
| easily, human speakers don't have a lot of edge for short generic
| blurbs of text anymore it seems.
| SV_BubbleTime wrote:
| Is the author being cute putting Chell from Portal and Freeman
| from Half-life in there, and then there is no audio? It would be
| a weird oversight if not intentional because the author is
| clearly familiar with Valve games.
| trowngon wrote:
| Are there open source projects like this?
| CookieAnon wrote:
| I have CookieTTS where I research lots of experimental stuff.
| (You can see my credits on the 'Thanks' section of 15.ai)
|
| I can get about 90% of the quality of 15.ai currently. I think
| I could surpass 15.ai but not without some help.
| EugeneOZ wrote:
| Please give me a hint how to control the speed - Portal:Wheatley
| is too fast for me.
|
| Amazing toy! Thanks for "download" link, I'm creating a
| collection of GlaDOS phrases now.
| mensetmanusman wrote:
| As Alexa and Siri have improved over the last couple years and
| gotten a more human voice, it has been interesting observing my
| young children (1-4) interact with such devices.
|
| There is definitely a sense of 'who is that' coming from their
| little minds that they are sometimes quite perplexed about. 'It's
| a computer' is starting to feel like a cop-out answer as these
| things improve...
| MartinoPalmitos wrote:
| Half-Life's Gordon Freeman voice is really spot-on!
| kebman wrote:
| Pretty cool! I tried it with this small dialogue, and then edited
| together two voices in Reaper from the downloads:
|
| Bob: "Hello, John."
|
| John: "Oh, hello there, Bob."
|
| Bob: "Yes, hello. It's what I said. Why do you keep repeating
| what I say, John?"
|
| John: "I didn't repeat you! I merely said hello, you dimwit!"
|
| Bob: "There you go, being condescending again. Fuck you!"
|
| John: "What? You're the one who started it!"
|
| Try it yourself, or write something different. Either way, good
| fun!
| twangist wrote:
| I get nothing but "Error code 422: Server error", even on input
| "Hello", in FF, Safari and Chrome.
| durdn wrote:
| You may need to choose a "Source" in the top left. I got the
| same error before choosing a character.
| centimeter wrote:
| This is extremely impressive.
|
| I wonder if this will lead to a resurgence of "moon man" style
| videos with well-known characters rapping extremely offensive
| lyrics.
| [deleted]
| SommaRaikkonen wrote:
| Welp, after messing around with a few voices I was completely
| impressed with Glados's. This is really cool because I have no
| idea how the character's voice was synthesized, but apparently ML
| can do it for me so props to that.
| smrq wrote:
| I'm pretty sure the real Glados voice effect is mostly pitch
| correction and formant shifting. You can do it with Melodyne at
| least (which, to be fair, is also computer magic-- just a
| different kind than this one!)
|
| I just found a video on YT with an example of recreating this
| in Melodyne: https://youtu.be/1oQn66gvwKA
| jsheard wrote:
| If I remember correctly from the Portal developer commentary
| they did use voice synthesis, but only as a precursor.
|
| They used basic text-to-speech to read out the script then
| had the voice actress imitate the weird intonation of the TTS
| reading.
| giantrobot wrote:
| GLaDOS was voiced by a real person [0]. Her voice had some
| effects added but mostly just her trying to sound like a
| computer.
|
| [0] http://ellenmclain.net/
| aksss wrote:
| My favorite is Carl Butananadilewski, but I just ended up
| making him say actual phrases from ATHF in the end. Was hoping
| to see Meatwad as a character option.
| pure-struggle wrote:
| will this be open source eventually?
| pure-struggle wrote:
| https://twitter.com/fifteenai/status/1342304487474606081
|
| found an answer.
|
| "There's no point in releasing a poorly done model, and to do
| so for the sake of popularity would be despicable. My goal is
| to achieve indistinguishability, which I certainly know is
| possible. Anything short of near-perfection is unacceptable. "
| scrollaway wrote:
| Megalomania, always a great excuse.
|
| AI and ML users are massively benefiting from open source but
| too often refuse to release their data. It's like we're back
| in the middle ages and alchemy is back in style.
| hooloovoo_zoo wrote:
| Judging by how the model and site are put together, I think
| this is some software engineer's hobby project. Not wanting
| to spill their secrets doesn't make them a megalomaniac for
| the same reason being a magician doesn't make one a
| megalomaniac.
| scrollaway wrote:
| Except magicians do actually share their secrets; there
| is an active trade around it, conferences, discussions
| and lots of reading material available. The barrier of
| entry is higher than any old open source project but it's
| not inaccessible and comparable to alchemy.
|
| I was talking about ML in general, not just this project.
| See OpenAI and their latest release for example: no
| public product, no trained model. Just alchemy.
| 15ai wrote:
| I'm afraid this tweet is taken out of context. I had written
| this in response to complaints about the release date being
| delayed because I wanted to make sure that the released model
| (that is currently on the site) was the best it could be.
|
| I do plan to compile and publish my findings in the future,
| but nothing is set in stone yet. I know that the model can be
| improved even further, and I'd prefer to be as comprehensive
| as possible.
| whatshisface wrote:
| Releasing a poorly done intermediate result would give either
| competitors or colleagues a leg up in the race, depending on
| whether one sees them as competitors or colleagues.
| suyash wrote:
| fun but what are the legal implications of using these voices for
| projects? Does the license cover the use of these voices?
| Roritharr wrote:
| I'm pretty happy with the results I get. I've toyed around with a
| similar goal, but with the idea of approaching voice actors to
| give them a powerful tool to sell a "low quality" version of
| their voice in bulk. That way an up and coming author could use a
| tool like this and some elbow grease to create an Audiobook with
| famous voices.
| hmate9 wrote:
| It's incredible how little data is required for amazing output!
| Only a couple of minutes of talking needed.
|
| You can find a couple of minutes of talking of anyone, so the
| security implications are huge!
| superasn wrote:
| Really impressive. Do you plan to implement an API like Amazon,
| Google that lets you generate TTS for a price?
| wongarsu wrote:
| I too think that this has potential as a cloud TTS service.
| However that does open up all the moral and legal cans of worms
| around this. I could imagine some of the voice actresses not
| being very happy about somebody else commercializing their
| voice without their consent.
|
| The obvious way to get around this is to keep this as the
| showcase and to pay some people to add their voices to the paid
| version. I imagine this would sell just based on being decent
| TTS with a wide range of voices, even when people don't know
| the voices offered.
| dnsiseuzb wrote:
| How does this compare to wellsaid labs?
| st1x7 wrote:
| You should really see what happens when you click reject on their
| terms and conditions prompt.
| bailey1541 wrote:
| If it doesn't work on mobile why bother sharing?
| clxxx wrote:
| It works on mobile for me. Tried it on both safari and chrome
| on an iPhone running iOS 14.
| rkagerer wrote:
| One of the voice actors is John de Lancie!
|
| https://soundcloud.com/user-860705643/q-pandemic-rant-no-mus...
| junon wrote:
| I'm rarely impressed by demos like this. This is a clear
| exception.
|
| Not only that, but the creator seems cool and down to earth.
| Thanks for sharing, this is incredible work.
___________________________________________________________________
(page generated 2021-01-06 23:04 UTC)