[HN Gopher] SoundStorm: Efficient Parallel Audio Generation
       ___________________________________________________________________
        
       SoundStorm: Efficient Parallel Audio Generation
        
       Author : sh_tomer
       Score  : 194 points
       Date   : 2023-07-16 16:53 UTC (6 hours ago)
        
 (HTM) web link (google-research.github.io)
 (TXT) w3m dump (google-research.github.io)
        
       | binary132 wrote:
       | When people wax eloquent about how the artisans will just find
       | something new to do for work, what they fail to mention is that
       | the new work is often a menial and lower-paid job. When Amazon
       | puts mom and pop shops out of business, they don't go start new
       | businesses, they go get jobs at Wal-Mart.
        
       | qwertox wrote:
       | In CGI there were always milestones that I watched get reached:
       | trees with leaves finally looking close to realistic, wind
       | blowing through grass looking almost realistic, hair, jelly. It
       | was usually Pixar shorts that showed what they had been focusing
       | on, and then we saw it applied in their movies.
       | 
       | Then mocap: mapping digital faces onto real actors, which was
       | mind-blowing to see for the first time in Pirates of the
       | Caribbean, then the apes in one of the Planet of the Apes
       | movies... So much in the CGI
       | industry has already reached a point where the hardest problems
       | seem to have been solved.
       | 
       | When I clicked play on the first synthesized dialogue in the
       | Dialogue Synthesis section, "Where did you go last summer? | I
       | went to Greece, it was amazing.", I was blown away. It's as if
       | we've now
       | reached one of those milestones where a problem appears to be
       | fixed or cracked. Machines will be able to really sound like
       | humans, indistinguishable from them.
       | 
       | 5-10 years ago, if you wanted to deal with TTS, the best option
       | you had was to let your Android phone render TTS into an audio
       | file, because everything else sounded really bad. Open-source
       | stuff especially sounded absolutely horrible.
       | 
       | So how long will it be until we will be able to download
       | something of this quality onto a future-gen Raspberry Pi which
       | can do some AI processing, where we make an HTTP call and it
       | starts speaking through the audio out in a perfect voice without
       | relying on the cloud? 5 years?
        
         | bckr wrote:
         | I would bet 2 years tops
        
         | amelius wrote:
         | Another question, how long until we have systems that can sing
         | 10 octaves and we don't need/want any actual human singers
         | anymore?
        
           | jayd16 wrote:
           | People like to sing along though.
        
             | tialaramex wrote:
             | People like playing drums too, but a drum machine means
             | that if you're not any good at it or too busy but you need
             | drum sounds you can have drum sounds.
             | 
             | There are rights issues if the result is it replaces a
             | particular singer. If you made it so that Sneaker Pimps can
             | fire Kelli but still have her voice on subsequent songs,
             | that's a problem. But suppose you're a bedroom musician,
             | and you realise you've got a piece that really wants
             | somebody with a different voice than yours to make it work
             | - you _can_ pay someone, but technology like this offers a
             | cheaper, easier option.
        
           | ttul wrote:
           | As a choral singer, if there's an app that one day allows me
           | to sing with a fake choir of extremely good singers, I would
           | enjoy doing that all day long. And it would allow my actual
           | choir to practice way more, making our performances far
           | better.
        
             | nraford wrote:
             | This exists right now!
             | 
             | Not as an app exactly, but you should check out Holly
             | Herndon and Mat Dryhurst's suite of tools called "Holly
             | Plus":
             | 
             | https://holly.plus/
             | 
             | I'm pretty sure you can access their model somehow and even
             | train your own voice using their "spawning" approach.
             | 
             | She did an awesome TED talk demonstrating this:
             | 
             | https://www.ted.com/talks/holly_herndon_what_if_you_could_s
             | i...
             | 
             | Here's a cool example, using Dolly Parton's song "Jolene":
             | 
             | https://www.youtube.com/watch?v=kPAEMUzDxuo
             | 
             | I don't think it's quite at the level of consumer use yet,
             | but I know they're working on it. Definitely check it out.
        
         | JonathanFly wrote:
         | >So how long will it be until we will be able to download
         | something of this quality onto a future-gen Raspberry Pi which
         | can do some AI processing, where we make an HTTP call and it
         | starts speaking through the audio out in a perfect voice
         | without relying on the cloud?
         | 
         | 5 years? It's probably possible roughly whenever the larger
         | Whisper models can run on it. Probably the next Raspberry Pi,
         | running quantized or optimized versions of some audio model.
         | 
         | It may be almost possible right now if you tried really, really
         | hard and used a small model fine-tuned on a single voice,
         | instead of something larger and more general-purpose that can
         | do any voice. I think whisper-tiny runs on a Pi in real time,
         | right? And that's without leveraging the Pi's GPU.
         | (https://github.com/ggerganov/whisper.cpp/discussions/166)
         | 
         | Edit: looks like medium is 30x slower on the Pi than tiny
         | model, so I may have been overly optimistic. I didn't realize
         | Whisper tiny was that much faster than medium.
         | 
         | This method works pretty well with Tortoise, letting you use
         | the super fast Tortoise quality settings but get quality
         | similar to the larger models. Fine-tuning the whole thing on
         | just one voice removes a lot of the cool capabilities of
         | course. With Tortoise, that would still be way too slow for a
         | Pi but potentially that same strategy could work with faster
         | models like SoundStorm.
         | 
         | In terms of quality there's still a long way to go on long-term
         | coherence, like long audio segments. When a real person reads
         | an audiobook, the words at the top of the page have a pretty
         | big impact on how the words at the bottom of the page are read.
         | And there can be some impact at any distance, page 10 to page
         | 300. When you try audiobooks on super high-end TTS models and
         | listen carefully, you really notice the mismatch. It's like the
         | reader recorded the paragraphs out of order, or like video game
         | voice lines where you can tell the actors recorded all the
         | lines separately and were not reacting to each other's
         | performance.
         | 
         | You can bump the context windows, a minute, two minutes. That's
         | gonna get you closer and probably good enough for some books.
         | In the short term a human could simply adjust all the audio
         | samples and manually tweak things to sound correct. So
         | this will enable fan-created audiobooks where they take the
         | time to get it right. But for fully automated books the
         | mismatch drives me nuts. The performance is just soooo close
         | for certain segments that when you get a tonal mismatch it
         | hurts.
        
         | nine_k wrote:
         | If you need a really compact form factor, you can buy a Jetson
         | right now and run more complex models on it. It's pricey,
         | though.
        
       | JonathanFly wrote:
       | Interesting that SoundStorm was trained to produce dialog between
       | two people using transcripts annotated with '|' marking changes
       | in voice. But the exact same '|' characters seem to mostly work
       | in the Bark model out of the box and also produce a dialog?
       | 
       | Maybe a third or a bit more of Bark outputs are a dialog of one
       | person talking to _themselves_, and it often misses a voice
       | change. But the pipe characters do reliably produce audio that
       | sounds like a _dialog_ in the performance style.
       | 
       | https://twitter.com/jonathanfly/status/1675987073893904386
       | 
       | Is there some text-audio data somewhere in the training data that
       | uses | for voice changes?
       | 
       | Amusingly, Bark tends to render the SoundStorm prompts
       | sarcastically. Not sure if that's a difference in style in the
       | models, or just Google cherry picking the more straightforward
       | line readings as the featured samples.
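         
       For readers unfamiliar with the convention, the '|' annotation
       can be read as a simple turn separator. A minimal Python sketch,
       assuming each '|' marks a voice change and that turns strictly
       alternate between two speakers; the split_turns helper and the
       "A"/"B" labels are hypothetical, not SoundStorm's or Bark's
       actual preprocessing:

```python
def split_turns(transcript, speakers=("A", "B")):
    """Split a '|'-delimited dialogue transcript into (speaker, text) turns.

    Assumption (hypothetical): every '|' marks a change of voice and the
    turns alternate between the given speakers in order.
    """
    # Drop empty segments (e.g. a trailing '|') before assigning speakers
    segments = [s.strip() for s in transcript.split("|") if s.strip()]
    # Cycle through the speaker labels, one per segment
    return [(speakers[i % len(speakers)], seg) for i, seg in enumerate(segments)]


turns = split_turns("Where did you go last summer? | I went to Greece, it was amazing.")
for speaker, text in turns:
    print(f"{speaker}: {text}")
```

       Alternating two labels is the simplest reading of the examples;
       a format with more than two voices would need explicit speaker
       tags rather than a bare separator.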
        
         | og_kalu wrote:
         | The creators won't say, as far as I know, but Bark looks to be
         | trained on a lot of YouTube corpora (rather than typical ML
         | audio datasets), where audio may have transcripts like that,
         | which would also explain why stuff like [laughs] works.
        
           | neilv wrote:
           | In the future, will children think it's normal to talk like,
           | "Hey, what up, Youtube! ... Be sure to like and subscribe!
           | ... Smash that like button! ... Let me know in the comments
           | down below!"?
           | 
           | I wonder how ML trained on the tone transitions to a
           | sponsored segment dripping with secret shame... would infect
           | general speech.
        
           | JonathanFly wrote:
           | Yeah I often try to think about what might be in a YouTube
           | caption when finding prompts that work in Bark. But the pipe
           | character isn't one I remember seeing on YouTube. Maybe it's
           | part of some other audio dataset though. Or maybe it's on
           | YouTube but only in non English videos.
        
       | butz wrote:
       | With all the recent advances, are there any decent TTS voices
       | for Linux that are not complicated for a regular user to set up?
        
       | elAhmo wrote:
       | This is nothing short of amazing. It is exciting, a bit scary as
       | well, what the future will bring.
       | 
       | It just makes me sad that I cannot open this page in Safari. It
       | will not play a single audio clip, yet Chrome plays them fine. So
       | here we are, able to generate audio, video, and code and do
       | amazing things with AI, but a simple website with text and audio
       | doesn't work on the most popular laptop out there.
        
       | mg wrote:
       | I wonder if work marketplaces like UpWork and Fiverr will adapt
       | quickly enough to this new situation, where many of their
       | services, which in the past were done by humans, can now be done
       | by software.
       | 
       | Their current marketplace interface seems inadequate for this.
       | Instead of contacting a human and then waiting for them to
       | finish the work, buyers will want to get results right away.
       | 
       | Therefore they will have to change their platform to work like
       | an app store, where sellers connect their services and buyers
       | can use these services.
        
         | seydor wrote:
         | > where many of their services, which in the past were done by
         | humans, can now be done by software.
         | 
         | Their users are already using AI to do the work that they are
         | supposed to do. I think that's fine.
        
         | throw47474777j wrote:
         | Why wouldn't people just use existing software markets?
        
           | mg wrote:
           | For example?
        
             | throw47474777j wrote:
             | App Stores, the web, etc. How else does software as a
             | service get sold? It's not a new thing. Probably a lot of
             | these things will just end up as features in existing
             | systems.
        
               | mg wrote:
               | Existing appstores like the ones on iOS and Android
               | mostly target casual use cases, mobile devices and on-
               | device software. Not "buy once" experiences for work via
               | software as a service. They also do not offer a unified
               | experience. Two "text-to-speech" apps could have
               | completely different user interfaces.
               | 
               | The web does not have good discovery and reputation
               | management and also does not provide a unified interface.
               | That is why marketplaces like Booking.com, Amazon,
               | Spotify, etc. have become so big.
        
         | Legend2440 wrote:
         | Why does everybody focus on "how will this replace humans?"
         | It's just a really good text-to-speech.
        
           | pjmlp wrote:
           | Maybe because I no longer hear friendly human voices at train
           | stations, only computer-generated train announcements?
           | 
           | While those people are now looking for jobs elsewhere.
        
             | Legend2440 wrote:
             | Fantastic! That's a massive efficiency gain.
             | 
             | We will not run out of productive things to do with our
             | time. Labor force participation has stayed in the 60-70%
             | range despite centuries of automation.
        
               | pjmlp wrote:
               | Lovely capitalism.
        
             | relativ575 wrote:
             | Announcements often get played repeatedly -- "Train 101 to
             | Lisbon is now on track 5". Why would you want to torture
             | station workers with that?
             | 
             | Instead, make an effort to start a conversation with your
             | fellow travelers, or graciously respond to such effort from
             | them. Apologize if you already do.
        
               | pjmlp wrote:
               | Better a tortured job that puts food on the table than
               | none at all.
        
               | cpill wrote:
               | tell that to the kids in Nike sweat shops
        
           | ImHereToVote wrote:
           | Personally I can't wait for all the streets to be lined with
           | the homeless like in SF. So good.
        
             | akaij wrote:
             | It's kinda sad to see you believe that this is the
             | inevitable outcome.
        
               | pjmlp wrote:
               | Well, if we imagine that the only things left will be
               | physical jobs that can't be done by computers.
               | 
               | At least until they get clever enough to start a
               | transformers line factory.
        
               | Legend2440 wrote:
               | This is the lump of labor fallacy. It's not about "what
               | jobs will be left", it's about the new jobs we'll invent
               | with all the time we'll have on our hands.
               | 
               | There was never a fixed number of jobs, there's a fixed
               | number of workers.
        
               | pjmlp wrote:
               | Well, we can also return to feudalism.
        
           | PhasmaFelis wrote:
           | Because it _will_ replace humans, and that's worth thinking
           | about?
        
       | nwoli wrote:
       | Seems like we wouldn't be far at all from just correlating this
       | to face movement (including subtle iris movement and blinks, not
       | just the mouth). As long as you clearly label it as CGI it's
       | harmless and I'm excited for the day to come. Might be quite fun
       | to chat with a little buddy this way
        
       | og_kalu wrote:
       | It's good that Bing, Bard are using the latest Microsoft, Google
       | Cloud offerings but it would be nice to see these speech advances
       | (along with audio palm - https://google-
       | research.github.io/seanet/audiopalm/examples/ etc) hit public
       | api's and/or user interfaces.
       | 
       | Bard's TTS is alright but it's clearly behind.
       | 
       | On that note, Bing's English/Korean TTS is really good. I also
       | didn't realize Microsoft uses the best offerings for free TTS on
       | edge so it blows google's default tts voices away.
        
         | jameszhao00 wrote:
         | Have you tried Google Cloud Studio voices?
         | 
         | https://cloud.google.com/text-to-speech/docs/wavenet#studio_...
        
           | og_kalu wrote:
           | Yes. I'm not saying Google's top Cloud offerings are bad,
           | although I still think Microsoft's stuff is better.
           | 
           | Just that
           | 
           | 1. It's behind their current SOTA research.
           | 
           | 2. You can only use those voices extensively by paying for
           | them. Microsoft offers their best stuff in Edge for free. So
           | for reading aloud a PDF or web page, Microsoft is far better.
        
             | jameszhao00 wrote:
             | By "SOTA" tts I think you mean LLM based TTS? With sound
             | and language tokens trained GPT style?
             | 
             | Without going into too much detail, imo they're not really
             | usable right now for TTS use cases.
        
             | skybrian wrote:
             | It's disappointing, but I wouldn't expect research
             | algorithms to be available immediately unless they held it
             | back until the product is ready. I guess Apple would do
             | that?
        
         | GordonS wrote:
         | I used Azure TTS for a product demo voice-over recently, and
         | nobody I showed it to knew it wasn't a human doing it!
         | 
         | Some of Azure's voices are better than others, and the TTS web
         | app has a few minor bugs, but overall I was really pleased with
         | the whole experience.
        
         | refulgentis wrote:
         | > I also didn't realize Microsoft uses the best offerings for
         | free TTS on edge so it blows google's default tts voices away.
         | 
         | This sounds really interesting - can you share a bit more? I'm
         | behind in this space, my parser got all jammed up, something
         | like: "Microsoft uses [the best offerings for free TTS](as in
         | FOSS libraries, or free as in beer SaaS?) [on edge](Edge
         | browser, or on the edge as in client's computer?)(Is the
         | implication that all TTS on the client's computer blows
         | Google's default TTS voices away?)"
        
           | GranPC wrote:
           | I believe they mean that the free TTS feature in Microsoft
           | Edge uses their best technology, and that said tech is better
           | than Google's default offering.
        
           | og_kalu wrote:
           | The top voices you'd pay for on Azure's TTS services can be
           | used for free to read web page (and PDF) text in Microsoft
           | Edge. I don't mean open source.
           | 
           | This is not the case with Google.
        
             | wg0 wrote:
             | I didn't know that. Edge is too good. I just downloaded it,
             | and features like this are great.
        
         | ShamelessC wrote:
         | > public api's and/or user interfaces
         | 
         | sigh. Google used to release _some_ models. Guess the fun early
         | days are coming to an end.
        
           | Legend2440 wrote:
           | Google is a business and this is clearly a valuable product.
        
             | ShamelessC wrote:
             | Sure, but there was a time not too long ago when companies
             | were still in the "good will" phase of handing out even
             | highly valuable models like CLIP, guided-diffusion, etc.
             | Come to think of it, it was mostly OpenAI doing this. And
             | they kinda still do? But far more selectively. I'm just
             | preemptively romanticizing that.
        
             | rasz wrote:
             | Product is something you sell to make money. The only real
             | Google product is users sold to advertisers.
        
               | vore wrote:
               | Uh, what about all of their paid cloud offerings?
        
               | rasz wrote:
               | Distraction. It generated a whole 1% of overall profit last
               | quarter, and that was the first time it didn't lose money.
               | https://www.cnbc.com/2023/04/25/googles-cloud-business-
               | turns...
        
               | jsnell wrote:
               | Google's non-advertising revenue in the latest quarter
               | was about $15 billion. Is that a significant amount of
               | non-ads product revenue? At least it is higher than the
               | revenue of any of IBM, HP, Oracle, Intel, Cisco, Netflix,
               | Broadcom, Qualcomm, or Salesforce in that same quarter.
               | 
               | I think their non-ads businesses alone would be the 6th
               | largest US tech company by revenue. (Amazon, Apple,
               | Microsoft, the ads business of Alphabet, Meta. Am I
               | forgetting something?)
        
               | rasz wrote:
               | Revenue is easy when you lose money on every dollar. Last
               | quarter Ads printed $21B of income, rest was a loss
               | except cloud not losing hundreds of $millions for the
               | very first time.
               | 
               | https://abc.xyz/assets/investor/static/pdf/2023Q1_alphabe
               | t_e...
        
             | joezydeco wrote:
             | [flagged]
        
               | vore wrote:
               | I don't want to defend Google's business practices, but
               | this is such a trite comment someone always feels
               | compelled to post on anything about Google, including
               | even a research paper, apparently.
        
               | joezydeco wrote:
               | I'll argue it's not trite. It's a concise compilation of
               | the thousands of teeth-gnashing comments here on HN and
               | all over the internet whenever Google randomly drowns
               | another one of its children.
               | 
               | Just fucking stay away from Google products. Period.
        
               | relativ575 wrote:
               | First of all, it isn't a product. It's a f*king research
               | paper, like dozens of others showing up on HN every day.
               | Most of them never become a product.
               | 
               | Second of all, by whining nauseatingly you drown out
               | discussions on the merits of the technology and chase
               | people away. I hardly read Google news on HN now
               | precisely for that reason. Imagine if "Attention
               | Is All You Need" came out now? [0]
               | 
               | Save your complaint for when Google makes it a product.
               | 
               | [0] - https://news.ycombinator.com/item?id=15938082
        
               | serf wrote:
               | >Save your complaint for when Google makes it a product.
               | 
               | or save yourself the trouble and find alternatives to
               | big-G.
               | 
               | It's entirely their own fault that people now view all
               | Google news as temporary and fleeting. People don't want
               | to put time into things that'll get thrown away in a
               | year.
               | 
               | Reading G research papers seems like a shortcut to me,
               | know what will be thrown away in 2 years before it's a
               | valid product in 1 year and someone gets huckleberry'd
               | into devoting time and effort into implementing the dead-
               | product-walking API.
        
               | signatoremo wrote:
               | > It's entirely their own fault that people now view all
               | Google news as temporary and fleeting. People don't want
               | to put time into things that'll get thrown away in a
               | year.
               | 
               | Most research doesn't become a product of its own, from
               | Google or anyone else. As research projects they still
               | have value, unless you are saying Google research is
               | garbage because Google has gotten into the habit of
               | canceling its products.
               | 
               | > or save yourself the trouble and find alternatives to
               | big-G
               | 
               | Totally valid point. No need to complain about it in a
               | post about Google research though. It's tiresome.
        
               | glimshe wrote:
               | It's a very relevant comment. It tells you to not rely,
               | or expect further development, on any new Google
               | technology, even seemingly good ones, as it can go to the
               | graveyard like many others.
        
               | georgemcbay wrote:
               | I don't bother to post the comment, but the high
               | likelihood of any Google project/product being killed
               | within a year or two is absolutely the first thought I
               | have whenever a new Google project/product is announced
               | (not because of HN posts, but because of their history),
               | so good job on that Google.
        
           | og_kalu wrote:
           | Ha i'm not even asking for code/model releases. It's just a
           | bit funny that what you can *pay* google to use is so far
           | behind what they have up and running collecting dust.
        
             | ShamelessC wrote:
             | Also true.
        
             | Raed667 wrote:
             | I'm speculating here, but for me it looks like the product
             | (R&D) teams are not working closely with the research
             | teams.
             | 
             | Even the demo website is on Github Pages instead of a
             | Google domain/blog.
        
       | asutekku wrote:
       | The most impressive part of this is that they are seemingly able
       | to produce 30 seconds of TTS from just 3 seconds of source
       | material. That is super cool and honestly much further along the
       | curve than I expected it to be.
        
       | tagyro wrote:
       | I've wasted (counting) about 300 seconds of my life listening to
       | these audio files and they all sound and seem fake...
        
         | svantana wrote:
         | I found that in my (high quality) studio monitors, the audio
         | sounded fine and hard to distinguish from 24kHz wav. But in
         | headphones, the artifacts were pretty obvious. So probably some
         | reverberation will do a lot to cover up artifacts. In the
         | paper, they only do a subjective comparison between the
         | generated audio and the _soundstream-encoded_ original audio,
         | which seems a bit disingenuous. Listening to soundstream audio
         | in headphones, I can hear those same artifacts.
        
         | jeffbee wrote:
         | Did you read the paper? They intentionally steered the quality
         | to ensure they sound fake. Their generated speech is "very easy
         | to detect" according to the reference at the end of the paper.
        
         | tagyro wrote:
         | just to be clear, one could mistake them for some (voice) actor
         | reading a book (maybe) but even to my untrained ear they sound
         | fake and artificial.
         | 
         | Am I missing something?
        
           | kvn8888 wrote:
           | It's meant to sound artificial. The focus is on speed and
           | consistency
        
       | willemmerson wrote:
       | I don't have anything intelligent to say about this, but it's a
       | LOT of fun making all the samples play at the same time, sort of
       | like the HTML version of Ableton Live.
        
       | [deleted]
        
       | anigbrowl wrote:
       | Good for fraudsters and spammers, bad for anyone who ever hoped
       | to make a living from voice acting. I'm perplexed by AI
       | technologists' seemingly incessant drive to automate away the
       | existence of artistic performers.
        
         | croes wrote:
         | Why spare artists if everyone else gets replaced by technology?
        
           | anigbrowl wrote:
           | They don't, otherwise there would be many former CEOs living
           | in tents. In reality, those who control large amounts of
           | capital are quite willing (and increasingly say so in the
           | open) to deprive others of their livelihoods, homes, and
           | ability to feed themselves in order to realize a marginal
           | increase in their own wealth.
        
         | Legend2440 wrote:
         | You are being deliberately pessimistic. There are a million
         | fantastic, practical uses for text-to-speech.
        
           | anigbrowl wrote:
           | I am not. The use cases like interactive assistants for the
           | blind will generate very little commercial activity compared
           | to the uses (and abuses) for entertainment and marketing
           | purposes. A good example of this from the real world is the
           | absence of cheap/open ASL interpretation for deaf people.
        
             | Legend2440 wrote:
             | Imagine having an app on your phone that turns any ebook
             | into an audiobook.
             | 
             | Imagine replacing crappy phone menus with polite virtual
             | assistants that actually understand what you're saying.
             | 
             | Imagine an AI language tutor that speaks every language in
             | the world fluently. Or a universal speech-to-speech
             | translator.
             | 
             | And that's just off the top of my head. Clever people will
             | come up with a lot more uses, I'm sure.
        
               | anigbrowl wrote:
               | I don't need your help imagining use cases; I've been in
               | this field a lot longer than you, and have talked up the
               | technological possibilities of AI-powered TTS here for
               | *years. I understand the technology very well and am
               | bullish on it. What I'm saying is that too much of the
               | effort is being spent in solving the wrong problems.
               | Please try reading what I wrote instead of your imaginary
               | subtext.
        
             | signatoremo wrote:
             | Ever notice the huge fonts on older people's phones? So
             | big that a screen may only contain a few lines of text. Or
             | that people have to pull out their reading glasses every
             | time they check their phone? Text-to-speech is a godsend in
             | that case. Enormous benefits to an increasingly older
             | population.
        
               | anigbrowl wrote:
               | 'helping blind people' was literally the first use case I
               | mentioned. Maybe you should have read the comment before
               | reacting to it.
        
               | signatoremo wrote:
               | Huh? How big is the blind group compared to the older
               | population?
               | 
               | You are saying it's not economical to use text to speech
               | to support blind people. I'm saying the benefits are huge
               | for older population. It isn't just for fraudsters or
               | spammers as you claim.
        
               | anigbrowl wrote:
               | No, I'm not saying that at all. I'm saying the resources
               | invested in helping people will be dwarfed by those
               | invested in crap designed to exploit them economically or
               | criminally.
        
               | signatoremo wrote:
               | Setting aside the fact that you have absolutely no proof of
               | that claim, the criminal world is tiny compared to the
               | people who benefit from TTS (God forbid that isn't the
               | case). Encryption, for example, is hugely beneficial to
               | regular people despite being used and exploited
               | extensively in shady and questionable activities.
        
         | wg0 wrote:
         | LLMs aren't great and can't be relied upon in a business
         | setting, or at least I wouldn't rely on them.
         | 
         | But think open world games. GTA VII for example where all NPCs
         | have their dialogs auto generated in real time but also
         | converted to audio in real time.
         | 
         | That's going to be a world that is a lot more spontaneous,
         | created with a lot less effort.
         | 
         | Right now, if memory serves me right, the GTA V dialogue alone
         | runs to 5000 pages or more, all hand-written.
        
           | anigbrowl wrote:
           | That's all true, but I think it's a pity that the jobs that
           | currently exist for voice artists will disappear. Gamers and
           | consumers will have somewhat better interactive experiences,
           | which is good. Indie game developers will also be able to put
           | out games with lower budgets, which is nice for them. But the
           | market for voice acting work is largely going to dry up and
           | blow away for people who are not already at the top of that
           | field. People who could previously have made a modest but
           | sufficient living as voice performers will be replaced by
           | computer-generated voices. It will be almost impossible to
           | make a living in that field within 5 years.
        
             | wg0 wrote:
             | Generative models for images are nothing new and have
             | been around for a while already. But even today, if you
             | really want creative control and expression, you need a
             | designer who's good with Photoshop, Illustrator, etc.
             | 
             | The same applies to LLMs. You can get them to write
             | plausible BS, but if you really want a well-articulated
             | write-up about something that is rooted in reality, you
             | have to bring a human on board.
             | 
             | This extends equally to voice-over. If you really want the
             | expressive and creative control to produce an outstanding
             | rendering of something, AI isn't going to cut it.
        
               | anigbrowl wrote:
               | This is only true if you assume AI isn't going to keep
               | improving. It gets significantly better on a quarterly
               | basis, far faster than the time it takes for an actor to
               | develop their craft and career. The output quality of
               | today's cutting-edge models would have been science
               | fiction only 2-3 years ago.
        
               | wg0 wrote:
               | I'm not so sure about the future. These models, all of
               | them, lack a well-understood input-output mapping, and
               | that's going to be a problem for a very long time.
        
       ___________________________________________________________________
       (page generated 2023-07-16 23:00 UTC)