[HN Gopher] OpenVoice: Versatile instant voice cloning
___________________________________________________________________
OpenVoice: Versatile instant voice cloning
Author : ulrischa
Score : 390 points
Date : 2024-03-29 07:50 UTC (15 hours ago)
(HTM) web link (research.myshell.ai)
(TXT) w3m dump (research.myshell.ai)
| andrewstuart wrote:
| If someone can come up with a voice cloning product that I can
| run on my own computer not the cloud, and if it's super simple to
| install and use, then I'll pay.
|
| I find it hard to understand why so much money is going into ai
| and so many startups are building ai stuff and such a product
| does not exist.
|
| It's got to run locally because I'm not interested in the
| restrictions that cloud voice cloning services impose.
|
| Complete, consumer level local voice cloning = payment.
| dsign wrote:
| I couldn't agree more.
|
| I've tried some of these ".ai" websites that do voice-cloning,
| and they tend to use the following dark strategy:
|
| - Demand you create a cloud account before trying.
|
| - Sometimes, demand you put your credit card before trying.
|
| - Always: the product is crap. Sometimes it does voice-cloning
| sort of as advertised, but you have to wait for the training
| and the execution in queue, because cloud GPUs are expensive
| and they need to manage a queue because it's a _cloud_
| product. At least that part could be avoided if they shipped a
| VST plugin one could run locally, even if it's restricted to
| NVidia GPUs[^2].
|
| [^1]: To those who say "but the devs must get paid": yes. But
| subscriptions misalign incentives, and some updates are
| simply not worth the minutes they cause in productivity lost
| while waiting for their shoehorned installation.
|
| [^2]: Musicians and creative types are used to spending a lot on
| hardware and software, and there are inference GPUs which are
| cheaper than some sample libraries.
| andrewstuart wrote:
| I don't mind if the software is a subscription it just has to
| be installable and not spyware garbage.
|
| Professional consumer level software like a game or
| productivity app or something.
| riwsky wrote:
| How do you figure subscriptions misalign incentives? The
| alternative, of selling upgrades, incentivizes devs to focus
| on new shiny shit that teases well. I'd rather they
| focus on making something I get value out of consistently.
| dsign wrote:
| - A one-off payment makes life infinitely simpler for
| accounting purposes. In my jurisdiction, a software license
| owned by the business is an asset and shows as that in the
| balance sheet, and can be subject to a depreciation
| schedule just as any other asset.
|
| - Mental peace: if product X does what I need right now and
| I can count that I will be able to use product X five years
| from now to do the same thing, then I'm happy to pay a lump
| sum that I see as an investment. Even better, I feel
| confident that I can integrate product X in my workflows. I
| don't get that with a subscription product in the hands of
| a startup seeking product-market fit.
| andoando wrote:
| I made a voice cloning site. https://voiceshift.ai No login,
| nothing required. It's a bit limited but I can add any of the
| RVC models. Working on a feature to just upload your own
| model.
|
| I can definitely make it a local app.
| smusamashah wrote:
| But this one is supposed to be runnable locally. It has
| complete instructions on Github including downloading models
| locally and installing python setting it up and running it.
| andrewstuart wrote:
| I'm wanting to download an installer and run it - consumer
| level software.
| ddtaylor wrote:
| I can show you how to use Bark AI to do voice cloning.
| rexreed wrote:
| What local hardware is needed to run Bark AI? What is the
| quality? Looking for something as good or better than Eleven
| Labs.
| ddtaylor wrote:
| It can run on CPU without much issue and takes up a few
| gigs of RAM and will produce output at about realtime. If you GPU
| accelerate you only need about 8GB of video memory and it
| will be at least 5X faster.
|
| Out of the box it's not as good as Eleven Labs based on
| their demos, but those are likely cherry picked. There are
| some tunable parameters for the Bark model and most
| consider the output high enough quality to pass into
| something else that can do denoising.
| mdrzn wrote:
| Please do!
| ipsum2 wrote:
| How much would you pay? I can make it.
| andrewstuart wrote:
| You can't sell this cause the license doesn't allow it.
| ddtaylor wrote:
| Bark is MIT licensed for commercial use.
| ipsum2 wrote:
| Not using this model, but something similar. How much would
| you pay?
| ipsum2 wrote:
| Based on the lack of replies, the answer appears to be
| $0.
| pmontra wrote:
| "This repository is licensed under a Creative Commons
| Attribution-NonCommercial 4.0 International License, which
| prohibits commercial usage"
|
| People could pay somebody for the service of setting up the
| model on their own hardware, then use the model for non
| commercial usage.
| GTP wrote:
| IANAL, but this looks like a grey area to me: it could be
| argued that the person/company getting paid to do the
| setup is using the model commercially.
| GTP wrote:
| Doesn't allow it _yet_, but on the readme, they write
| "This will be changed to a license that allows Free
| Commercial usage in the near future". So someone will soon
| be able to sell it to you.
| washadjeffmad wrote:
| I mean this at large, but I just can't get over this "sell me a
| product" mentality.
|
| You already don't need to pay; all of this is happening
| publication to implementation, open and local. Hop on Discord
| and ask a friendly neon-haired teen to set up TorToiSe or xTTS
| with cloning for you.
|
| Software developers and startups didn't create AGI, a whole lot
| of scientists did. A majority of the services you're seeing are
| just repackaging and serving foundational work using tools
| already available to everyone.
| TuringTest wrote:
| I agree, but playing devil's advocate, it's true that people
| without the time and expertise to set up their own install can
| find this packaging valuable enough to pay for it.
|
| It would be better for all if, in Open Source fashion, this
| software had a FLOSS easy-to-install packaging that provided
| for basic use cases, and developers made money by adapting it
| to more specific use cases and toolchains.
|
| (This one is not FLOSS in the classic sense, of course. The
| above would be valid for MIT-licensed or GPL models).
| nprateem wrote:
| You can extend that reasoning to anything, but time and
| energy are limited
| lancesells wrote:
| The answer is convenience. Why use dropbox when you can run
| Nextcloud? You can say the same thing about large companies.
| Why does Apple use Slack (or whatever they use) when they
| could build their own? Why doesn't Stripe build their own
| data centers?
|
| If I had a need for an AI voice for a project I would pay the
| $9 a month, use it, and be done. I might have the skills to
| set this up on my machine but it would take me hours to get
| up to speed and get it going. It just wouldn't be worth it.
| endisneigh wrote:
| I see these types of comments all the time, but fact is folks
| at large who wouldn't use the cloud version won't pay. The kind
| of person who has a 4090 to run these sort of models would just
| figure out how to do it themselves.
|
| The other issue is that paying for the software once doesn't
| capture as much of the value as a pay per use model, thus if
| you wanted to sell the software you'd either have to say you
| can only use it for personal use, or make it incredibly
| expensive to account for the fact that a competitor would just
| use it.
|
| Suppose there were such a thing - then folks may complain that
| it's not open source. Then it's open sourced, but then there's
| no need to pay.
|
| In any case, if you're willing to pay $1000 I'm sure many of us
| can whip something up for you. Single executable.
| palmfacehn wrote:
| XTTS2 works well locally. Maybe someone else here can recommend
| a front end.
| rifur13 wrote:
| Wow perfect timing. I'm working on a sub-realtime TTS (only on
| Apple M-series silicon). Quality should be on-par or better
| than XTTS2. Definitely shoot me a message if you're interested.
| jeroenhd wrote:
| RVC does live voice changing with a little latency:
| https://github.com/RVC-Project/Retrieval-based-Voice-Convers...
|
| The product isn't exactly spectacular, but most of the work
| seems to have been done. Just needs someone to go over the UI
| and make it less unstable, really.
| bsenftner wrote:
| Is there a service here somewhere? The video mentions lower
| expense, but I can't find any such service sign up... (ah, all
| the usage info is all on github)
|
| Has anyone tried self hosting this software?
| smusamashah wrote:
| The quality here is good (very good if I can actually run it
| locally). As per github it looks like we can run it locally.
|
| https://github.com/myshell-ai/OpenVoice/blob/main/docs/USAGE...
| 486sx33 wrote:
| Still a bit robotic but better highs and lows for sure. The
| Catalog is huge! Thanks for posting
| paraschopra wrote:
| yay!
| nonrandomstring wrote:
| I also lost my voice in a bizarre fire breathing accident and
| urgently need to log into my telephone banking account.
|
| Can anyone here give me a short list of relatively common ethical
| use cases for this technology. I'm trying to refute the
| ridiculous accusation that only deceptive, criminal minded people
| would have any use for this. Thanks.
|
| ----
|
| Edit: Thanks for the many good-faith replies. So far I see these
| breaking down into:
|
| Actual disability mitigation (of course I was joking about the
| circus accident). These are rare but valid. Who wouldn't want
| their _own_ restored.
|
| Entertainment and games
|
| Education and translation
|
| No further comment on the ethics here, but FWIW I'm nervously
| looking at this having just written a study module on social
| engineering attacks, stalking, harassment and confidence tricks.
| :/
|
| And yes, as bare tech, it _is_ very cool!
| BriggyDwiggs42 wrote:
| Idk but it's kinda cool
| RobotToaster wrote:
| Could be combined with translation to automatically create dubs
| for videos/tv/etc.
| Larrikin wrote:
| Home Assistant is making huge progress in creating an open
| source version of Alexa, Siri, etc. You can train it to use
| your voice, but the obvious use is celebrity voices for your
| home. Alexa had them, then took them away, and refused to
| refund people.
| diggan wrote:
| > but the obvious use is celebrity voices for your home
|
| Beside the fact that it seems more like an "entertainment" use
| case rather than "functional", is it really ethical to use
| someone's voice without asking/having rights to use it?
|
| Small concern, granted, but parent seems to have specifically
| asked for ethical use cases.
| ChrisMarshallNY wrote:
| I believe that a number of celebrities (I think Tom Hanks
| is one) have already sued companies for using deepfakes of
| their voices. Of course, the next year (in the US) is gonna
| see a _lot_ of stuff generated by AI.
| corobo wrote:
| I imagine Stephen Hawking would have found a use for this had
| it been available before everyone got used to his computer
| speaking voice. Anything that may cause someone to lose their
| ability to speak along the lines of your example.
|
| Another might be for placeholdering - you could use an array of
| (licensed and used appropriately) voices to pitch a TV show,
| film, radio show, podcast, etc to give people a decent idea of
| how it would sound to get financing and hire actual people to
| make the real version. Ofc you'll need an answer to "why don't
| we just use these AI voices in the actual production?" from
| people trying to save a few quid.
|
| Simple one- for fun. I'm considering AI cloning my voice and
| tinkering around until I find something useful to do with it.
| Maybe in my will I'll open source my vocal likeness as long as
| it's only to be used commercially as the voice of a spaceship's
| main computer or something. I'll be a sentence or two factoid
| in some Wikipedia article 300 years from now, haha.
|
| Universal translator - if an AI can replicate my voice it could
| have me speak all sorts of languages in real-time.. sucks to be
| a human translator admittedly in this use case. Once the tech
| is fully ironed out and reliable we could potentially even get
| rid of "official" languages (eg you have to speak fluent
| English to be an airline pilot - heck of a learning curve on
| top of learning to be a pilot if you're from a country that
| doesn't teach English by default!)
|
| I dunno if it'd be a weird uncanny valley thing, I wonder how
| an audiobook would sound reading a book in my own voice -
| unless I'm fully immersed in fiction that's generally how I
| take in a book, subvocalising with my voice in my head - maybe
| it'd help things bed in a bit better if it's my own voice
| reading it to me? If so I wonder how fast I could have AI-me
| read and still be able to take in the content with decent
| recall.. Might have to test this one!
|
| Splintering off the audiobook idea - I wonder if you could help
| people untrain issues with their speaking in this manner? Like
| would hearing a non-stuttering version of their voice help
| someone with a stutter? I am purely in the land of hypothesis
| at this stage, but might be worth trying! Even if it doesn't
| help in that way, the person with a stutter would at least have
| a fallback voice if they're having a bad day of it :)
|
| E: ooh, having an AI voice and pitch shifting it may help in
| training your voice to sound different, as you'd have something
| to aim for - "I knew I could do it because I heard it being
| done" sort of theory. The first example that popped into my
| head was someone transitioning between genders and wanting to
| adjust their voice to match the change.
|
| I imagine there's other fields where this may be useful too -
| like if you wanted a BBC news job and need to soften out your
| accent (if they still require Received Pronunciation, idk)
|
| Admittedly I could probably come up with more abuse cases than
| use cases if I put my mind to it, but wanted to stick to the
| assignment :)
| mywacaday wrote:
| Charlie Bird, a very well-known Irish journalist and
| broadcaster who recently passed away from motor neuron
| disease went through the process of getting a digitized
| version of his voice done as part of a TV program as he was
| losing his voice rapidly at the time. The result was very
| good as they had a large body of his news reports to train
| the model on. Most Irish people would be very familiar with
| his voice and the digitized version was very convincing. I
| would imagine something like this would be great for people
| who wouldn't have a huge volume of recordings to work off. A
| short video by the company that provided the tablet with his
| voice is here https://www.youtube.com/watch?v=UGjJHVUyi0M
| CapsAdmin wrote:
| The practical main use case I can think of is entertainment.
| Games could use it, either dynamically or prerecorded. Amateur
| videos could also use it for fun.
|
| Outside of that, more versatile text to speech is generally
| useful for blind people.
|
| More emotional and non-robotic narration of text can also be
| useful for non-blind people on the go.
| andrewmcwatters wrote:
| It would be neat to have your game client locally store a
| reference sentence on your system and generate voice chat for
| you at times when you couldn't speak and could only type.
| pmontra wrote:
| I want to use (almost) my own voice in an English video without
| my country's accent?
| raudette wrote:
| For creating games/entertainment/radio drama, allows 1 person
| to voice act multiple roles
| serbrech wrote:
| On the fly speech translation but in the voice of the speaker
| 7373737373 wrote:
| Or a different voice if the voice of the speaker or the way
| they talk is annoying
| idle_zealot wrote:
| It's mostly interesting to me for artistic applications, like
| voicing NPC or video dialog, or maybe as a digital assistant
| voice. Being able to clone existing voices would be useful for
| parody or fanworks, but I suspect that it is also possible to
| mix aspects of multiple voices to synthesize new ones to taste.
| napkin wrote:
| I'm currently using xtts2 to make language learning more
| exciting, by training models on speakers I wish to emulate. I'm
| really into voices, and this has helped tremendously for
| motivation when learning German.
| laurentlb wrote:
| I think there are lots of applications for good Text-To-Speech.
|
| Cloning a voice is a way to get lots of new voices to use as
| TTS.
|
| I'm personally building a website with stories designed for
| language learners. I'd like to have a variety of realistic
| voices in many languages.
| freedomben wrote:
| The reason I am looking for something, is because a friend of
| mine died of cancer, but left some voice samples, and I want to
| narrate an audiobook for his kids in his voice.
|
| In general, though, I agree, the legitimate use cases for
| something like this seem relatively minor compared to the
| illegitimate use cases. However, the technology is here, and
| simply depriving every one of it isn't going to stop the
| scammers, as has already been evidenced. In my opinion, the
| best thing for us to do is to rapidly get to a place where
| everybody knows that you cannot trust the voice on the other
| end anymore, as it could be cloned. Fortunately, the best way
| to accomplish that is also the same way that we allow average
| people to benefit from the technology: make it widely available
| nonrandomstring wrote:
| > In my opinion, the best thing for us to do is to rapidly
| get to a place where everybody knows that you cannot trust
| the voice on the other end anymore,
|
| Strongly agree with this. Sadly I don't think that transition
| to default distrust of voice will be rapid. We are wired at
| quite a low level to respond to voice emotionally, which
| bypasses our rational scepticism and vigilance. That's why
| this is a rather big win for the tricksters.
| _agt wrote:
| At my university, we're using this tech to insert minor
| corrections into lecture recordings (with instructor's consent
| of course). Far more efficient than bringing them into a studio
| for a handful of words, also less disruptive to content than
| overlaid text.
| RyanCavanaugh wrote:
| I'd really like to make some video content (on-screen graphics
| + voice), but the thought of doing dozens of voice takes and
| learning to use editing software is really putting me off from
| it. I'd really rather just write a transcript, polish it until
| I'm satisfied with it, and then have the computer make the
| audio for me.
|
| I'll probably end up just using OpenAI TTS since it's good
| enough, but if it could be my actual voice, I'd prefer that.
| dougmwne wrote:
| In related news, Voicecraft published their model weights today.
|
| https://github.com/jasonppy/VoiceCraft
| jasonjmcghee wrote:
| The quality of the output is really fantastic compared with other
| open source (next best XTTSv2).
|
| The voice cloning doesn't seem as high quality as other products
| I've used / seen demos for. Most of the examples match pitch
| well, but lose the "recognizable" aspect. The Elon one just
| doesn't sound like Elon, for example- interestingly the
| Australian accent sounds more like him.
| duggan wrote:
| With a bit of coaxing I managed to get this running on my M2 mac
| with Python 3.11.
|
| Updated setup.py (mostly just bumping versions), and demo output:
| https://gist.github.com/duggan/63b7de9b5f6e8e74fe4b05af64dbe...
| smashah wrote:
| Terrifying.
| riskable wrote:
| I know, right!? Soon everything is going to be AI-enabled and
| our toothbrushes will be singing us Happy Birthday!
| randkyp wrote:
| This is HN, so I'm surprised that no one in the comments section
| has run this locally. :)
|
| Following the instructions in their repo (and moving the
| checkpoints/ and resources/ folder into the "nested" openvoice
| subfolder), I managed to get the Gradio demo running. Simple
| enough.
|
| It appears to be quicker than XTTS2 on my machine (RTX 3090), and
| utilizes approximately 1.5GB of VRAM. The Gradio demo is limited
| to 200 characters, perhaps for resource usage concerns, but it
| seems to run at around 8x realtime (8 seconds of speech for about
| 1 second of processing time.)
|
| EDIT: patched the Gradio demo for longer text; it's way faster
| than that. One minute of speech only took ~4 seconds to render.
| Default voice sample, reading this very comment:
| https://voca.ro/18JIHDs4vI1v I had to write out acronyms -- XTTS2
| to "ex tee tee ess two", for example.
|
| The voice clarity is better than XTTS2, too, but the speech can
| sound a bit stilted and, well, robotic/TTS-esque compared to it.
| The cloning consistency is definitely a step above XTTS2 in my
| experience -- XTTS2 would sometimes have random pitch shifts or
| plosives/babble in the middle of speech.
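The speed figures quoted in the comment above can be expressed as a real-time multiple: seconds of speech produced per second of processing. A minimal sketch of that arithmetic, using only the approximate numbers reported there; the function name is made up for illustration and is not part of OpenVoice.

```python
def realtime_multiple(audio_seconds: float, processing_seconds: float) -> float:
    """Return seconds of speech produced per second of processing time."""
    if processing_seconds <= 0:
        raise ValueError("processing_seconds must be positive")
    return audio_seconds / processing_seconds


# Approximate figures from the comment (a single RTX 3090 run).
gradio_demo = realtime_multiple(8, 1)    # 8 s of speech in about 1 s
patched_demo = realtime_multiple(60, 4)  # 1 min of speech in about 4 s

print(f"Gradio demo:  {gradio_demo:.1f}x realtime")
print(f"Patched demo: {patched_demo:.1f}x realtime")
```

By this measure the patched run is roughly twice as fast as the first estimate, which is consistent with the "way faster than that" remark.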
| bambax wrote:
| I am trying to run it locally but it doesn't quite work for me.
|
| I was able to run the demos all right, but when trying to use
| another reference speaker (in demo_part1), the result doesn't
| sound at all like the source (it's just a random male voice).
|
| I'm also trying to produce French output, using a reference
| audio file in French for the base speaker, and a text in
| French. This triggers an error in api.py line 75 that the
| source language is not accepted.
|
| Indeed, in api.py line 45 the only two source languages allowed
| are English and Chinese; simply adding French to
| language_marks in api.py line 43 avoids errors but produces a
| weird/unintelligible result with a super heavy English accent
| and pronunciation.
|
| I guess one would need to generate source_se again, and
| probably mess with config.json and checkpoint.pth as well, but
| I could not find instructions on how to do this...?
|
| Edit -- tried again on https://app.myshell.ai/ The result
| sounds French alright, but still nothing like the original
| reference. It would be absolutely impossible to confuse one
| with the other, even for someone who didn't know the person
| very well.
| randkyp wrote:
| I played with it some more and I have to agree. For actual
| voice _cloning_, XTTS2 sounds much, much closer to the
| original speaker. But the resulting output is also much more
| unpredictable and sometimes downright glitchy compared to
| OpenVoice. XTTS2 also tries to "act out" the implied
| emotion/tone/pitch/cadence in the input text, for better or
| worse.
|
| But my use case is just to have a nice-sounding local TTS
| engine, and current text-to-phoneme conversion quirks aside,
| OpenVoice seems promising. It's fast, too.
| echelon wrote:
| And StyleTTS2 generalizes out of domain even better than
| that.
| epiccoleman wrote:
| I have got to build or buy a new computer capable of playing
| with all this cool shit. I built my last "gaming" PC in 2016,
| so its hardware isn't really ideal for AI shenanigans, and my
| Macbook for work is an increasingly crusty 2019 model, so
| that's out too.
|
| Yeah, I could rent time on a server, but that's not as cool as
| just having a box in my house that I could use to play with
| local models. Feels like I'm missing a wave of fun stuff to
| experiment with, but hardware is expensive!
| beardedwizard wrote:
| I would love a recommendation for an off the shelf "gpu
| server" good for most of this that I can run at home.
| lakomen wrote:
| I'm clueless about AI, but here's a benchmark list
| https://www.videocardbenchmark.net/high_end_gpus.html
|
| Imo the 4070 super is the best value and consumes the least
| amount of Watts, 220 in all the top 10.
|
| So anything with one and some ECC RAM aka AMD should be
| fine. Intel non-xeons need the expensive w680 boards and
| very specific RAM per board.
|
| ECC because you wrote server. We're professionals here
| after all, right?
| antonvs wrote:
| What if I enjoy gambling with cosmic ray bitflips?
| GTP wrote:
| Maybe they would make your AI model evolve into an AGI
| over time :D
| lardo wrote:
| CivitAI has one https://civitai.com/builds
| macrolime wrote:
| Mac Studio or macbook pro if you want to run the larger
| models. Otherwise just a gaming pc with an rtx 4090 or a
| used rtx 3090 if you want something cheaper. A used dual
| 3090 can also be a good deal, but that is more in the build
| it yourself category than off the shelf.
| 101008 wrote:
| Sorry if this is a silly question - I was never a Mac
| user, but I quickly googled Mac Studio and it seems it's
| just the computer. Can I plug it to any monitor / use any
| keyboard and mouse, or do I need to use everything from
| Apple with it?
| timschmidt wrote:
| Any monitor and keyboard will work, however Apple
| keyboards have a couple extra keys not present on Windows
| keyboards so require some key remapping to allow access
| to all typical shortcut key combinations.
| spectre3d wrote:
| Mainly to swap the Windows and Alt keys, which you can do
| in System Settings without any additional software.
|
| If you use a mouse with more than right-click and scroll
| wheel, with side buttons for example, then you'll need
| extra software.
| macrolime wrote:
| You can, but with some caveats. Not all screen
| resolutions work well with MacOS, though using
| BetterDisplay it will still usually work. If you want
| touch id, it's better to get the Magic Keyboard with
| touch id.
| pksebben wrote:
| I went the 4090 route myself recently, and I feel like
| all should be warned - memory is a major bottleneck. For
| a lot of tasks, folks may get more mileage out of
| multiple 3090s if they can get them set up to run in
| parallel.
|
| Still waiting on being able to afford the next 4090 +
| egpu case et al. There are a lot of things this rig
| struggles with running OOM, even on inference with some
| of the more recent SD models.
| ckl1810 wrote:
| Depending on what models you want to run, RTX 4090 or RTX
| 3090 may not be enough.
|
| Grok-1 was running on a M2 Ultra with 192GB of ram.
|
| https://twitter.com/ibab_ml/status/1771340692364943750
| holtkam2 wrote:
| I'm in exactly the same boat. Yeah ofc you can run LMs on
| cloud servers but my dream project would be to construct a
| new gaming PC (mine is too old) and serve a LM on it, then
| serve an AI agent app which I can talk to from anywhere.
|
| Has anyone had luck buying used GPUs, or is that something I
| should avoid?
| sangnoir wrote:
| > its hardware isn't really ideal for AI shenanigans
|
| FWIW, I was in the same boat as you and decided to start
| cheap: old game machines can handle AI shenanigans just fine
| with the right GPU. I use a 2017 workstation (Zen1) and an
| Nvidia P40 from around the same time, which can be had for
| <$200 on ebay/Amazon. The P40 has 24GB VRAM, which is more
| than enough for a good chunk of quantized LLMs or diffusion
| models, and is in the same perf ballpark as the free Colab
| tensor hardware.
|
| If you're just dipping your toes without committing, I'd
| recommend that route. The P40 is a data center card and
| expects higher airflow than desktop GPUs, so you probably
| have to buy a "blow kit" or 3D-print a fan shroud and ensure
| they fit inside your case. This will be another $30-$50. The
| bigger the fan, the quieter it can run. If you already have a
| high-end gamer PC/workstation from 2016, you can dive into
| local AI for $250 all-in.
|
| Edit: didn't realize how cheap P40s now are! I bought mine a
| while back.
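A rough way to sanity-check the "24GB is enough for a good chunk of quantized LLMs" claim above is to estimate the size of the weights alone: parameters times bits per weight, divided by 8. The sketch below is a back-of-the-envelope estimate under stated assumptions: it ignores the KV cache, activations, and framework overhead, and the 2 GB headroom is an arbitrary placeholder, not a measured figure.

```python
def weight_gigabytes(params_billions: float, bits_per_weight: float) -> float:
    """Approximate size of the model weights in GB (1 GB = 1e9 bytes)."""
    return params_billions * bits_per_weight / 8


def fits_in_vram(params_billions: float, bits_per_weight: float,
                 vram_gb: float = 24.0, headroom_gb: float = 2.0) -> bool:
    """True if the weights plus an assumed headroom fit in the given VRAM."""
    return weight_gigabytes(params_billions, bits_per_weight) + headroom_gb <= vram_gb


# Hypothetical model sizes, 4-bit quantization, 24 GB card.
for size in (7, 13, 34, 70):
    gb = weight_gigabytes(size, 4)
    verdict = "fits" if fits_in_vram(size, 4) else "does not fit"
    print(f"{size}B @ 4-bit: ~{gb:.1f} GB of weights, {verdict}")
```

Real memory use depends heavily on context length and the runtime, so treat the result as a lower bound rather than a guarantee.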
| zoklet-enjoyer wrote:
| I forgot all about Vocaroo!
| causi wrote:
| We're so close to me being able to open a program, feed in an
| epub, and get a near-human level audiobook out of it. I'm so
| excited.
| aedocw wrote:
| Give https://github.com/aedocw/epub2tts a look, the latest
| update enables use of MS Edge cloud-based TTS so you don't
| need a local GPU and the quality is excellent.
| jurimasa wrote:
| I think this is creepy and dangerous as fuck. Not worth the
| trouble it will be.
| CamperBob2 wrote:
| Other sites beckon.
| _zoltan_ wrote:
| you're gonna be REALLY surprised out there in the real
| world.
| aftbit wrote:
| I want to try chaining XTTS2 with something like RVCProject.
| The idea is to generate the speech in one step, then clone a
| voice in the audio domain in a second step.
| joshspankit wrote:
| Does anyone know which local models are doing the "opposite":
| Identify a voice well enough to do speaker diarization across
| multiple recordings?
| Drakim wrote:
| On my wishlist would be a local model that can generate new
| voices based on descriptions such as "rough detective-like hard
| boiled man" or "old fatherly grampa"
| mattferderer wrote:
| You might be interested in this cool app that Microsoft made
| that I don't think I've seen anyone talk about anywhere
| called Speech Studio. https://speech.microsoft.com/
|
| I don't recall their voices being the most descriptive but
| they had a lot. They also let you lay out a bunch of text & have
| different voices speak each line just like a movie script.
| satvikpendem wrote:
| Whisper can do diarization but not sure it will "remember" the
| voices well enough. You might simply have to stitch all the
| recordings together, run it through Whisper to get the diarized
| transcript, then process that how you want.
| beardedwizard wrote:
| Whisper does not support diarization. There are a number of
| projects that try to add it.
| Teleoflexuous wrote:
| Whisper doesn't, but WhisperX
| <https://github.com/m-bain/whisperX/> does. I am using it right
| now and it's perfectly serviceable.
|
| For reference, I'm transcribing research-related podcasts,
| meaning speech doesn't overlap a lot, which would be a problem
| for WhisperX from what I understand. There's also a lot of
| accents, which are straining on Whisper (though it's also doing
| well), but surely help WhisperX. It did have issues with
| figuring out the number of speakers on its own, but that
| wasn't a problem for my use case.
| joshspankit wrote:
| WhisperX does diarization, but I don't see any mention of it
| fulfilling my ask which makes me think I didn't communicate
| it well.
|
| Here's an example for clarity:
|
| 1. AI is trained on the voice of a podcast host. As a side
| effect it now (presumably) has all the information it needs
| to replicate the voice
|
| 2. All the past podcasts can be processed with the AI
| comparing the detected voice against the known voice which
| leads to highly-accurate labelling of that person
|
| 3. Probably a nice side bonus: if two people with different
| registers are speaking over each other the AI could separate
| them out. "That's clearly person A and the other one is
| clearly person C"
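The matching step in the example above (comparing a detected voice against a known voice) is often done by comparing fixed-length speaker embeddings. A minimal sketch, assuming some model has already turned each audio segment into a vector. The embedding values, speaker names, and the 0.75 threshold are made up for illustration, and none of this is the API of any tool named in the thread.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def label_segment(segment_embedding, known_speakers, threshold=0.75):
    """Return the best-matching known speaker, or "unknown" below the threshold."""
    best_name, best_score = None, -1.0
    for name, reference in known_speakers.items():
        score = cosine_similarity(segment_embedding, reference)
        if score > best_score:
            best_name, best_score = name, score
    return best_name if best_score >= threshold else "unknown"


# Hypothetical enrolled voices (toy 3-dimensional embeddings).
known = {
    "host": [0.9, 0.1, 0.2],
    "guest": [0.1, 0.8, 0.5],
}

print(label_segment([0.85, 0.15, 0.25], known))  # close to the host vector
print(label_segment([-0.5, 0.1, -0.7], known))   # far from both references
```

Because the reference embeddings are stored once and reused, the same labels can be applied consistently across many recordings, which is the part plain per-file diarization does not give you.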
| c0brac0bra wrote:
| You can check out PicoVoice Eagle (paid product):
| https://picovoice.ai/docs/eagle/
|
| You pass N number of PCM frames through their trainer and
| once you reach a certain percentage you can extract an
| embedding you can save.
|
| Then you can identify audio against the set of identified
| speakers and it will return percentage matches for each.
| c0brac0bra wrote:
| Picovoice says they do this but it's a paid product. It
| supposedly runs on the device but you still need a key and have
| to pay per minute.
| lordofgibbons wrote:
| I've noticed that all TTS systems have a "metallic" sound to them.
| Can this be fixed automatically using some kind of
| post-processing?
| huytersd wrote:
| Try cutting out some of the highs?
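"Cutting out some of the highs" amounts to low-pass filtering. A minimal sketch of a one-pole low-pass filter in plain Python, run on synthetic sine waves rather than real TTS output. The 4000 Hz cutoff and 16 kHz sample rate are arbitrary assumptions, a one-pole filter is very gentle, and whether this actually reduces a "metallic" timbre is untested here.

```python
import math


def one_pole_lowpass(samples, sample_rate, cutoff_hz):
    """Apply y[n] = y[n-1] + alpha * (x[n] - y[n-1]) to a list of samples."""
    rc = 1.0 / (2.0 * math.pi * cutoff_hz)
    dt = 1.0 / sample_rate
    alpha = dt / (rc + dt)
    filtered = []
    previous = 0.0
    for sample in samples:
        previous = previous + alpha * (sample - previous)
        filtered.append(previous)
    return filtered


def peak(samples):
    """Largest absolute sample value."""
    return max(abs(s) for s in samples)


rate = 16000  # assumed sample rate in Hz
low_tone = [math.sin(2 * math.pi * 200 * n / rate) for n in range(rate)]
high_tone = [math.sin(2 * math.pi * 7000 * n / rate) for n in range(rate)]

low_out = one_pole_lowpass(low_tone, rate, cutoff_hz=4000)
high_out = one_pole_lowpass(high_tone, rate, cutoff_hz=4000)

# The high tone should come out noticeably quieter than the low tone.
print(f"200 Hz peak after filter:  {peak(low_out[rate // 2:]):.2f}")
print(f"7000 Hz peak after filter: {peak(high_out[rate // 2:]):.2f}")
```

For real audio a steeper filter or an equalizer in an audio editor would be the more usual tool; this only shows the basic idea.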
| muglug wrote:
| It's funny how a bunch of models use Musk's voice as a proof of
| their quality, given how disjointed and staccato he sounds in
| real life. Surely there are better voices to imitate.
| iinnPP wrote:
| Proving the handling of uncommon speech is definitely a great
| example to use alongside the other common and uncommon speech
| examples on the page.
| ianschmitz wrote:
| Especially with all of the crypto scams using Elon's voice
| tonnydourado wrote:
| I might be missing something, but what are the non-questionable,
| or at least non-evil, uses of this technology? Because every
| single application I can think of is fucked up: porn, identity
| theft, impersonation, replacing voice actors, stealing the
| likeness of voice actors, replacing customer support without
| letting the customers know you're using bots.
|
| I guess you could give realistic voices to people that lost their
| voices by using old recordings, but there's no way that this is a
| market that justifies the investment.
| swores wrote:
| What about for remembering lost loved ones? There are dead
| people I would love to hear talk again, even if I know it's not
| their personality talking just their voice (and who knows,
| maybe with LLM training on a single person it could even be
| roughly their personality, too).
|
| I can imagine a fairly big market of both people setting it up
| before they die, with maybe a whole load of written content and
| a schedule of when to have it read in future, and people who've
| just lost someone, and want to recreate their voice to help
| remember it.
| tonnydourado wrote:
| > I can imagine a fairly big market (...)
|
| I can't, and if I could, I think this would be fairly
| dystopian. Didn't Black Mirror have an episode about
| something similar? I vaguely remember an Asimov/Arthur C.
| Clarke short story about the implications of time travel (ish)
| tech in a similar context. Sounds like a case of "we've built
| the torment nexus from classic sci-fi novel 'do not build the
| torment nexus'"
| dotancohen wrote:
| Jack Crusher did something similar for Wesley.
| grugagag wrote:
| We already have ways to preserve the voices of people past
| their lives. Cloning their voices and writing things in their
| names is not only wrong but deceptive.
| wdb wrote:
| You can use it to easily fix voice overs on your videos without
| needing to re-record etc.
| tonnydourado wrote:
| Reasonable, but I'm skeptical of the market
| CuriouslyC wrote:
| Text to speech is very close to being able to replace voice
| actors for a lot of lower budget content. Voice cloning will
| let directors and creators get just the sound they want for
| their characters, imagine being able to say "I want something
| that sounds like Harrison Ford with a French accent." Of
| course, there are going to be debates about how closely you can
| clone someone's voice/diction/etc, both extremes are wrong -
| perfect cloning will hurt artists without bringing extra value
| to directors/creators, but if we outlaw things that sound
| similar the technology will be neutered to uselessness.
| tonnydourado wrote:
| That's basically replacing voice actors and stealing their
| likeness: both are arguably evil, and mentioned. So, I
| haven't missed them.
|
| P.S.: "but what about small, indie creators" that's not who's
| gonna embrace this the most, it's big studios, and they will
| do it to fuck over workers.
| CuriouslyC wrote:
| As someone involved in the AI creator sphere, that's a very
| cold take. Big studios pay top shelf voice talent to create
| the best possible experience because they can afford it. Do
| you think Blizzard is using AI to voice
| Diablo/Overwatch/Warcraft? Of course not. On the other
| hand, there are lots of small indie games being made now
| that utilize TTS, because the alternative is no voice, the
| voice of a friend or a very low quality voice actor.
|
| Do I want to have people making exact clones of voice
| actors? No. The problem is that if you say "You can't get
| 90% close to an existing voice actor" then the technology
| will be able to create almost no human voices, it'll
| constantly refuse like Gemini, even when the request is
| reasonable. This technology is incredibly powerful and
| useful, and we shouldn't avoid using it because it'll force
| a few people to change careers.
| tonnydourado wrote:
| Have you seen how big studios treat vfx artists? They
| absolutely will replace voice actors with AI.
|
| Also:
|
| > This technology is incredibly powerful and useful
|
| At what, exactly? The only "useful" case you presented is
| "actually, replacing voice actors with AI isn't so bad".
| CuriouslyC wrote:
| You want a world where only the rich can create beautiful
| experiences. You're either rich or short sighted.
|
| Edit: If you've got a cadre of volunteer voice actors
| that don't suck hidden somewhere, you need to share
| buddy. That's the only way your comments make sense.
| tonnydourado wrote:
| I don't know what else to tell you, I just think people
| deserve to be paid for the work they do.
|
| Your vision of a world where anyone can create voice for
| their projects for cheap CAN NOT exist without someone
| getting exploited. Nor is it sustainable, really.
|
| You said that this world would be worth some people
| losing their careers, but what do we gain? More
| games/audiobooks of questionable quality? Is this really
| worth fucking a whole profession over?
| CuriouslyC wrote:
| We agree that people should be paid for the work that
| they *DO*. Your view smacks of elitism, and voice actors
| don't have any more right to be able to make decent money
| peddling their voice than indie game devs have to peddle
| games with synthetic voices.
| tonnydourado wrote:
| Your view smacks of contempt for workers, particularly in
| the arts. Especially the emphasis on "do", as if voice
| actors don't actually work, and just live off royalties or
| something. The kind of worldview that the rich and the
| deluded working poor tend to share.
| amarant wrote:
| Professions disappear, it's a natural side effect of
| progress. Stablehands aren't really that common anymore,
| because most people drive cars instead of horses.
|
| I really hope we can deprecate a whole bunch of
| professions related to fossil fuels, including coal
| miners and oil drillers etc.
|
| I sympathise with the people working in those
| professions, I do, but times change and professions come
| and go, and I don't buy the argument that we should stop
| inventing new stuff because it might outcompete people.
|
| As for positive uses of this technology, it might be used
| to immortalise a voice actor. For example Sir David
| Attenborough probably won't be around forever, but thanks
| to this technology, his iconic voice might be!
| wsintra2022 wrote:
| I made an e-book of Carl Rogers narrated by David
| Attenborough, turned out decent. I used Coqui.ai, who sadly
| have closed with all my API credits
| Osmose wrote:
| You have a narrow view of what a beautiful experience is.
| It does not require professional-level voice acting.
|
| It is not unfair that, in order to have voice acting, you
| must have someone perform voice acting. You don't have
| the natural right to professional-level voice acting for
| free, nor do you need it to create beautiful things.
|
| The tech is simply something that may be possible, and it
| has tradeoffs, and claiming that it's an accessibility
| problem does not grant you permission to ignore the
| tradeoffs.
| ben_w wrote:
| > You don't have the natural right to professional-level
| voice acting for free
|
| I also don't have the natural right to work as a
| professional-level voice actor.
|
| "Natural rights" aren't really a thing, the phrase is a
| thought-terminating cliche we use for the rhetorical
| purpose of saying something is good or bad without having
| to justify it further.
|
| > The tech is simply something that may be possible, and
| it has tradeoffs, and claiming that it's an accessibility
| problem does not grant you permission to ignore the
| tradeoffs.
|
| A few times as a kid, I heard the meme that the American
| constitution allows everything then tells you what's
| banned, the French one bans everything then tells you
| what's allowed, and the Soviet one tells you nothing and
| arrests you anyway.
|
| It's not a very accurate meme, but still, "permission" is
| the wrong lens: it's allowed until it's illegal. You want
| it to be illegal to replace voice actors with synthetic
| voices, you need to campaign to make it so as this isn't
| the default. (Unlike with using novel tech for novel
| types of fraud, where fraud is already illegal and new
| tech doesn't change that).
| Riverheart wrote:
| "You want a world where only the rich can create
| beautiful experiences. You're either rich or short
| sighted."
|
| Being rich to create a beautiful experience is neither
| required nor does it require a synthetic voice to
| achieve.
|
| It does require effort and being rich can reduce that
| effort for sure.
| ceejayoz wrote:
| > Do you think Blizzard is using AI to voice
| Diablo/Overwatch/Warcraft? Of course not.
|
| Do you think Blizzard won't when the tech gets cheap and
| good enough?
| CuriouslyC wrote:
| Probably not, because the voice actors are a community
| draw. In fact, one of the top threads in the overwatch
| subreddit right now is pictures of all the voice actors.
| They go to cons and interact with fans and they don't
| cost so much that losing that value to save a few bucks
| is worth it.
| Osmose wrote:
| The lightness with which you treat forcing tens of
| thousands of people to change their career is absurd.
| Indie games are hardly suffering for a lack of voice
| acting, even if you only look at it from a market
| perspective and ignore that voice acting is a creative
| interpretation and not simply reading the words the way
| the director wants.
|
| Yes, we should avoid using it because it will upend the
| lives of a significant amount of artists for the primary
| benefit of "some indie games will have more voice acting
| and big game companies will be able to save money on
| voice actors". That's not worth it, how could you think
| it is?
| ben_w wrote:
| > The lightness with which you treat forcing tens of
| thousands of people to change their career is absurd.
|
| _Only_ tens of thousands? Cute. For most of the 2010s, I
| was expecting self-driving cars to imminently replace
| truck drivers, which is a few million in the US alone
| and I think around 40-45 million worldwide. I still do
| expect AI to replace humans for driving, I just don't
| know how long it will take. (I definitely wasn't
| expecting "creative artistry" to be an easier problem
| than "don't crash a car", I didn't appreciate that nobody
| minds if even 90% of the hands have 6 fingers while
| everyone minds if a car merely equals humans by failing
| to stop in 1 of every (3.154e7 seconds per year * 1.4e9
| vehicles / 30000 human driving fatalities per year ~=
| 1.47e+12) seconds of existence).
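|
| The arithmetic in that parenthesis can be checked directly. A quick
| sketch using only the figures quoted above (the variable names are
| illustrative):

```python
# Figures as quoted in the comment; not independently verified.
seconds_per_year = 3.154e7
vehicles = 1.4e9
fatalities_per_year = 30000

# Vehicle-seconds of existence per human driving fatality.
vehicle_seconds_per_fatality = seconds_per_year * vehicles / fatalities_per_year
print(f"{vehicle_seconds_per_fatality:.3g}")  # prints 1.47e+12
```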
|
| Almost every nation used to be around 90% farm workers,
| now it's like 1-5% (similar numbers to truckers) and even
| those are scared of automation; the immediate change was
| to factory jobs, but those too have shifted into service
| roles because of automation of the former, and the rest
| are scared of automation (and outsourcing).
|
| Those service-sector roles? "Computer" used to be a job;
| Graphical artists are upset about Stable Diffusion;
| Anyone working with text, from Hollywood script writers
| to programmers to lawyers, is having to justify their own
| wages vs. an LLM (for now, most of us are winning this
| argument; but for how long?)
|
| We get this wrong, it's going to be a disaster; we get it
| right, we're all living better than the 0.1%.
|
| > Indie games are hardly suffering for a lack of voice
| acting, even if you only look at it from a market
| perspective and ignore that voice acting is a creative
| interpretation and not simply reading the words the way
| the director wants.
|
| I tried indie game development for a bit. I gave up with
| something like £1,000 in my best year. (You can probably
| double that to account for inflation since then).
|
| This is because the indie game sector is also not
| suffering from a lack of developer talent, meaning
| there's a lot of competition that drives prices below the
| cost of living. Result? Hackathons where people compete
| for the fun of it, not for the end product. Those
| hackathons are free to say if they do or don't come with
| rules about GenAI; but in any case, they definitely come
| with no budget.
|
| > Yes, we should avoid using it because it will upend the
| lives of a significant amount of artists for the primary
| benefit of "some indie games will have more voice acting
| and big game companies will be able to save money on
| voice actors". That's not worth it, how could you think
| it is?
|
| A few hours ago I was in the Deutsches Technikmuseum;
| there's a Jacquard Loom by the cafe:
| https://technikmuseum.berlin/ausstellungen/dauerausstellunge...
|
| The argument you give here is much the same argument used
| against that machine, back in the day:
| https://spectrum.ieee.org/the-jacquard-loom-a-driver-of-the-...
|
| Why do you think those textile workers lost the argument?
|
| And to pre-empt what I think is a really obvious counter,
| I would also add that the transition we face must be
| handled with care and courtesy to the economic fears --
| to all those who read my comment and think "and therefore
| this will be easy and we should embrace it, just dismiss
| the nay-sayers as the Luddites they are": why do you
| think Karl Marx wrote the Communist Manifesto?
| waterhouse wrote:
| Suppose all existing voice actors, and, to be maximally
| generous, everyone who had spent >1 year training to be a
| voice actor, was given a pension for some years, paying
| them the greater of their current income or some average
| voice actor income. And then there would be no limits on
| using AI voices to substitute for voice actors.
|
| Would you be happy with that outcome, or do you have
| another objection?
| allannienhuis wrote:
| I don't disagree with the thought that large companies are
| going to try to use these technologies too, with typical
| lack of ethics in many cases.
|
| But some of this thinking is a bit like protesting the use
| of heavy machinery in roadbuilding/construction, because it
| displaces thousands of people with shovels. One difference
| with this type of technology is that the means to use it
| doesn't require massive amounts of capital like the heavy
| machinery example, so more of those shovel-wielders will be
| able to compete with those that are only bringing capital
| to the table.
| tonnydourado wrote:
| I'm not saying that this should be forbidden or
| something. I just wonder what is the motivation for the
| people pitching and actually developing this. I'm all for
| basic, non-profit-driven, research, but at some point you
| gotta ask yourself "what am I helping create here?"
| CrazyStat wrote:
| Saying something is evil would seem to suggest that you
| think it should be forbidden. Maybe you should choose a
| different word if that's not your intention.
| ben_w wrote:
| I disagree on three of your points.
|
| It is creating a new and fully customisable voice actor
| that perfectly matches a creative vision.
|
| To the extent that a skilled voice actor can already blend
| existing voices together to get, say, French Harrison Ford,
| for it to be evil for a machine to do it would require it
| to be evil for a human to do it.
|
| Small indie creators have a budget of approximately
| nothing, this kind of thing would allow them to voice _all_
| NPCs in some game rather than just the main quest NPCs.
| (And that's true even in the absence of LLMs to generate
| the flavour text for the NPCs so they're not just repeating
| "...but then I took an arrow to the knee" as generic
| greeting #7 like AAA games from 2011).
|
| Big studios _may also_ use this for NPCs to the economic
| detriment of current voice actors, but I suspect this will
| be a tech which leads to "induced demand"[0] -- though
| note that this can also turn out _very badly_ and isn't
| always a good thing either:
| https://en.wikipedia.org/wiki/Cotton_gin
|
| [0] https://en.wikipedia.org/wiki/Induced_demand
| allannienhuis wrote:
| I can think that better quality audio content generated from
| text would be a killer application. As someone else mentioned,
| pipe in an epub, output an audiobook or video game content.
| With additional tooling (likely via ai/llm analysis), this
| could enable things like dramatic storytelling with specific
| character voices and dynamics interpreted from the content of
| the text.
|
| I can see it empowering solo creators in similar ways that
| modern music tools enable solo or small-budget musicians today.
| latexr wrote:
| > pipe in an epub, output an audiobook or video game content.
|
| That falls into "replacing voice actors", mentioned by the
| OP.
| blackqueeriroh wrote:
| No, it really doesn't. There are thousands of very smart
| and talented creators without the budget to hire voice
| actors. This lets them get a start. AI voices let you lower
| the barrier to entry, but they won't replace most voice
| actors because the higher you go up the stack, the more the
| demand for real actors will also go up because AI voices
| aren't anywhere near being able to replace real voice
| actors.
| tonnydourado wrote:
| As another reply put it, I'm very skeptical that the
| benefits for small content creators will offset the
| damage to society as a whole, from increased fraud and
| harassment.
| latexr wrote:
| > AI voices let you lower the barrier to entry, but they
| won't replace most voice actors because the higher you go
| up the stack, the more the demand for real actors will
| also go up
|
| That is as absurd as saying LLMs are increasing the
| demand for writers.
|
| > because AI voices aren't anywhere near being able to
| replace real voice actors.
|
| Even if that were true--which it is not; the current crop
| is more than adequate to read long texts--it assumes the
| technology has reached its limit, which is equally
| absurd.
| albert_e wrote:
| What if I want to listen to my notes in my own voice
|
| Or my favorite books in my own voice.
|
| Or my lecture notes in my professor's voice.
| devinprater wrote:
| Or, when it gets fast enough, someone could have their own
| personal dub of video games (BlazBlue Central Fiction) or TV
| shows and such.
| mostrepublican wrote:
| I used it to translate a short set of tv shows that were only
| available in Danish with no subtitles in any other language and
| made them into English for my personal watching library.
|
| The episodes are about 95% just a narrator with some background
| noises.
|
| Elevenlabs did a great job with it and I cranked through the 32
| episodes (about 4 mins each) relatively easily.
|
| There is a longer series (about 60 hours) only in Japanese that
| I want to do the same thing for. But don't want to spend
| Elevenlabs prices to do.
| ukuina wrote:
| OpenAI TTS is very competitively priced: $15/1M chars.
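|
| At that rate, cost scales linearly with transcript length. A rough
| sketch: the $15/1M figure is the one quoted above, while the
| `tts_cost_usd` helper and the character counts are hypothetical
| (the actual length of a 60-hour series is not known here).

```python
PRICE_USD_PER_MILLION_CHARS = 15.0  # rate quoted in the comment above


def tts_cost_usd(num_chars, price_per_million=PRICE_USD_PER_MILLION_CHARS):
    """Estimated cost in USD of synthesizing num_chars characters."""
    return num_chars / 1_000_000 * price_per_million


# Hypothetical transcript sizes, for illustration only.
for chars in (100_000, 1_000_000, 2_500_000):
    print(f"{chars:>9,} chars -> ${tts_cost_usd(chars):.2f}")
```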
| kajecounterhack wrote:
| I like the idea of cloning my own voice and having it speak in
| a foreign language
| SunlitCat wrote:
| Maybe having better real time conversations in computer games.
| Like game characters saying your name in voiceovers.
| AnonC wrote:
| > what are the non-questionable, or at least non-evil, uses of
| this technology?
|
| iPhone Personal Voice [1] is one. It helps people who are
| physically losing their voice and the ones around them to still
| have their voice in a different way. Apple takes long voice
| samples of various texts for this though.
|
| [1]: https://www.youtube.com/watch?v=ra9I0HScTDw
| tonnydourado wrote:
| That's kinda what I was thinking on the second paragraph.
| Still, gotta be a small market.
| IMTDb wrote:
| Non robotic screen readers for blind people
| tonnydourado wrote:
| That would be non-evil, sure. But I wonder if blind people
| even want it? They're already listening to screen readers at
| insane speeds, up to 6-8x, I think. Do they even care that it
| doesn't sound "realistic"?
| blackqueeriroh wrote:
| Well, I'm sure the blind readers of HN (which I am certain
| exist) can answer this question, and you, a sighted person,
| don't need to even wonder from your position of unknowing.
| tonnydourado wrote:
| I mean, I explicitly used "wonder" because I don't wanna
| assume about blind people's experiences and needs. What
| else should I have done so you wouldn't come in kicking
| me in the nuts?
| SamPatt wrote:
| In this thread there's a bunch of "non-evil" responses,
| and your replies are all "I'm skeptical" or just
| dismissing them outright.
|
| It appears from the outside that you've decided this is
| Officially Bad technology and aren't genuinely seeking
| evidence otherwise.
| tonnydourado wrote:
| You're assuming worse of me than I'm assuming of the
| technology.
|
| There's almost no reply here with a use that is a) not
| somewhat bad and b) has enough of an upside to compensate
| the downsides.
|
| Except maybe this one, but I do know enough about
| accessibility to know how blind people generally use
| computers, which is why I asked the question.
| Mkengine wrote:
| I don't know how stressful my life will be then, but I thought
| about reading to my kids later and creating audiobooks with my
| voice for them, for when I am traveling for work, so they can
| still listen to me "reading" to them.
| bigcoke wrote:
| AI girlfriend... ok I'm done.
| lenerdenator wrote:
| It's 2024. Are nerds still trying to turn any technology of
| sufficient ability into Kelly LeBrock?
| bigcoke wrote:
| this is going to be a real thing for gen z, but replace
| kelly with any girl from anime
| lenerdenator wrote:
| Jeeze, I can't imagine why women feel so alienated from
| the tech industry.
|
| It's almost as if any time some sort of way to make
| computers more human-like emerges, the first thing a
| subset of the men in the space do is think "How can I use
| this to make a woman who has _absolutely_ no function
| other than my emotional, practical, and physical
| gratification?"
| amenhotep wrote:
| Humans in desiring deep emotional and sexual connections
| with people of their desired gender and being driven to
| weird behaviours when they can't achieve it in the way
| you personally approve of shock
| lenerdenator wrote:
| Then work on it. Ask friends for feedback. Go to therapy.
| Have some damned introspection instead of just reducing
| 51% of the people on the planet to a bangmaid.
| lenerdenator wrote:
| > there's no way that this is a market that justifies the
| investment.
|
| It's not just worth justifying investment. You can make just
| about anything worth the investment as measured by a 90-day
| window of fiscal reporting. Hitmen were a wildly profitable
| venture for La Cosa Nostra.
|
| It's about not justifying the societal _risk_.
| YoshiRulz wrote:
| It could be used by people who can write English fluently, but
| are slow at speaking it, as a more personal form of text-to-
| speech.
|
| Personally, I'm eager to have more control over how my voice
| assistant sounds.
| Zambyte wrote:
| Similarly, a real-time voice to voice translation system that
| uses the speaker's voice would be really cool.
| layer8 wrote:
| It enables you to use your favorite audiobook reader's voice for
| all your TTS needs. E.g. you can have HN comments read to you
| by Patrick Stewart, or by the Honest Trailers voice. Maybe you
| find that questionable? ;)
| zdragnar wrote:
| So, replacing voice actors with unpaid clones of their
| voices, effectively stealing their identity.
|
| The range of use goes from totally harmless fun to downright
| evil.
| RyanCavanaugh wrote:
| The existence of Photoshop doesn't mean that you can put
| Kobe Bryant on a Wheaties box without paying him. There's
| no reason that a voice talent's voice can't be subject to
| the same infringement protections as a screen actor's or
| athlete's likeness.
| popalchemist wrote:
| You absolutely can put Kobe on a Wheaties box without
| problems legally, IF you do not sell it. That's "fair
| use." It has not been tested in court yet, but precedent
| seems to suggest that creating voice clones for private
| use is also fair use, ESPECIALLY if that person is a
| celebrity, because privacy rights are limited for
| celebrities.
| layer8 wrote:
| If I take pictures of someone and hang my home with AI-
| generated copies of those pictures, I'm not stealing their
| identity.
| johncalvinyoung wrote:
| Utterly questionable.
| wongarsu wrote:
| Organized crime should be happy to invest in that. Especially
| the "Indian scam callcenter" type of crime.
| tompetry wrote:
| I have the same concerns generally. But one non-evil use popped
| into my head...
|
| My dad passed away a few months ago. Going through his things,
| I found all of his old papers and writings; they have great
| meaning to me. It would be so cool to have them as audio files,
| my dad as the narrator. And for shits, try it with a British
| accent.
|
| This may not abate the concerns, but I'm sure good things will
| come too.
| block_dagger wrote:
| Serious question: is this a healthy way to treat ancestors?
| In the future will we just keep grandma around as an AI
| version of her middle aged self when she passes?
| tompetry wrote:
| Fair question. People have kept pictures, paintings, art,
| belongings, etc of their family members for countless
| generations. AI will surely be used to create new ways to
| remember loved ones. I think that is very different from
| "keeping around grandma as an AI version of herself", and
| pretending they are still alive, which I agree feels
| unhealthy.
| Narishma wrote:
| There's a Black Mirror episode about something like that,
| though I don't remember the details.
| oli-g wrote:
| Yup, "Be Right Back", S2E1
|
| And possibly another one, but that would be a spoiler
| GTP wrote:
| I remember a journalist actually doing it, but just the
| AI part of course, not the robot.
| gremlinsinc wrote:
| it worked for Superman, he seemed well adjusted after
| talking to his dead parents.
| mynameisash wrote:
| I think everyone's entitled to their opinion here. As for
| me, though: my brother died at 10 years old (back in the
| 90s). While there are some home videos with him talking,
| it's never for more than a few seconds at a time.
|
| Maybe a decade ago, I came across a cassette tape that he
| had used to record himself reading from a book for school -
| several minutes in duration.
|
| It was incredibly surprising to me how much he sounded like
| my older brother. It was a very emotional experience, but
| personally, I can't imagine using that recording to
| bootstrap a model whereby I could produce more of his
| "voice".
| hypertexthero wrote:
| Not sure if this is related to this tech, but I think it is
| worthwhile: The Beatles - Now And Then - The Last Beatles
| Song (Short Film)
|
| https://www.youtube.com/watch?v=APJAQoSCwuA
| bdcravens wrote:
| The first couple I've come up with are training courses at
| scale, or converting videos with accents you have a hard time
| understanding to one you can (no one you'll understand better
| than yourself)
| accrual wrote:
| A long term goal of mine is to have a local LLM trained on my
| preferences and with a very long memory of past conversations
| that I could chat with in real time using TTS. It would be
| amazing to go on a walk with Airpods and chat with it, ask
| questions, learn about topics, etc.
| willsmith72 wrote:
| I do that already with the chatgpt mobile app, but not with
| my own voice.
|
| I'd like it if there were more (and non-american) voice
| options, but I don't think I'd ever want it to be my voice
| I'm hearing back.
| accrual wrote:
| Yeah, I wouldn't necessarily want it to be my own voice
| either, but it would be very cool to make it be the voice
| of someone I enjoy listening to. :)
| victorbjorklund wrote:
| Why is replacing voice actors evil? How is it worse than
| replacing any other job using a machine/software?
| buu700 wrote:
| Agreed. I think the framing of "stealing" is a needlessly
| pessimistic prediction of how it might be used. If a person
| owns their own likeness, it would be logical to implement
| legal protections for AI impersonations of one's voice. I
| could imagine a popular voice actor scaling up their career
| by using AI for a first draft rendering of their part of a
| script and then selectively refining particular lines with
| more detailed prompts and/or recording them manually.
|
| This raises a lot of complicated issues and questions, but
| the use case isn't inherently bad.
| machomaster wrote:
| The problem is not about replacing actors with technology. It
| is about replacing the particular actors with their computer-
| generated voice. It's about likeness-theft.
| spyder wrote:
| Huh? Replacing human labor with machines is evil? You wouldn't
| even be able to post this comment without that happening, because
| computers wouldn't exist or we wouldn't have time for that
| because many of us would work on farms to produce enough food
| without the use of human-replacing technologies.
|
| In a similar way as machines allowed us to produce an abundance of
| food with less labor, voice AI combined with AI translation
| can make information more accessible for the world. Voice actors
| wouldn't be able to voice act all the useful information in the
| world (especially for the more niche topics and for the
| smaller languages) because it wouldn't be worth paying them, and
| humans are also slower than machines. We are not far from
| almost realtime voice translation from any language to any
| other one. Sure, we can do it with text-only translation, but
| voice makes it more accessible for a lot of people. (For example,
| between 5-10% of the world has dyslexia.)
| albert_e wrote:
| If I am learning new content I can make my own notes and
| convert them into an audiobook for my morning jog or office
| commute using my own voice.
|
| If I am a content creator I can generate content more easily by
| letting my AI voice narrate my slides, say. Yes that is cheap
| and lower quality than a real narrator who can deliver more
| effective real talks ...but there is a long tail of mediocre
| content on every topic. Who cares as long as I am having fun,
| sharing stuff, and not doing anything illegal or wrong.
| pksebben wrote:
| There's a huge gap in uses where listenable, realistic voice is
| required, but the text to be spoken is not predetermined. Think
| AI agents, NPCs in dynamically generated games, etc. These
| things are currently not really doable with the current crop of
| TTS because either they take too long to run or they sound
| awful.
|
| I think the bulk of where this stuff will be useful isn't
| really visible yet b/c we haven't had the tech to play around
| with enough.
|
| There is also certainly a huge swath of bad-actor stuff that
| this is good for. I feel like a lot of the problems with modern
| tech fall under the umbrella of "We're not collectively mature
| enough to handle this much power" and I wish there were a
| better solution for all of that.
| gremlinsinc wrote:
| eh, you mean the solution isn't, so here's even more power...
| see you next week!
| paczki wrote:
| The ability to use my own voice in other languages so I can do
| localization on my own youtube videos would be huge.
|
| With game development as well, being able to be my own voice
| actor would save me an immense amount of money that I do not
| have and give me even more creative freedom and direction of
| exactly what I want.
|
| It's not ready yet, but I do believe that it will come.
| Capricorn2481 wrote:
| People are already doing this and it was hugely controversial
| in The Finals
| dougmwne wrote:
| It seems like it would be great for any kind of voiceover work
| or any recorded training or presentation. If you want to
| correct a mis-speak or add some information, instead of re-
| recording the entire segment, you could seamlessly update a few
| words or sentences.
| thatguysaguy wrote:
| I'm 100% going to clone my voice and use it on my discord bot.
| andrewmcwatters wrote:
| I want to preserve samples of my voice as I age so that when
| voice replication technology improves in the future, I can hear
| myself from a different time of my life in ways that are not
| prerecorded.
|
| I would also like to give my children this as a novelty of
| preserved family history so if I so desire, I can have fun with
| them by letting them hear me from different ages.
| thatguysaguy wrote:
| To think of non-evil versions just consider cases where right
| now there's no voice actor to replace, but you could add a
| voice. E.g. indie games.
| drusepth wrote:
| Super-niche use-case: our game studio prototyped a multiplayer
| horror game where we played with cloning player voices to be
| able to secretly relay messages to certain players as if it
| came from one of their team-mates (e.g. "go check out below
| deck" to split a pair of players up, or "I think Bob is trying
| to sabotage us" to sow inter-player distrust, etc).
|
| Less-niche use-case: if you use TTS for voice-overs and/or NPC
| dialogue, there can still be a lot of variance in speech
| patterns / tone / inflections / etc when using a model where
| you've just customized parameters for each NPC -- using a
| voice-clone approach, upon first tests, seems like it might
| provide more long-term consistency.
|
| Bonus: in a lot of voiced-over (non-J)RPGs, the main character
| is text-only (intentionally not voiced) because they're often
| intended to be a self-insert of the player (compared to JRPGs
| which typically have the player "embody" a more fleshed-out
| character with their own voice). If you really want to lean into
| self-insert patterns, you could have a player provide a short
| sample of their voice at the beginning of the game and use that
| for generating voice-overs for their player character's
| dialogue throughout the game.
| Terr_ wrote:
| The idea of a personalized protagonist voice is interesting,
| but I'd worry about some kind of uncanny valley where it
| sounds like myself but is using the wrong word-choices or
| inflections.
|
| Actually, getting it to sound "like myself" in the first
| place is an extra challenge! For many people even actual
| recordings sound "wrong", probably because your self-
| perception involves spoken sound being transmitted through
| your neck and head, with a different blend of frequencies.
|
| After that is solved, there's still the problem of bystanders
| remarking: "Is that supposed to sound like you? It doesn't
| sound like you."
| sunshine_reggae wrote:
| You forgot plausible deniability, AKA "I never said that".
| starwin1159 wrote:
| Cantonese can't be imitated
| paulryanrogers wrote:
| Why?
| Zambyte wrote:
| How do people learn it?
| trollied wrote:
| Look up iPhone "personal voice". People don't seem to know about
| it.
| burcs wrote:
| There's a "vocal fry" aspect to all of these voice cloning tools,
| a sort of uncanny valley where they can't match tones correctly,
| or get fully away from this subtle Microsoft Sam-esque
| breathiness to their voice. I don't know how else to describe it.
| blackqueeriroh wrote:
| Yeah, this is why I'm nowhere near worried about this replacing
| voice actors for the vast majority of work they currently get
| paid for.
| Fripplebubby wrote:
| Really interesting! Reading the paper, it sounds like the core of
| it is broken into two things:
|
| 1. Encoding speech sounds into an IPA-like representation,
| decoding IPA-like into target language
|
| 2. Extracting "tone color", removing it from the IPA-like
| representation, then adding it back in into the target layer
| (emotion, accent, rhythm, pauses, intonation)
|
| So as a result, I am a native English speaker, but I could hear
| "my" voice speaking Chinese with similar tone color to my own! I
| wonder, if I recorded it, and then did learn to speak Chinese
| fluently, how similar it would be? I also wonder whether there is
| some kind of "tone color translator" that is needed to translate
| the tone color markers of American English into the relevant ones
| for other languages, how does that work? Or is that already
| learned as part of the model?
| Havoc wrote:
| Tried it locally - can't get anywhere near the clone quality of
| the clips on their site.
|
| Not even close. Perhaps I'm doing something wrong...
| pantsforbirds wrote:
| I wonder if in < 5 years I can make a game with a local LLM + AI
| TTS to create realistic NPCs. With enough of these tools I think
| you could make a very cool world-sim type game.
| rcarmo wrote:
| I'm much more interested in the dismal possibility of using
| this in politics. Nation state actors, too.
| treprinum wrote:
| Did this just obliterate ElevenLabs?
| htrp wrote:
| Eleven's advantage is being able to have consistent outputs
| through high quality training data.
| akashkahlon wrote:
| So every novel is soon a movie by the author themselves, using Sora
| and with audio bought from all the suitable actors
| rcarmo wrote:
| This can't really do a convincing Sean Connery yet.
| _zoltan_ wrote:
| just buy more NVDA. :-)
| Multicomp wrote:
| I hope so, then those of us who want to tell a story (writers,
| whether comic or novelist or short story or screenplay or
| teleplay or whatever) will be able to compete more and more on
| quality and richness of the story copy and content to the
| audience, not with the current comparative advantage of media
| choices being made for most storytellers based on difficulty to
| render.
|
| Words on page are easier than still photos, which are easier
| than animation, which are easier than live-action TV, which are
| easier than IMAX movies etc.
|
| If we move all of the rendering of the media into automation,
| then it's just who can come up with the best story content, and
| you can render it whatever way you like: book, audiobook,
| animation, live action TV, web series, movie, miniseries,
| whatever you like.
|
| Granted - the AI will come for us writers too, it already is in
| some cases. Then the Creator Economy itself will be consumed
| with eventually becoming 'who can meme the fastest' on an
| industrial scale for daily events on the one end, and who has
| taken the time to paint / playact / do rendering out in the
| real world.
|
| But I sure would love to be able to make a movie out of my
| unpublished novel, and realistically today, that's impossible
| in my lifetime. Do I want the entire movie-making industry to
| die so I and others like me can have that power? No. But if the
| industry is going to die / change drastically anyways due to
| forces beyond my control, does that mean I'm not going to take
| advantage of the ability? Still no.
|
| IDK. I don't have all the answers to this.
|
| But yes, this (amazingly accurate voice cloner after a tiny
| clip?! wow) product is another step towards that brave new
| world.
| thorum wrote:
| OpenVoice currently ranks second-to-last in the Huggingface TTS
| arena leaderboard, well below alternatives like styletts2 and
| xtts2:
|
| https://huggingface.co/spaces/TTS-AGI/TTS-Arena
|
| (Click the leaderboard tab at the top to see rankings)
| carbocation wrote:
| I would like to see the new VoiceCraft model on that list
| eventually (weights released yesterday, discussion at [1]).
|
| 1 = https://news.ycombinator.com/item?id=39865340
| KennyBlanken wrote:
| Having gone through almost ten rounds of the TTS Arena, XTTS2
| has tons of artifacts that instantly make it sound non-human.
| OpenVoice doesn't.
|
| It wouldn't surprise me if people recognize different
| algorithms and purposefully promote them over others, or alter
| the page source with a userscript to see the algorithm before
| listening and click the one they're trying to promote. Looking
| at the leaderboard, it's obvious there's manipulation going on,
| because Metavoice is highly ranked but generates absolutely
| terrible speech with extremely unnatural pauses.
|
| Elevenlabs was scarily natural sounding and high quality; the
| best of the ones I listened to so far. Pheme's speech overall
| sounds really natural, but has terrible sound quality, which is
| probably why it's ranked so well. If Pheme could be higher
| quality audio, it'd probably match Elevenlabs.
| Jackson__ wrote:
| As someone who has used the arena maybe ~3 times, the subpar
| voice quality in the demo linked immediately stood out to me.
| ckl1810 wrote:
| Is there a benchmark for compute needed? Curious to see if
| anyone is building / has built a Zoom filter, or Mobile app,
| whereby I can speak English, and out comes Chinese to the
| listener.
| c0brac0bra wrote:
| I'd like to see Deepgram Aura on here.
| lacoolj wrote:
| That season of 24 is coming true
| yogorenapan wrote:
| Note: the open source version is watered down compared to their
| commercial offering. Tried both out and the quality doesn't come
| close to
| ckl1810 wrote:
| OpenAI volleys back:
|
| https://twitter.com/OpenAI/status/1773760852153299024
| speedbird wrote:
| Not convinced. The second reference has a slight Indian accent
| that isn't carried over into the generated samples.
|
| Training data bias?
| opdahl wrote:
| What are you talking about? I am not noticing it at all.
| chenxi9649 wrote:
| I am the most impressed by the cross-lingual voice cloning...
|
| https://research.myshell.ai/open-voice/zero-shot-cross-lingu... I
| can only speak on their Dutch -> Chinese voice cloning but it's
| better than anything else I've tried. There is basically no
| "English/Dutch accent" in the Chinese at all. Whereas the
| ElevenLabs Chinese voice (cloning or not) is so much worse...
___________________________________________________________________
(page generated 2024-03-29 23:01 UTC)