[HN Gopher] OpenVoice: Versatile instant voice cloning
       ___________________________________________________________________
        
       OpenVoice: Versatile instant voice cloning
        
       Author : ulrischa
       Score  : 390 points
       Date   : 2024-03-29 07:50 UTC (15 hours ago)
        
 (HTM) web link (research.myshell.ai)
 (TXT) w3m dump (research.myshell.ai)
        
       | andrewstuart wrote:
       | If someone can come up with a voice cloning product that I can
       | run on my own computer not the cloud, and if it's super simple to
       | install and use, then I'll pay.
       | 
       | I find it hard to understand why so much money is going into ai
       | and so many startups are building ai stuff and such a product
       | does not exist.
       | 
       | It's got to run locally because I'm not interested in the
       | restrictions that cloud voice cloning services impose.
       | 
       | Complete, consumer level local voice cloning = payment.
        
         | dsign wrote:
         | I couldn't agree more.
         | 
         | I've tried some of these ".ai" websites that do voice-cloning,
         | and they tend to use the following dark strategy:
         | 
         | - Demand you create a cloud account before trying.
         | 
         | - Sometimes, demand you put your credit card before trying.
         | 
         | - Always: the product is crap. Sometimes it does voice-cloning
         | sort of as advertised, but you have to wait for the training
         | and the execution in queue, because cloud GPUs are expensive
         | and they need to manage a queue because it's a _cloud_
         | product. At least that part could be avoided if they shipped a
         | VST plugin one could run locally, even if it's restricted to
         | NVidia GPUs[^2].
         | 
         | [^1]: To those who say "but the devs must get paid": yes. But
         | subscriptions misalign incentives, and some updates are
         | simply not worth the minutes they cause in productivity lost
         | while waiting for their shoehorned installation.
         | 
         | [^2]: Musicians and creative types are used to spending a lot on
         | hardware and software, and there are inference GPUs which are
         | cheaper than some sample libraries.
        
           | andrewstuart wrote:
           | I don't mind if the software is a subscription; it just has to
           | be installable and not spyware garbage.
           | 
           | Professional consumer level software like a game or
           | productivity app or something.
        
           | riwsky wrote:
           | How do you figure subscriptions misalign incentives? The
           | alternative, of selling upgrades, incentivizes devs to focus
           | on new shiny shit that teases well. I'd rather they
           | focus on making something I get value out of consistently.
        
             | dsign wrote:
             | - A one-off payment makes life infinitely simpler for
             | accounting purposes. In my jurisdiction, a software license
             | owned by the business is an asset and shows as that in the
             | balance sheet, and can be subject to a depreciation
             | schedule just as any other asset.
             | 
             | - Mental peace: if product X does what I need right now and
             | I can count that I will be able to use product X five years
             | from now to do the same thing, then I'm happy to pay a lump
             | sum that I see as an investment. Even better, I feel
             | confident that I can integrate product X in my workflows. I
             | don't get that with a subscription product in the hands of
             | a startup seeking product-market fit.
        
           | andoando wrote:
           | I made a voice cloning site. https://voiceshift.ai No login,
           | nothing required. It's a bit limited but I can add any of the
           | RVC models. Working on a feature to just upload your own
           | model.
           | 
           | I can definitely make it a local app.
        
         | smusamashah wrote:
         | But this one is supposed to be runnable locally. It has
         | complete instructions on GitHub, including downloading models
         | locally, installing Python, setting it up, and running it.
        
           | andrewstuart wrote:
           | I'm wanting to download an installer and run it - consumer
           | level software.
        
         | ddtaylor wrote:
         | I can show you how to use Bark AI to do voice cloning.
        
           | rexreed wrote:
           | What local hardware is needed to run Bark AI? What is the
           | quality? Looking for something as good or better than Eleven
           | Labs.
        
             | ddtaylor wrote:
             | It can run on CPU without much issue, takes up a few gigs of
             | RAM, and will produce audio at about realtime. If you GPU
             | accelerate you only need about 8GB of video memory and it
             | will be at least 5X faster.
             | 
             | Out of the box it's not as good as Eleven Labs based on
             | their demos, but those are likely cherry picked. There are
             | some tunable parameters for the Bark model and most
             | consider the output high enough quality to pass into
             | something else that can do denoising.
        
           | mdrzn wrote:
           | Please do!
        
         | ipsum2 wrote:
         | How much would you pay? I can make it.
        
           | andrewstuart wrote:
           | You can't sell this cause the license doesn't allow it.
        
             | ddtaylor wrote:
             | Bark is MIT licensed for commercial use.
        
             | ipsum2 wrote:
             | Not using this model, but something similar. How much would
             | you pay?
        
               | ipsum2 wrote:
               | Based on the lack of replies, the answer appears to be
               | $0.
        
             | pmontra wrote:
             | "This repository is licensed under a Creative Commons
             | Attribution-NonCommercial 4.0 International License, which
             | prohibits commercial usage"
             | 
             | People could pay somebody for the service of setting up the
             | model on their own hardware, then use the model for non
             | commercial usage.
        
               | GTP wrote:
               | IANAL, but this looks like a grey area to me: it could be
               | argued that the person/company getting paid to do the
               | setup is using the model commercially.
        
             | GTP wrote:
             | Doesn't allow it _yet_ , but on the readme, they write
             | "This will be changed to a license that allows Free
             | Commercial usage in the near future". So someone will soon
             | be able to sell it to you.
        
         | washadjeffmad wrote:
         | I mean this at large, but I just can't get over this "sell me a
         | product" mentality.
         | 
         | You already don't need to pay; all of this is happening, from
         | publication to implementation, open and local. Hop on Discord
         | and ask a friendly neon-haired teen to set up TorToiSe or xTTS
         | with cloning for you.
         | 
         | Software developers and startups didn't create AGI, a whole lot
         | of scientists did. A majority of the services you're seeing are
         | just repackaging and serving foundational work using tools
         | already available to everyone.
        
           | TuringTest wrote:
           | I agree, but playing devil's advocate, it's true that people
           | without the time and expertise to set up their own install can
           | find this packaging valuable enough to pay for it.
           | 
           | It would be better for all if, in Open Source fashion, this
           | software had a FLOSS easy-to-install packaging that provided
           | for basic use cases, and developers made money by adapting it
           | to more specific use cases and toolchains.
           | 
           | (This one is not FLOSS in the classic sense, of course. The
           | above would be valid for MIT-licensed or GPL models).
        
           | nprateem wrote:
           | You can extend that reasoning to anything, but time and
           | energy are limited
        
           | lancesells wrote:
           | The answer is convenience. Why use dropbox when you can run
           | Nextcloud? You can say the same thing about large companies.
           | Why does Apple use Slack (or whatever they use) when they
           | could build their own? Why doesn't Stripe build their own
           | data centers?
           | 
           | If I had a need for an AI voice for a project I would pay the
           | $9 a month, use it, and be done. I might have the skills to
           | set this up on my machine but it would take me hours to get
           | up to speed and get it going. It just wouldn't be worth it.
        
         | endisneigh wrote:
         | I see these types of comments all the time, but the fact is folks
         | at large who wouldn't use the cloud version won't pay. The kind
         | of person who has a 4090 to run these sort of models would just
         | figure out how to do it themselves.
         | 
         | The other issue is that paying for the software once doesn't
         | capture as much of the value as a pay per use model, thus if
         | you wanted to sell the software you'd either have to say you
         | can only use it for personal use, or make it incredibly
         | expensive to account for the fact that a competitor would just
         | use it.
         | 
         | Suppose there were such a thing - then folks may complain that
         | it's not open source. Then it's open sourced, but then there's
         | no need to pay.
         | 
         | In any case, if you're willing to pay $1000 I'm sure many of us
         | can whip something up for you. Single executable.
        
         | palmfacehn wrote:
         | XTTS2 works well locally. Maybe someone else here can recommend
         | a front end.
        
         | rifur13 wrote:
         | Wow perfect timing. I'm working on a sub-realtime TTS (only on
         | Apple M-series silicon). Quality should be on-par or better
         | than XTTS2. Definitely shoot me a message if you're interested.
        
         | jeroenhd wrote:
         | RVC does live voice changing with a little latency:
         | https://github.com/RVC-Project/Retrieval-based-Voice-Convers...
         | 
         | The product isn't exactly spectacular, but most of the work
         | seems to have been done. Just needs someone to go over the UI
         | and make it less unstable, really.
        
       | bsenftner wrote:
       | Is there a service here somewhere? The video mentions lower
       | expense, but I can't find any such service sign up... (ah, the
       | usage info is all on GitHub)
       | 
       | Has anyone tried self hosting this software?
        
       | smusamashah wrote:
       | The quality here is good (very good if I can actually run it
       | locally). As per github it looks like we can run it locally.
       | 
       | https://github.com/myshell-ai/OpenVoice/blob/main/docs/USAGE...
        
         | 486sx33 wrote:
         | Still a bit robotic but better highs and lows for sure. The
         | Catalog is huge! Thanks for posting
        
       | paraschopra wrote:
       | yay!
        
       | nonrandomstring wrote:
       | I also lost my voice in a bizarre fire breathing accident and
       | urgently need to log into my telephone banking account.
       | 
       | Can anyone here give me a short list of relatively common ethical
       | use cases for this technology? I'm trying to refute the
       | ridiculous accusation that only deceptive, criminal minded people
       | would have any use for this. Thanks.
       | 
       | ----
       | 
       | Edit: Thanks for the many good-faith replies. So far I see these
       | breaking down into:
       | 
       | Actual disability mitigation (of course I was joking about the
       | circus accident). These are rare but valid. Who wouldn't want
       | their _own_ voice restored?
       | 
       | Entertainment and games
       | 
       | Education and translation
       | 
       | No further comment on the ethics here, but FWIW I'm nervously
       | looking at this having just written a study module on social
       | engineering attacks, stalking, harassment and confidence tricks.
       | :/
       | 
       | And yes, as bare tech, it _is_ very cool!
        
         | BriggyDwiggs42 wrote:
         | Idk but it's kinda cool
        
         | RobotToaster wrote:
         | Could be combined with translation to automatically create dubs
         | for videos/tv/etc.
        
         | Larrikin wrote:
         | Home Assistant is making huge progress in creating an open
         | source version of Alexa, Siri, etc. You can train it to use
         | your voice, but the obvious use is celebrity voices for your
         | home. Alexa had them, then took them away, and refused to
         | refund people.
        
           | diggan wrote:
           | > but the obvious use is celebrity voices for your home
           | 
           | Besides the fact that it seems more like an "entertainment" use
           | case rather than "functional", is it really ethical to use
           | someone's voice without asking/having rights to use it?
           | 
           | Small concern, granted, but parent seems to have specifically
           | asked for ethical use cases.
        
             | ChrisMarshallNY wrote:
             | I believe that a number of celebrities (I think Tom Hanks
             | is one) have already sued companies for using deepfakes of
             | their voices. Of course, the next year (in the US) is gonna
             | see a _lot_ of stuff generated by AI.
        
         | corobo wrote:
         | I imagine Stephen Hawking would have found a use for this had
         | it been available before everyone got used to his computer
         | speaking voice. Anything that may cause someone to lose their
         | ability to speak along the lines of your example.
         | 
         | Another might be for placeholdering - you could use an array of
         | (licensed and used appropriately) voices to pitch a TV show,
         | film, radio show, podcast, etc to give people a decent idea of
         | how it would sound to get financing and hire actual people to
         | make the real version. Ofc you'll need an answer to "why don't
         | we just use these AI voices in the actual production?" from
         | people trying to save a few quid.
         | 
         | Simple one- for fun. I'm considering AI cloning my voice and
         | tinkering around until I find something useful to do with it.
         | Maybe in my will I'll open source my vocal likeness as long as
         | it's only to be used commercially as the voice of a spaceship's
         | main computer or something. I'll be a sentence or two factoid
         | in some Wikipedia article 300 years from now, haha.
         | 
         | Universal translator - if an AI can replicate my voice it could
         | have me speak all sorts of languages in real-time.. sucks to be
         | a human translator admittedly in this use case. Once the tech
         | is fully ironed out and reliable we could potentially even get
         | rid of "official" languages (eg you have to speak fluent
         | English to be an airline pilot - heck of a learning curve on
         | top of learning to be a pilot if you're from a country that
         | doesn't teach English by default!)
         | 
         | I dunno if it'd be a weird uncanny valley thing, I wonder how
         | an audiobook would sound reading a book in my own voice -
         | unless I'm fully immersed in fiction that's generally how I
         | take in a book, subvocalising with my voice in my head - maybe
         | it'd help things bed in a bit better if it's my own voice
         | reading it to me? If so I wonder how fast I could have AI-me
         | read and still be able to take in the content with decent
         | recall.. Might have to test this one!
         | 
         | Splintering off the audiobook idea - I wonder if you could help
         | people untrain issues with their speaking in this manner? Like
         | would hearing a non-stuttering version of their voice help
         | someone with a stutter? I am purely in the land of hypothesis
         | at this stage, but might be worth trying! Even if it doesn't
         | help in that way, the person with a stutter would at least have
         | a fallback voice if they're having a bad day of it :)
         | 
         | E: ooh, having an AI voice and pitch shifting it may help in
         | training your voice to sound different, as you'd have something
         | to aim for - "I knew I could do it because I heard it being
         | done" sort of theory. The first example that popped into my
         | head was someone transitioning between genders and wanting to
         | adjust their voice to match the change.
         | 
         | I imagine there's other fields where this may be useful too -
         | like if you wanted a BBC news job and need to soften out your
         | accent (if they still require Received Pronunciation, idk)
         | 
         | Admittedly I could probably come up with more abuse cases than
         | use cases if I put my mind to it, but wanted to stick to the
         | assignment :)
        
           | mywacaday wrote:
           | Charlie Bird, a very well known Irish journalist and
           | broadcaster who recently passed away from motor neuron
           | disease, went through the process of getting a digitized
           | version of his voice done as part of a TV program as he was
           | losing his voice rapidly at the time. The result was very
           | good as they had a large body of his news reports to train
           | the model on. Most Irish people would be very familiar with
           | his voice and the digitized version was very convincing. I
           | would imagine something like this would be great for people
           | who wouldn't have a huge volume of recordings to work off. A
           | short video by the company that provided the tablet with his
           | voice is here https://www.youtube.com/watch?v=UGjJHVUyi0M
        
         | CapsAdmin wrote:
         | The practical main use case I can think of is entertainment.
         | Games could use it, either dynamically or prerecorded. Amateur
         | videos could also use it for fun.
         | 
         | Outside of that, more versatile text to speech is generally
         | useful for blind people.
         | 
         | More emotional and non-robotic narration of text can also be
         | useful for non-blind people on the go.
        
           | andrewmcwatters wrote:
           | It would be neat to have your game client locally store a
           | reference sentence on your system and generate voice chat for
           | you at times when you couldn't speak and could only type.
        
         | pmontra wrote:
         | I want to use (almost) my own voice in an English video without
         | my country's accent?
        
         | raudette wrote:
         | For creating games/entertainment/radio drama, allows 1 person
         | to voice act multiple roles
        
         | serbrech wrote:
         | On the fly speech translation but in the voice of the speaker
        
           | 7373737373 wrote:
           | Or a different voice if the voice of the speaker or the way
           | they talk is annoying
        
         | idle_zealot wrote:
         | It's mostly interesting to me for artistic applications, like
         | voicing NPC or video dialog, or maybe as a digital assistant
         | voice. Being able to clone existing voices would be useful for
         | parody or fanworks, but I suspect that it is also possible to
         | mix aspects of multiple voices to synthesize new ones to taste.
        
         | napkin wrote:
         | I'm currently using xtts2 to make language learning more
         | exciting, by training models on speakers I wish to emulate. I'm
         | really into voices, and this has helped tremendously for
         | motivation when learning German.
        
         | laurentlb wrote:
         | I think there are lots of applications for good Text-To-Speech.
         | 
         | Cloning a voice is a way to get lots of new voices to use as
         | TTS.
         | 
         | I'm personally building a website with stories designed for
         | language learners. I'd like to have a variety of realistic
         | voices in many languages.
        
         | freedomben wrote:
         | The reason I am looking for something, is because a friend of
         | mine died of cancer, but left some voice samples, and I want to
         | narrate an audiobook for his kids in his voice.
         | 
         | In general, though, I agree, the legitimate use cases for
         | something like this seem relatively minor compared to the
         | illegitimate use cases. However, the technology is here, and
         | simply depriving everyone of it isn't going to stop the
         | scammers, as has already been evidenced. In my opinion, the
         | best thing for us to do is to rapidly get to a place where
         | everybody knows that you cannot trust the voice on the other
         | end anymore, as it could be cloned. Fortunately, the best way
         | to accomplish that is also the same way that we allow average
         | people to benefit from the technology: make it widely available
        
           | nonrandomstring wrote:
           | > In my opinion, the best thing for us to do is to rapidly
           | get to a place where everybody knows that you cannot trust
           | the voice on the other end anymore,
           | 
           | Strongly agree with this. Sadly I don't think that transition
           | to default distrust of voice will be rapid. We are wired at
           | quite a low level to respond to voice emotionally, which
           | bypasses our rational scepticism and vigilance. That's why
           | this is a rather big win for the tricksters.
        
         | _agt wrote:
         | At my university, we're using this tech to insert minor
         | corrections into lecture recordings (with instructor's consent
         | of course). Far more efficient than bringing them into a studio
         | for a handful of words, also less disruptive to content than
         | overlaid text.
        
         | RyanCavanaugh wrote:
         | I'd really like to make some video content (on-screen graphics
         | + voice), but the thought of doing dozens of voice takes and
         | learning to use editing software is really putting me off from
         | it. I'd really rather just write a transcript, polish it until
         | I'm satisfied with it, and then have the computer make the
         | audio for me.
         | 
         | I'll probably end up just using OpenAI TTS since it's good
         | enough, but if it could be my actual voice, I'd prefer that.
        
       | dougmwne wrote:
       | In related news, Voicecraft published their model weights today.
       | 
       | https://github.com/jasonppy/VoiceCraft
        
       | jasonjmcghee wrote:
       | The quality of the output is really fantastic compared with other
       | open source (next best XTTSv2).
       | 
       | The voice cloning doesn't seem as high quality as other products
       | I've used / seen demos for. Most of the examples match pitch
       | well, but lose the "recognizable" aspect. The Elon one just
       | doesn't sound like Elon, for example- interestingly the
       | Australian accent sounds more like him.
        
       | duggan wrote:
       | With a bit of coaxing I managed to get this running on my M2 mac
       | with Python 3.11.
       | 
       | Updated setup.py (mostly just bumping versions), and demo output:
       | https://gist.github.com/duggan/63b7de9b5f6e8e74fe4b05af64dbe...
        
       | smashah wrote:
       | Terrifying.
        
         | riskable wrote:
         | I know, right!? Soon everything is going to be AI-enabled and
         | our toothbrushes will be singing us Happy Birthday!
        
       | randkyp wrote:
       | This is HN, so I'm surprised that no one in the comments section
       | has run this locally. :)
       | 
       | Following the instructions in their repo (and moving the
       | checkpoints/ and resources/ folder into the "nested" openvoice
       | subfolder), I managed to get the Gradio demo running. Simple
       | enough.
       | 
       | It appears to be quicker than XTTS2 on my machine (RTX 3090), and
       | utilizes approximately 1.5GB of VRAM. The Gradio demo is limited
       | to 200 characters, perhaps for resource usage concerns, but it
       | seems to run at around 8x realtime (8 seconds of speech for about
       | 1 second of processing time.)
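       | 
       | A rough sketch of that arithmetic in Python (the function name
       | is made up for illustration, not something from the OpenVoice
       | repo): the "Nx realtime" figure is seconds of audio divided by
       | seconds of processing.

```python
def realtime_factor(speech_seconds: float, processing_seconds: float) -> float:
    """Seconds of audio produced per second of compute (higher is faster)."""
    if processing_seconds <= 0:
        raise ValueError("processing_seconds must be positive")
    return speech_seconds / processing_seconds


# Figures quoted in this comment: 8 s of speech per ~1 s of processing.
print(realtime_factor(8, 1))
```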
       | 
       | EDIT: patched the Gradio demo for longer text; it's way faster
       | than that. One minute of speech only took ~4 seconds to render.
       | Default voice sample, reading this very comment:
       | https://voca.ro/18JIHDs4vI1v I had to write out acronyms -- XTTS2
       | to "ex tee tee ess two", for example.
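       | 
       | One way to stay under a per-request character limit without
       | patching the demo is to split the text on sentence boundaries
       | and spell out acronyms first. This is only a sketch under
       | assumptions: chunk_text, spell_out and the SPELLED_OUT table are
       | hypothetical helpers, not part of OpenVoice or Gradio, and this
       | is not necessarily how the patch above works.

```python
import re

MAX_CHARS = 200  # the demo's limit as reported in this comment

# Hypothetical spelling table; only the XTTS2 entry comes from the comment.
SPELLED_OUT = {"XTTS2": "ex tee tee ess two"}


def spell_out(text: str) -> str:
    """Replace known acronyms with a pronounceable spelling."""
    for acronym, spoken in SPELLED_OUT.items():
        text = re.sub(rf"\b{re.escape(acronym)}\b", spoken, text)
    return text


def chunk_text(text: str, max_chars: int = MAX_CHARS) -> list[str]:
    """Greedily pack whole sentences into chunks of at most max_chars."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for sentence in sentences:
        # Hard-split any single sentence that is itself too long.
        while len(sentence) > max_chars:
            if current:
                chunks.append(current)
                current = ""
            chunks.append(sentence[:max_chars])
            sentence = sentence[max_chars:]
        candidate = f"{current} {sentence}".strip()
        if len(candidate) <= max_chars:
            current = candidate
        else:
            chunks.append(current)
            current = sentence
    if current:
        chunks.append(current)
    return chunks
```

       | Each chunk would then be synthesized separately and the audio
       | concatenated; run spell_out before chunk_text, since spelling
       | things out makes the text longer.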
       | 
       | The voice clarity is better than XTTS2, too, but the speech can
       | sound a bit stilted and, well, robotic/TTS-esque compared to it.
       | The cloning consistency is definitely a step above XTTS2 in my
       | experience -- XTTS2 would sometimes have random pitch shifts or
       | plosives/babble in the middle of speech.
        
         | bambax wrote:
         | I am trying to run it locally but it doesn't quite work for me.
         | 
         | I was able to run the demos all right, but when trying to use
         | another reference speaker (in demo_part1), the result doesn't
         | sound at all like the source (it's just a random male voice).
         | 
         | I'm also trying to produce French output, using a reference
         | audio file in French for the base speaker, and a text in
         | French. This triggers an error in api.py line 75 that the
         | source language is not accepted.
         | 
         | Indeed, in api.py line 45 the only two source languages allowed
         | are English and Chinese; simply adding French to
         | language_marks in api.py line 43 avoids errors but produces a
         | weird/unintelligible result with a super heavy English accent
         | and pronunciation.
         | 
         | I guess one would need to generate source_se again, and
         | probably mess with config.json and checkpoint.pth as well, but
         | I could not find instructions on how to do this...?
         | 
         | Edit -- tried again on https://app.myshell.ai/ The result
         | sounds French alright, but still nothing like the original
         | reference. It would be absolutely impossible to confuse one
         | with the other, even for someone who didn't know the person
         | very well.
        
           | randkyp wrote:
           | I played with it some more and I have to agree. For actual
           | voice _cloning_, XTTS2 sounds much, much closer to the
           | original speaker. But the resulting output is also much more
           | unpredictable and sometimes downright glitchy compared to
           | OpenVoice. XTTS2 also tries to "act out" the implied
           | emotion/tone/pitch/cadence in the input text, for better or
           | worse.
           | 
           | But my use case is just to have a nice-sounding local TTS
           | engine, and current text-to-phoneme conversion quirks aside,
           | OpenVoice seems promising. It's fast, too.
        
             | echelon wrote:
             | And StyleTTS2 generalizes out of domain even better than
             | that.
        
         | epiccoleman wrote:
         | I have got to build or buy a new computer capable of playing
         | with all this cool shit. I built my last "gaming" PC in 2016,
         | so its hardware isn't really ideal for AI shenanigans, and my
         | Macbook for work is an increasingly crusty 2019 model, so
         | that's out too.
         | 
         | Yeah, I could rent time on a server, but that's not as cool as
         | just having a box in my house that I could use to play with
         | local models. Feels like I'm missing a wave of fun stuff to
         | experiment with, but hardware is expensive!
        
           | beardedwizard wrote:
           | I would love a recommendation for an off the shelf "gpu
           | server" good for most of this that I can run at home.
        
             | lakomen wrote:
             | I'm clueless about AI, but here's a benchmark list
             | https://www.videocardbenchmark.net/high_end_gpus.html
             | 
             | Imo the 4070 Super is the best value and consumes the least
             | power of all the top 10, at 220 W.
             | 
             | So anything with one and some ECC RAM aka AMD should be
             | fine. Intel non-xeons need the expensive w680 boards and
             | very specific RAM per board.
             | 
             | ECC because you wrote server. We're professionals here
             | after all, right?
        
               | antonvs wrote:
               | What if I enjoy gambling with cosmic ray bitflips?
        
               | GTP wrote:
               | Maybe they would make your AI model evolve into an AGI
               | over time :D
        
             | lardo wrote:
             | CivitAI has one https://civitai.com/builds
        
             | macrolime wrote:
             | Mac Studio or macbook pro if you want to run the larger
             | models. Otherwise just a gaming pc with an rtx 4090 or a
             | used rtx 3090 if you want something cheaper. A used dual
             | 3090 can also be a good deal, but that is more in the build
             | it yourself category than off the shelf.
        
               | 101008 wrote:
               | Sorry if this is a silly question - I was never a Mac
               | user, but I quickly googled Mac Studio and it seems it's
               | just the computer. Can I plug it into any monitor / use any
               | keyboard and mouse, or do I need to use everything from
               | Apple with it?
        
               | timschmidt wrote:
               | Any monitor and keyboard will work, however Apple
               | keyboards have a couple extra keys not present on Windows
               | keyboards so require some key remapping to allow access
               | to all typical shortcut key combinations.
        
               | spectre3d wrote:
               | Mainly to swap the Windows and Alt keys, which you can do
               | in System Settings without any additional software.
               | 
               | If you use a mouse with more than right-click and scroll
               | wheel, with side buttons for example, then you'll need
               | extra software.
        
               | macrolime wrote:
               | You can, but with some caveats. Not all screen
               | resolutions work well with MacOS, though using
               | BetterDisplay it will still usually work. If you want
               | touch id, it's better to get the Magic Keyboard with
               | touch id.
        
               | pksebben wrote:
               | I went the 4090 route myself recently, and I feel like
               | all should be warned - memory is a major bottleneck. For
               | a lot of tasks, folks may get more mileage out of
               | multiple 3090s if they can get them set up to run
               | in parallel.
               | 
               | Still waiting on being able to afford the next 4090 +
               | egpu case et al. There are a lot of things this rig
               | struggles with running OOM, even on inference with some
               | of the more recent SD models.
        
               | ckl1810 wrote:
               | Depending on what models you want to run, RTX 4090 or RTX
               | 3090 may not be enough.
               | 
               | Grok-1 was running on an M2 Ultra with 192GB of RAM.
               | 
               | https://twitter.com/ibab_ml/status/1771340692364943750
        
           | holtkam2 wrote:
           | I'm in exactly the same boat. Yeah ofc you can run LMs on
           | cloud servers but my dream project would be to construct a
           | new gaming PC (mine is too old) and serve a LM on it, then
           | serve an AI agent app which I can talk to from anywhere.
           | 
           | Has anyone had luck buying used GPUs, or is that something I
           | should avoid?
        
           | sangnoir wrote:
           | > its hardware isn't really ideal for AI shenanigans
           | 
           | FWIW, I was in the same boat as you and decided to start
           | cheap: old gaming machines can handle AI shenanigans just fine
           | with the right GPU. I use a 2017 workstation (Zen1) and an
           | Nvidia P40 from around the same time, which can be had for
           | <$200 on ebay/Amazon. The P40 has 24GB VRAM, which is more
           | than enough for a good chunk of quantized LLMs or diffusion
           | models, and is in the same perf ballpark as the free Colab
           | tensor hardware.
           | 
           | If you're just dipping your toes without committing, I'd
           | recommend that route. The P40 is a data center card and
           | expects higher airflow than desktop GPUs, so you probably
           | have to buy a "blow kit" or 3D-print a fan shroud and ensure
           | they fit inside your case. This will be another $30-$50. The
           | bigger the fan, the quieter it can run. If you already have a
           | high-end gamer PC/workstation from 2016, you can dive into
           | local AI for $250 all-in.
           | 
           | Edit: didn't realize how cheap P40s now are! I bought mine a
           | while back.
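As a rough check on the 24GB claim above, the weight footprint of a quantized model can be estimated as parameter count times bits per weight, divided by eight. The sketch below is back-of-the-envelope only: the 20% overhead factor for KV cache and activations is an assumption rather than a measured figure, and the model sizes are illustrative.

```python
def estimated_vram_gb(params_billions: float, bits_per_weight: float,
                      overhead: float = 0.20) -> float:
    """Rough inference VRAM estimate in GB.

    overhead is an assumed fudge factor for KV cache and activations.
    """
    # 1e9 params * (bits / 8) bytes per param / 1e9 bytes per GB
    weight_gb = params_billions * bits_per_weight / 8
    return weight_gb * (1 + overhead)


def fits(params_billions: float, bits_per_weight: float,
         vram_gb: float = 24.0) -> bool:
    """True if the estimate fits in the given VRAM (24 GB for a P40)."""
    return estimated_vram_gb(params_billions, bits_per_weight) <= vram_gb


# Illustrative model sizes (billions of parameters) and quantization levels.
for size, bits in [(7, 16), (13, 8), (34, 4), (70, 4)]:
    need = estimated_vram_gb(size, bits)
    print(f"{size}B @ {bits}-bit: ~{need:.1f} GB, fits in 24 GB: {fits(size, bits)}")
```

Under these assumptions, a single 24GB card covers mid-sized models at 4-bit to 8-bit quantization, while a 70B model at 4-bit would need more than one card or CPU offloading.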
        
         | zoklet-enjoyer wrote:
         | I forgot all about Vocaroo!
        
         | causi wrote:
         | We're so close to me being able to open a program, feed in an
         | epub, and get a near-human level audiobook out of it. I'm so
         | excited.
        
           | aedocw wrote:
           | Give https://github.com/aedocw/epub2tts a look, the latest
           | update enables use of MS Edge cloud-based TTS so you don't
           | need a local GPU and the quality is excellent.
        
           | jurimasa wrote:
           | I think this is creepy and dangerous as fuck. Not worth the
           | trouble it will be.
        
             | CamperBob2 wrote:
             | Other sites beckon.
        
             | _zoltan_ wrote:
             | you're gonna be REALLY surprised out there in the real
             | world.
        
         | aftbit wrote:
         | I want to try chaining XTTS2 with something like RVCProject.
         | The idea is to generate the speech in one step, then clone a
         | voice in the audio domain in a second step.
        
       | joshspankit wrote:
       | Does anyone know which local models are doing the "opposite":
       | Identify a voice well enough to do speaker diarization across
       | multiple recordings?
        
         | Drakim wrote:
         | On my wishlist would be a local model that can generate new
         | voices based on descriptions such as "rough detective-like hard
         | boiled man" or "old fatherly grampa"
        
           | mattferderer wrote:
           | You might be interested in this cool app that Microsoft made
           | that I don't think I've seen anyone talk about anywhere
           | called Speech Studio. https://speech.microsoft.com/
           | 
           | I don't recall their voices being the most descriptive but
           | they had a lot. They also let you lay out a bunch of text & have
           | different voices speak each line just like a movie script.
        
         | satvikpendem wrote:
         | Whisper can do diarization but not sure it will "remember" the
         | voices well enough. You might simply have to stitch all the
         | recordings together, run it through Whisper to get the diarized
         | transcript, then process that how you want.
        
           | beardedwizard wrote:
           | Whisper does not support diarization. There are a number of
           | projects that try to add it.
        
         | Teleoflexuous wrote:
         | Whisper doesn't, but WhisperX
         | <https://github.com/m-bain/whisperX/> does. I am using it right
         | now and it's perfectly serviceable.
         | 
         | For reference, I'm transcribing research-related podcasts,
         | meaning speech doesn't overlap a lot, which would be a problem
         | for WhisperX from what I understand. There's also a lot of
         | accents, which are straining on Whisper (though it's also doing
         | well), but surely help WhisperX. It did have issues with
         | figuring out the number of speakers on its own, but that
         | wasn't a problem for my use case.
        
           | joshspankit wrote:
           | WhisperX does diarization, but I don't see any mention of it
           | fulfilling my ask which makes me think I didn't communicate
           | it well.
           | 
           | Here's an example for clarity:
           | 
           | 1. AI is trained on the voice of a podcast host. As a side
           | effect it now (presumably) has all the information it needs
           | to replicate the voice
           | 
           | 2. All the past podcasts can be processed with the AI
           | comparing the detected voice against the known voice which
           | leads to highly-accurate labelling of that person
           | 
           | 3. Probably a nice side bonus: if two people with different
           | registers are speaking over each other the AI could separate
           | them out. "That's clearly person A and the other one is
           | clearly person C"
        
             | c0brac0bra wrote:
             | You can check out PicoVoice Eagle (paid product):
             | https://picovoice.ai/docs/eagle/
             | 
             | You pass N number of PCM frames through their trainer and
             | once you reach a certain percentage you can extract an
             | embedding you can save.
             | 
             | Then you can identify audio against the set of identified
             | speakers and it will return percentage matches for each.
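The enroll-then-identify flow described above can be sketched without any vendor SDK. This is not Picovoice's API: the embeddings are made-up vectors standing in for whatever a speaker-embedding model would produce, and `identify` is a hypothetical helper that scores a new embedding against saved ones using cosine similarity.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def identify(sample: list[float],
             enrolled: dict[str, list[float]]) -> dict[str, float]:
    """Score a sample embedding against every enrolled speaker."""
    return {name: cosine_similarity(sample, emb) for name, emb in enrolled.items()}


# Hypothetical saved embeddings; a real model would output far more dimensions.
enrolled = {
    "host": [0.9, 0.1, 0.3],
    "guest": [0.1, 0.8, 0.5],
}

# Hypothetical embedding extracted from a new audio segment.
scores = identify([0.85, 0.15, 0.35], enrolled)
best = max(scores, key=scores.get)
print(best, scores)
```

Labelling a whole back catalogue would then amount to running this comparison per segment and keeping matches above a threshold chosen by experiment.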
        
         | c0brac0bra wrote:
         | Picovoice says they do this but it's a paid product. It
         | supposedly runs on the device but you still need a key and have
         | to pay per minute.
        
       | lordofgibbons wrote:
       | I've noticed that all TTS systems have a "metallic" sound to them.
       | Can this be fixed automatically using some kind of
       | post-processing?
        
         | huytersd wrote:
         | Try cutting out some of the highs?
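One simple way to try "cutting the highs" is a low-pass filter. The sketch below is a one-pole (exponential smoothing) filter in plain Python. The 6 kHz cutoff and 24 kHz sample rate are illustrative assumptions, and whether this actually removes the metallic quality of a given TTS output is untested here.

```python
import math


def one_pole_lowpass(samples: list[float], cutoff_hz: float,
                     sample_rate: int) -> list[float]:
    """Attenuate content above cutoff_hz with a first-order IIR filter."""
    # Standard RC smoothing coefficient: alpha = dt / (RC + dt)
    dt = 1.0 / sample_rate
    rc = 1.0 / (2.0 * math.pi * cutoff_hz)
    alpha = dt / (rc + dt)

    out = []
    prev = 0.0
    for x in samples:
        prev = prev + alpha * (x - prev)
        out.append(prev)
    return out


def peak(samples: list[float]) -> float:
    """Largest absolute sample value."""
    return max(abs(s) for s in samples)


sample_rate = 24_000  # assumed TTS output rate
n = 2_400             # 0.1 seconds of audio
low_tone = [math.sin(2 * math.pi * 200 * i / sample_rate) for i in range(n)]
high_tone = [math.sin(2 * math.pi * 10_000 * i / sample_rate) for i in range(n)]

low_out = one_pole_lowpass(low_tone, cutoff_hz=6_000, sample_rate=sample_rate)
high_out = one_pole_lowpass(high_tone, cutoff_hz=6_000, sample_rate=sample_rate)

# Compare peaks over the second half, after the filter has settled.
print(f"200 Hz peak: {peak(low_out[n // 2:]):.2f}, "
      f"10 kHz peak: {peak(high_out[n // 2:]):.2f}")
```

A first-order filter rolls off gently, so it dulls the highs rather than removing them; a steeper filter or a targeted EQ cut may suit speech better.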
        
       | muglug wrote:
       | It's funny how a bunch of models use Musk's voice as a proof of
       | their quality, given how disjointed and staccato he sounds in
       | real life. Surely there are better voices to imitate.
        
         | iinnPP wrote:
         | Proving the handling of uncommon speech is definitely a great
         | example to use alongside the other common and uncommon speech
         | examples on the page.
        
         | ianschmitz wrote:
         | Especially with all of the crypto scams using Elon's voice
        
       | tonnydourado wrote:
       | I might be missing something, but what are the non-questionable,
       | or at least non-evil, uses of this technology? Because every
       | single application I can think of is fucked up: porn, identity
       | theft, impersonation, replacing voice actors, stealing the
       | likeness of voice actors, replacing customer support without
       | letting the customers know you're using bots.
       | 
       | I guess you could give realistic voices to people that lost their
       | voices by using old recordings, but there's no way that this is a
       | market that justify the investment.
        
         | swores wrote:
         | What about for remembering lost loved ones? There are dead
         | people I would love to hear talk again, even if I know it's not
         | their personality talking just their voice (and who knows,
         | maybe with LLM training on a single person it could even be
         | roughly their personality, too).
         | 
         | I can imagine a fairly big market of both people setting it up
         | before they die, with maybe a whole load of written content and
         | a schedule of when to have it read in future, and people who've
         | just lost someone, and want to recreate their voice to help
         | remember it.
        
           | tonnydourado wrote:
           | > I can imagine a fairly big market (...)
           | 
           | I can't, and if I could, I think this would be fairly
           | dystopian. Didn't Black Mirror have an episode about
           | something similar? I vaguely remember an Asimov/Arthur C.
           | Clarke short story about the implications of time travel (ish)
           | tech in a similar context. Sounds like a case of "we've built
           | the torment nexus from classic sci-fi novel 'do not build the
           | torment nexus'"
        
           | dotancohen wrote:
           | Jack Crusher did something similar for Wesley.
        
           | grugagag wrote:
           | We already have ways to preserve the voices of people past
           | their lives. Cloning their voices and writing things in their
           | names is not only wrong but deceptive.
        
         | wdb wrote:
         | You can use it to easily fix voice overs on your videos without
         | needing to re-record etc.
        
           | tonnydourado wrote:
           | Reasonable, but I'm skeptical of the market
        
         | CuriouslyC wrote:
         | Text to speech is very close to being able to replace voice
         | actors for a lot of lower budget content. Voice cloning will
         | let directors and creators get just the sound they want for
         | their characters, imagine being able to say "I want something
         | that sounds like Harrison Ford with a French accent." Of
         | course, there are going to be debates about how closely you can
         | clone someone's voice/diction/etc, both extremes are wrong -
         | perfect cloning will hurt artists without bringing extra value
         | to directors/creators, but if we outlaw things that sound
         | similar the technology will be neutered to uselessness.
        
           | tonnydourado wrote:
           | That's basically replacing voice actors and stealing their
           | likeness: both are arguably evil, and mentioned. So, I
           | haven't missed them.
           | 
           | P.S.: "but what about small, indie creators" that's not who's
           | gonna embrace this the most, it's big studios, and they will
           | do it to fuck over workers.
        
             | CuriouslyC wrote:
             | As someone involved in the AI creator sphere, that's a very
             | cold take. Big studios pay top shelf voice talent to create
             | the best possible experience because they can afford it. Do
             | you think Blizzard is using AI to voice
             | Diablo/Overwatch/Warcraft? Of course not. On the other
             | hand, there are lots of small indie games being made now
             | that utilize TTS, because the alternative is no voice, the
             | voice of a friend or a very low quality voice actor.
             | 
             | Do I want to have people making exact clones of voice
             | actors? No. The problem is that if you say "You can't get
             | 90% close to an existing voice actor" then the technology
             | will be able to create almost no human voices, it'll
             | constantly refuse like gemini, even when the request is
             | reasonable. This technology is incredibly powerful and
             | useful, and we shouldn't avoid using it because it'll force
             | a few people to change careers.
        
               | tonnydourado wrote:
               | Have you seen how big studios treat vfx artists? They
               | absolutely will replace voice actors with AI.
               | 
               | Also:
               | 
               | > This technology is incredibly powerful and useful
               | 
               | At what, exactly? The only "useful" case you presented is
               | "actually, replacing voice actors with AI isn't so bad".
        
               | CuriouslyC wrote:
               | You want a world where only the rich can create beautiful
               | experiences. You're either rich or short sighted.
               | 
               | Edit: If you've got a cadre of volunteer voice actors
               | that don't suck hidden somewhere, you need to share
               | buddy. That's the only way your comments make sense.
        
               | tonnydourado wrote:
               | I don't know what else to tell you, I just think people
               | deserve to be paid for the work they do.
               | 
               | Your vision of a world where anyone can create voice for
               | their projects for cheap CAN NOT exist without someone
               | getting exploited. Nor is it sustainable, really.
               | 
               | You said that this world would be worth some people
               | losing their careers, but what do we gain? More
               | games/audiobooks of questionable quality? Is this really
               | worth fucking a whole profession over?
        
               | CuriouslyC wrote:
               | We agree that people should be paid for the work that
               | they *DO*. Your view smacks of elitism, and voice actors
               | don't have any more right to be able to make decent money
               | peddling their voice than indie game devs have to peddle
               | games with synthetic voices.
        
               | tonnydourado wrote:
               | Your view smacks of contempt for workers, particularly in
               | the arts. Especially the emphasis on "do", as if voice
               | actors don't actually work, and just live off royalties or
               | something. The kind of worldview that the rich and the
               | deluded working poor tend to share.
        
               | amarant wrote:
               | Professions disappear, it's a natural side effect of
               | progress. Stablehands aren't really that common anymore,
               | because most people drive cars instead of horses.
               | 
               | I really hope we can deprecate a whole bunch of
               | professions related to fossil fuels, including coal
               | miners and oil drillers etc.
               | 
               | I sympathise with the people working in those
               | professions, I do, but times change and professions come
               | and go, and I don't buy the argument that we should stop
               | inventing new stuff because it might outcompete people.
               | 
               | As for positive uses of this technology, it might be used
               | to immortalise a voice actor. For example Sir David
               | Attenborough probably won't be around forever, but thanks
               | to this technology, his iconic voice might be!
        
               | wsintra2022 wrote:
               | I made an e-book of Carl Rogers narrated by David
               | Attenborough; it turned out decent. I used Coqui.ai, which
               | has sadly closed, taking all my API credits with it.
        
               | Osmose wrote:
               | You have a narrow view of what a beautiful experience is.
               | It does not require professional-level voice acting.
               | 
               | It is not unfair that, in order to have voice acting, you
               | must have someone perform voice acting. You don't have
               | the natural right to professional-level voice acting for
               | free, nor do you need it to create beautiful things.
               | 
               | The tech is simply something that may be possible, and it
               | has tradeoffs, and claiming that it's an accessibility
               | problem does not grant you permission to ignore the
               | tradeoffs.
        
               | ben_w wrote:
               | > You don't have the natural right to professional-level
               | voice acting for free
               | 
               | I also don't have the natural right to work as a
               | professional-level voice actor.
               | 
               | "Natural rights" aren't really a thing, the phrase is a
               | thought-terminating cliche we use for the rhetorical
               | purpose of saying something is good or bad without having
               | to justify it further.
               | 
               | > The tech is simply something that may be possible, and
               | it has tradeoffs, and claiming that it's an accessibility
               | problem does not grant you permission to ignore the
               | tradeoffs.
               | 
               | A few times as a kid, I heard the meme that the American
               | constitution allows everything then tells you what's
               | banned, the French one bans everything then tells you
               | what's allowed, and the Soviet one tells you nothing and
               | arrests you anyway.
               | 
               | It's not a very accurate meme, but still, "permission" is
               | the wrong lens: it's allowed until it's illegal. You want
               | it to be illegal to replace voice actors with synthetic
               | voices, you need to campaign to make it so as this isn't
               | the default. (Unlike with using novel tech for novel
               | types of fraud, where fraud is already illegal and new
               | tech doesn't change that).
        
               | Riverheart wrote:
               | "You want a world where only the rich can create
               | beautiful experiences. You're either rich or short
               | sighted."
               | 
               | Being rich to create a beautiful experience is neither
               | required nor does it require a synthetic voice to
               | achieve.
               | 
               | It does require effort and being rich can reduce that
               | effort for sure.
        
               | ceejayoz wrote:
               | > Do you think Blizzard is using AI to voice
               | Diablo/Overwatch/Warcraft? Of course not.
               | 
               | Do you think Blizzard won't when the tech gets cheap and
               | good enough?
        
               | CuriouslyC wrote:
               | Probably not, because the voice actors are a community
               | draw. In fact, one of the top threads in the overwatch
               | subreddit right now is pictures of all the voice actors.
               | They go to cons and interact with fans and they don't
               | cost so much that losing that value to save a few bucks
               | is worth it.
        
               | Osmose wrote:
               | The lightness with which you treat forcing tens of
               | thousands of people to change their career is absurd.
               | Indie games are hardly suffering for a lack of voice
               | acting, even if you only look at it from a market
               | perspective and ignore that voice acting is a creative
               | interpretation and not simply reading the words the way
               | the director wants.
               | 
               | Yes, we should avoid using it because it will upend the
               | lives of a significant amount of artists for the primary
               | benefit of "some indie games will have more voice acting
               | and big game companies will be able to save money on
               | voice actors". That's not worth it, how could you think
               | it is?
        
               | ben_w wrote:
               | > The lightness with which you treat forcing tens of
               | thousands of people to change their career is absurd.
               | 
               | _Only_ tens of thousands? Cute. For most of the 2010s, I
               | was expecting self-driving cars to imminently replace
               | truck drivers, which is a few million in the US alone
               | and I think around 40-45 million worldwide. I still do
               | expect AI to replace humans for driving, I just don't
               | know how long it will take. (I definitely wasn't
               | expecting "creative artistry" to be an easier problem
               | than "don't crash a car", I didn't appreciate that nobody
               | minds if even 90% of the hands have 6 fingers while
               | everyone minds if a car merely equals humans by failing
               | to stop in 1 of every (3.154e7 seconds per year * 1.4e9
               | vehicles / 30000 human driving fatalities per year ~=
               | 1.47e+12) seconds of existence).
               | 
               | Almost every nation used to be around 90% farm workers,
               | now it's like 1-5% (similar numbers to truckers) and even
               | those are scared of automation; the immediate change was
               | to factory jobs, but those too have shifted into service
               | roles because of automation of the former, and the rest
               | are scared of automation (and outsourcing).
               | 
               | Those service-sector roles? "Computer" used to be a job;
               | Graphical artists are upset about Stable Diffusion;
               | Anyone working with text, from Hollywood script writers
               | to programmers to lawyers, is having to justify their own
               | wages vs. an LLM (for now, most of us are winning this
               | argument; but for how long?)
               | 
               | We get this wrong, it's going to be a disaster; we get it
               | right, we're all living better than the 0.1%.
               | 
               | > Indie games are hardly suffering for a lack of voice
               | acting, even if you only look at it from a market
               | perspective and ignore that voice acting is a creative
               | interpretation and not simply reading the words the way
               | the director wants.
               | 
               | I tried indie game development for a bit. I gave up with
               | something like £1,000 in my best year. (You can probably
               | double that to account for inflation since then).
               | 
               | This is because the indie game sector is also not
               | suffering from a lack of developer talent, meaning
               | there's a lot of competition that drives prices below the
               | cost of living. Result? Hackathons where people compete
               | for the fun of it, not for the end product. Those
               | hackathons are free to say if they do or don't come with
               | rules about GenAI; but in any case, they definitely come
               | with no budget.
               | 
               | > Yes, we should avoid using it because it will upend the
               | lives of a significant amount of artists for the primary
               | benefit of "some indie games will have more voice acting
               | and big game companies will be able to save money on
               | voice actors". That's not worth it, how could you think
               | it is?
               | 
               | A few hours ago I was in the Deutsches Technikmuseum;
               | there's a Jacquard Loom by the cafe:
               | https://technikmuseum.berlin/ausstellungen/dauerausstellunge...
               | 
               | The argument you give here is much the same argument used
               | against that machine, back in the day:
               | https://spectrum.ieee.org/the-jacquard-loom-a-driver-of-the-...
               | 
               | Why do you think those textile workers lost the argument?
               | 
               | And to pre-empt what I think is a really obvious counter,
               | I would also add that the transition we face must be
               | handled with care and courtesy to the economic fears --
               | to all those who read my comment and think "and therefore
               | this will be easy and we should embrace it, just dismiss
               | the nay-sayers as the Luddites they are": why do you
               | think Karl Marx wrote the Communist Manifesto?
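The seconds-per-fatality figure in the parenthetical earlier in this comment can be reproduced directly. The inputs are the commenter's own round numbers (1.4e9 vehicles, 30,000 fatalities per year), not independently sourced statistics.

```python
SECONDS_PER_YEAR = 3.154e7      # figure used in the comment
VEHICLES = 1.4e9                # commenter's vehicle count
FATALITIES_PER_YEAR = 30_000    # commenter's fatality count

# Total vehicle-seconds of existence per year, divided by yearly fatalities.
vehicle_seconds_per_fatality = SECONDS_PER_YEAR * VEHICLES / FATALITIES_PER_YEAR
print(f"{vehicle_seconds_per_fatality:.3g} vehicle-seconds per fatality")
```

The result agrees with the roughly 1.47e+12 quoted in the comment, so the arithmetic holds for the inputs as given.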
        
               | waterhouse wrote:
               | Suppose all existing voice actors, and, to be maximally
               | generous, everyone who had spent >1 year training to be a
               | voice actor, was given a pension for some years, paying
               | them the greater of their current income or some average
               | voice actor income. And then there would be no limits on
               | using AI voices to substitute for voice actors.
               | 
               | Would you be happy with that outcome, or do you have
               | another objection?
        
             | allannienhuis wrote:
             | I don't disagree with the thought that large companies are
             | going to try to use these technologies too, with typical
             | lack of ethics in many cases.
             | 
             | But some of this thinking is a bit like protesting the use
             | of heavy machinery in roadbuilding/construction, because it
             | displaces thousands of people with shovels. One difference
             | with this type of technology is that the means to use it
             | doesn't require massive amounts of capital like the heavy
             | machinery example, so more of those shovel-wielders will be
             | able to compete with those that are only bringing capital
             | to the table.
        
               | tonnydourado wrote:
               | I'm not saying that this should be forbidden or
               | something. I just wonder what is the motivation for the
               | people pitching and actually developing this. I'm all for
               | basic, non-profit-driven, research, but at some point you
               | gotta ask yourself "what am I helping create here?"
        
               | CrazyStat wrote:
               | Saying something is evil would seem to suggest that you
               | think it should be forbidden. Maybe you should choose a
               | different word if that's not your intention.
        
             | ben_w wrote:
             | I disagree on three of your points.
             | 
             | It is creating a new and fully customisable voice actor
             | that perfectly matches a creative vision.
             | 
             | To the extent that a skilled voice actor can already blend
             | existing voices together to get, say, French Harrison Ford,
             | for it to be evil for a machine to do it would require it
             | to be evil for a human to do it.
             | 
             | Small indie creators have a budget of approximately
             | nothing, this kind of thing would allow them to voice _all_
             | NPCs in some game rather than just the main quest NPCs.
             | (And that's true even in the absence of LLMs to generate
             | the flavour text for the NPCs so they're not just repeating
             | "...but then I took an arrow to the knee" as generic
             | greeting #7 like AAA games from 2011).
             | 
             | Big studios _may also_ use this for NPCs to the economic
             | detriment of current voice actors, but I suspect this will
             | be a tech which leads to "induced demand"[0] -- though
             | note that this can also turn out _very badly_ and isn't
             | always a good thing either:
             | https://en.wikipedia.org/wiki/Cotton_gin
             | 
             | [0] https://en.wikipedia.org/wiki/Induced_demand
        
         | allannienhuis wrote:
         | I think that better quality audio content generated from
         | text would be a killer application. As someone else mentioned,
         | pipe in an epub, output an audiobook or video game content.
         | With additional tooling (likely via ai/llm analysis), this
         | could enable things like dramatic storytelling with specific
         | character voices and dynamics interpreted from the content of
         | the text.
         | 
         | I can see it empowering solo creators in similar ways that
         | modern music tools enable solo or small-budget musicians today.
        
           | latexr wrote:
           | > pipe in an epub, output an audiobook or video game content.
           | 
           | That falls into "replacing voice actors", mentioned by the
           | OP.
        
             | blackqueeriroh wrote:
             | No, it really doesn't. There are thousands of very smart
             | and talented creators without the budget to hire voice
             | actors. This lets them get a start. AI voices let you lower
             | the barrier to entry, but they won't replace most voice
             | actors because the higher you go up the stack, the more the
             | demand for real actors will also go up because AI voices
             | aren't anywhere near being able to replace real voice
             | actors.
        
               | tonnydourado wrote:
               | As another reply put it, I'm very skeptical that the
               | benefits for small content creators will offset the
               | damage to society as a whole, from increased fraud and
               | harassment.
        
               | latexr wrote:
               | > AI voices let you lower the barrier to entry, but they
               | won't replace most voice actors because the higher you go
               | up the stack, the more the demand for real actors will
               | also go up
               | 
               | That is as absurd as saying LLMs are increasing the
               | demand for writers.
               | 
               | > because AI voices aren't anywhere near being able to
               | replace real voice actors.
               | 
               | Even if that were true--which it is not; the current crop
               | is more than adequate to read long texts--it assumes the
               | technology has reached its limit, which is equally
               | absurd.
        
             | albert_e wrote:
             | What if I want to listen to my notes in my own voice
             | 
             | Or my favorite books in my own voice.
             | 
             | Or my lecture notes in my professor's voice.
        
           | devinprater wrote:
           | Or, when it gets fast enough, someone could have their own
           | personal dub of video games (BlazBlue Central Fiction) or TV
           | shows and such.
        
         | mostrepublican wrote:
         | I used it to translate a short set of tv shows that were only
         | available in Danish with no subtitles in any other language and
         | made them into English for my personal watching library.
         | 
         | The episodes are about 95% just a narrator with some background
         | noises.
         | 
         | Elevenlabs did a great job with it and I cranked through the 32
         | episodes (about 4 mins each) relatively easily.
         | 
         | There is a longer series (about 60 hours) only in Japanese that
         | I want to do the same thing for. But don't want to spend
         | Elevenlabs prices to do.
        
           | ukuina wrote:
           | OpenAI TTS is very competitively priced: $15/1M chars.
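
To put that rate next to the 60-hour series mentioned above, here is a rough
back-of-the-envelope sketch. Only the $15 per 1M characters figure comes from
the thread. The speaking rate (150 words per minute) and the average word
length (6 characters, counting the trailing space) are assumptions chosen for
illustration, and real narration scripts will vary.

```python
# Rough cost estimate for text-to-speech billed per character.
# Only the price is taken from the thread; the rest are assumptions.

PRICE_PER_MILLION_CHARS_USD = 15.0  # quoted in the comment above

# Assumptions (not from the thread): typical narration pace and word length.
WORDS_PER_MINUTE = 150
CHARS_PER_WORD = 6  # average word plus one trailing space


def estimate_chars(minutes: float) -> int:
    """Approximate number of characters needed to narrate `minutes` of audio."""
    return round(minutes * WORDS_PER_MINUTE * CHARS_PER_WORD)


def estimate_cost_usd(minutes: float) -> float:
    """Approximate cost in USD of synthesizing `minutes` of narration."""
    return estimate_chars(minutes) * PRICE_PER_MILLION_CHARS_USD / 1_000_000


if __name__ == "__main__":
    short_series = 32 * 4   # 32 episodes of about 4 minutes each
    long_series = 60 * 60   # about 60 hours, in minutes
    for label, minutes in [("short series", short_series),
                           ("long series", long_series)]:
        print(f"{label}: ~{estimate_chars(minutes):,} chars, "
              f"~${estimate_cost_usd(minutes):.2f}")
```

Under these assumptions the cost scales linearly with running time, so the
estimate is only as good as the guessed characters-per-minute figure. It also
covers speech synthesis alone, not transcription or translation.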
        
         | kajecounterhack wrote:
         | I like the idea of cloning my own voice and having it speak in
         | a foreign language
        
         | SunlitCat wrote:
         | Maybe having better real time conversations in computer games.
         | Like game characters saying your name in voiceovers.
        
         | AnonC wrote:
         | > what are the non-questionable, or at least non-evil, uses of
         | this technology?
         | 
         | iPhone Personal Voice [1] is one. It helps people who are
         | physically losing their voice and the ones around them to still
         | have their voice in a different way. Apple takes long voice
         | samples of various texts for this though.
         | 
         | [1]: https://www.youtube.com/watch?v=ra9I0HScTDw
        
           | tonnydourado wrote:
           | That's kinda what I was thinking on the second paragraph.
           | Still, gotta be a small market.
        
         | IMTDb wrote:
         | Non robotic screen readers for blind people
        
           | tonnydourado wrote:
           | That would be non-evil, sure. But I wonder if blind people
           | even want it? They're already listening to screen readers at
           | insane speeds, up to 6-8x, I think. Do they even care that it
           | doesn't sound "realistic"?
        
             | blackqueeriroh wrote:
             | Well, I'm sure the blind readers of HN (which I am certain
             | exist) can answer this question, and you, a sighted person,
             | don't need to even wonder from your position of unknowing.
        
               | tonnydourado wrote:
               | I mean, I explicitly used "wonder" because I don't wanna
               | assume about blind people's experiences and needs. What
               | else should I have done so you wouldn't come in kicking
               | me in the nuts?
        
               | SamPatt wrote:
               | In this thread there's a bunch of "non-evil" responses,
               | and your replies are all "I'm skeptical" or just
               | dismissing them outright.
               | 
               | It appears from the outside that you've decided this is
               | Officially Bad technology and aren't genuinely seeking
               | evidence otherwise.
        
               | tonnydourado wrote:
               | You're assuming worse of me than I'm assuming of the
               | technology.
               | 
               | There's almost no reply here with a use that is a) not
               | somewhat bad and b) has enough of an upside to compensate
               | the downsides.
               | 
               | Except maybe this one, but I do know enough about
               | accessibility to know how blind people generally use
               | computers, which is why I asked the question.
        
         | Mkengine wrote:
         | I don't know how stressful my life will be then, but I thought
         | about reading to my kids later and creating audiobooks with my
         | voice for them, for when I am traveling for work, so they can
         | still listen to me "reading" to them.
        
         | bigcoke wrote:
         | AI girlfriend... ok I'm done.
        
           | lenerdenator wrote:
           | It's 2024. Are nerds still trying to turn any technology of
           | sufficient ability into Kelly LeBrock?
        
             | bigcoke wrote:
             | this is going to be a real thing for gen z, but replace
             | kelly with any girl from anime
        
               | lenerdenator wrote:
               | Jeeze, I can't imagine why women feel so alienated from
               | the tech industry.
               | 
               | It's almost as if any time some sort of way to make
               | computers more human-like emerges, the first thing a
               | subset of the men in the space do is think "How can I use
               | this to make a woman who has _absolutely_ no function
               | other than my emotional, practical, and physical
               | gratification? "
        
               | amenhotep wrote:
               | Humans in desiring deep emotional and sexual connections
               | with people of their desired gender and being driven to
               | weird behaviours when they can't achieve it in the way
               | you personally approve of shock
        
               | lenerdenator wrote:
               | Then work on it. Ask friends for feedback. Go to therapy.
               | Have some damned introspection instead of just reducing
               | 51% of the people on the planet to a bangmaid.
        
         | lenerdenator wrote:
         | > there's no way that this is a market that justify the
         | investment.
         | 
         | It's not just about justifying the investment. You can make just
         | about anything worth the investment as measured by a 90-day
         | window of fiscal reporting. Hitmen were a wildly profitable
         | venture for La Cosa Nostra.
         | 
         | It's about not justifying the societal _risk_.
        
         | YoshiRulz wrote:
         | It could be used by people who can write English fluently, but
         | are slow at speaking it, as a more personal form of text-to-
         | speech.
         | 
         | Personally, I'm eager to have more control over how my voice
         | assistant sounds.
        
           | Zambyte wrote:
           | Similarly, a real-time voice to voice translation system that
           | uses the speaker's voice would be really cool.
        
         | layer8 wrote:
         | It enables you to use your favorite audiobook reader's voice for
         | all your TTS needs. E.g. you can have HN comments read to you
         | by Patrick Stewart, or by the Honest Trailers voice. Maybe you
         | find that questionable? ;)
        
           | zdragnar wrote:
           | So, replacing voice actors with unpaid clones of their
           | voices, effectively stealing their identity.
           | 
           | The range of use goes from totally harmless fun to downright
           | evil.
        
             | RyanCavanaugh wrote:
             | The existence of Photoshop doesn't mean that you can put
             | Kobe Bryant on a Wheaties box without paying him. There's
             | no reason that a voice talent's voice can't be subject to
             | the same infringement protections as a screen actor's or
             | athlete's likeness.
        
               | popalchemist wrote:
               | You absolutely can put Kobe on a Wheaties box without
               | problems legally, IF you do not sell it. That's "fair
               | use." It has not been tested in court yet, but precedent
               | seems to suggest that creating voice clones for private
               | use is also fair use, ESPECIALLY if that person is a
               | celebrity, because privacy rights are limited for
               | celebrities.
        
             | layer8 wrote:
             | If I take pictures of someone and hang my home with AI-
             | generated copies of those pictures, I'm not stealing their
             | identity.
        
           | johncalvinyoung wrote:
           | Utterly questionable.
        
         | wongarsu wrote:
         | Organized crime should be happy to invest in that. Especially
         | the "indian scam callcenter" type of crime.
        
         | tompetry wrote:
         | I have the same concerns generally. But one non-evil use popped
         | into my head...
         | 
         | My dad passed away a few months ago. Going through his things,
         | I found all of his old papers and writings; they have great
         | meaning to me. It would be so cool to have them as audio files,
         | my dad as the narrator. And for shits, try it with a British
         | accent.
         | 
         | This may not abate the concerns, but I'm sure good things will
         | come too.
        
           | block_dagger wrote:
           | Serious question: is this a healthy way to treat ancestors?
           | In the future will we just keep grandma around as an AI
           | version of her middle aged self when she passes?
        
             | tompetry wrote:
             | Fair question. People have kept pictures, paintings, art,
             | belongings, etc of their family members for countless
             | generations. AI will surely be used to create new ways to
             | remember loved ones. I think that is very different from
             | "keeping around grandma as an AI version of herself", and
             | pretending they are still alive, which I agree feels
             | unhealthy.
        
             | Narishma wrote:
             | There's a Black Mirror episode about something like that,
             | though I don't remember the details.
        
               | oli-g wrote:
               | Yup, "Be Right Back", S2E1
               | 
               | And possibly another one, but that would be a spoiler
        
               | GTP wrote:
               | I remember a journalist actually doing it, but just the
               | AI part of course, not the robot.
        
             | gremlinsinc wrote:
             | it worked for Superman, he seemed well adjusted after
             | talking to his dead parents.
        
             | mynameisash wrote:
             | I think everyone's entitled to their opinion here. As for
             | me, though: my brother died at 10 years old (back in the
             | 90s). While there are some home videos with him talking,
             | it's never for more than a few seconds at a time.
             | 
             | Maybe a decade ago, I came across a cassette tape that he
             | had used to record himself reading from a book for school -
             | several minutes in duration.
             | 
             | It was incredibly surprising to me how much he sounded like
             | my older brother. It was a very emotional experience, but
             | personally, I can't imagine using that recording to
             | bootstrap a model whereby I could produce more of his
             | "voice".
        
           | hypertexthero wrote:
           | Not sure if this is related to this tech, but I think it is
           | worthwhile: The Beatles - Now And Then - The Last Beatles
           | Song (Short Film)
           | 
           | https://www.youtube.com/watch?v=APJAQoSCwuA
        
         | bdcravens wrote:
         | The first couple I've come up with are training courses at
         | scale, or converting videos with accents you have a hard time
         | understanding to one you can (no one you'll understand better
         | than yourself)
        
         | accrual wrote:
         | A long term goal of mine is to have a local LLM trained on my
         | preferences and with a very long memory of past conversations
         | that I could chat with in real time using TTS. It would be
         | amazing to go on a walk with Airpods and chat with it, ask
         | questions, learn about topics, etc.
        
           | willsmith72 wrote:
           | I do that already with the chatgpt mobile app, but not with
           | my own voice.
           | 
           | I'd like it if there were more (and non-american) voice
           | options, but I don't think I'd ever want it to be my voice
           | I'm hearing back.
        
             | accrual wrote:
             | Yeah, I wouldn't necessarily want it to be my own voice
             | either, but it would be very cool to make it be the voice
             | of someone I enjoy listening to. :)
        
         | victorbjorklund wrote:
         | Why is replacing voice actors evil? How is it worse than
         | replacing any other job using a machine/software?
        
           | buu700 wrote:
           | Agreed. I think the framing of "stealing" is a needlessly
           | pessimistic prediction of how it might be used. If a person
           | owns their own likeness, it would be logical to implement
           | legal protections for AI impersonations of one's voice. I
           | could imagine a popular voice actor scaling up their career
           | by using AI for a first draft rendering of their part of a
           | script and then selectively refining particular lines with
           | more detailed prompts and/or recording them manually.
           | 
           | This raises a lot of complicated issues and questions, but
           | the use case isn't inherently bad.
        
           | machomaster wrote:
           | The problem is not about replacing actors with technology. It
           | is about replacing the particular actors with their computer-
           | generated voice. It's about likeness-theft.
        
         | spyder wrote:
         | Huh? Replacing human labor with machines is evil? You wouldn't
         | even be able to post this comment without that happening, because
         | computers wouldn't exist or we wouldn't have time for that
         | because many of us would work on farms to produce enough food
         | without the use of human-replacing technologies.
         | 
         | In a similar way as machines allowed us to produce an abundance
         | of food with less labor, voice AI combined with AI translation
         | can make information more accessible for the world. Voice actors
         | wouldn't be able to voice act all the useful information in the
         | world (especially for the more niche topics and for the
         | smaller languages) because it wouldn't be worth paying them, and
         | humans are also slower than machines. We are not far from
         | almost realtime voice translation from any language to any
         | other one. Sure, we can do it with text-only translation, but
         | voice makes it more accessible for a lot of people. (For example,
         | between 5-10% of the world has dyslexia.)
        
         | albert_e wrote:
         | If I am learning new content I can make my own notes and
         | convert them into an audiobook for my morning jog or office
         | commute using my own voice.
         | 
         | If I am a content creator I can generate content more easily by
         | letting my AI voice narrate my slides say. Yes that is cheap
         | and lower quality than a real narrator who can deliver more
         | effective real talks ...but there is a long tail of mediocre
         | content on every topic. Who cares as long as I am having fun,
         | sharing stuff, and not doing anything illegal or wrong.
        
         | pksebben wrote:
         | There's a huge gap in uses where listenable, realistic voice is
         | required, but the text to be spoken is not predetermined. Think
         | AI agents, NPCs in dynamically generated games, etc. These
         | things are currently not really doable with the current crop of
         | TTS because either they take too long to run or they sound
         | awful.
         | 
         | I think the bulk of where this stuff will be useful isn't
         | really visible yet b/c we haven't had the tech to play around
         | with enough.
         | 
         | There is also certainly a huge swath of bad-actor stuff that
         | this is good for. I feel like a lot of the problems with modern
         | tech fall under the umbrella of "We're not collectively mature
         | enough to handle this much power" and I wish there were a
         | better solution for all of that.
        
           | gremlinsinc wrote:
           | eh, you mean the solution isn't, so here's even more power...
           | see you next week!
        
         | paczki wrote:
         | The ability to use my own voice in other languages so I can do
         | localization on my own youtube videos would be huge.
         | 
         | With game development as well, being able to be my own voice
         | actor would save me an immense amount of money that I do not
         | have and give me even more creative freedom and direction of
         | exactly what I want.
         | 
         | It's not ready yet, but I do believe that it will come.
        
           | Capricorn2481 wrote:
           | People are already doing this and it was hugely controversial
           | in The Finals
        
         | dougmwne wrote:
         | It seems like it would be great for any kind of voiceover work
         | or any recorded training or presentation. If you want to
         | correct a mis-speak or add some information, instead of re-
         | recording the entire segment, you could seamlessly update a few
         | words or sentences.
        
         | thatguysaguy wrote:
         | I'm 100% going to clone my voice and use it on my discord bot.
        
         | andrewmcwatters wrote:
         | I want to preserve samples of my voice as I age so that when
         | voice replication technology improves in the future, I can hear
         | myself from a different time of my life in ways that are not
         | prerecorded.
         | 
         | I would also like to give my children this as a novelty of
         | preserved family history so if I so desire, I can have fun with
         | them by letting them hear me from different ages.
        
         | thatguysaguy wrote:
         | To think of non-evil versions just consider cases where right
         | now there's no voice actor to replace, but you could add a
         | voice. E.g. indie games.
        
         | drusepth wrote:
         | Super-niche use-case: our game studio prototyped a multiplayer
         | horror game where we played with cloning player voices to be
         | able to secretly relay messages to certain players as if it
         | came from one of their team-mates (e.g. "go check out below
         | deck" to split a pair of players up, or "I think Bob is trying
         | to sabotage us" to sow inter-player distrust, etc).
         | 
         | Less-niche use-case: if you use TTS for voice-overs and/or NPC
         | dialogue, there can still be a lot of variance in speech
         | patterns / tone / inflections / etc when using a model where
         | you've just customized parameters for each NPC -- using a
         | voice-clone approach, upon first tests, seems like it might
         | provide more long-term consistency.
         | 
         | Bonus: in a lot of voiced-over (non-J)RPGs, the main character
         | is text-only (intentionally not voiced) because they're often
         | intended to be a self-insert of the player (compared to JRPGs
         | which typically have the player "embody" a more fleshed-out
         | character with their own voice). If you really want to lean into
         | self-insert patterns, you could have a player provide a short
         | sample of their voice at the beginning of the game and use that
         | for generating voice-overs for their player character's
         | dialogue throughout the game.
        
           | Terr_ wrote:
           | The idea of a personalized protagonist voice is interesting,
           | but I'd worry about some kind of uncanny valley where it
           | sounds like myself but is using the wrong word-choices or
           | inflections.
           | 
           | Actually, getting it to sound "like myself" in the first
           | place is an extra challenge! For many people even actual
           | recordings sound "wrong", probably because your self-
           | perception involves spoken sound being transmitted through
           | your neck and head, with a different blend of frequencies.
           | 
           | After that is solved, there's still the problem of bystanders
           | remarking: "Is that supposed to sound like you? It doesn't
           | sound like you."
        
         | sunshine_reggae wrote:
         | You forgot plausible deniability, AKA "I never said that".
        
       | starwin1159 wrote:
       | Cantonese can't be imitated
        
         | paulryanrogers wrote:
         | Why?
        
         | Zambyte wrote:
         | How do people learn it?
        
       | trollied wrote:
       | Look up iPhone "personal voice". People don't seem to know about
       | it.
        
       | burcs wrote:
       | There's a "vocal fry" aspect to all of these voice cloning tools,
       | a sort of uncanny valley where they can't match tones correctly,
       | or get fully away from this subtle Microsoft Sam-esque
       | breathiness to their voice. I don't know how else to describe it.
        
         | blackqueeriroh wrote:
         | Yeah, this is why I'm nowhere near worried about this replacing
         | voice actors for the vast majority of work they currently get
         | paid for.
        
       | Fripplebubby wrote:
       | Really interesting! Reading the paper, it sounds like the core of
       | it is broken into two things:
       | 
       | 1. Encoding speech sounds into an IPA-like representation,
       | decoding IPA-like into target language
       | 
       | 2. Extracting "tone color", removing it from the IPA-like
       | representation, then adding it back into the target layer
       | (emotion, accent, rhythm, pauses, intonation)
       | 
       | So as a result, I am a native English speaker, but I could hear
       | "my" voice speaking Chinese with similar tone color to my own! I
       | wonder, if I recorded it, and then did learn to speak Chinese
       | fluently, how similar it would be? I also wonder whether there is
       | some kind of "tone color translator" that is needed to translate
       | the tone color markers of American English into the relevant ones
       | for other languages, how does that work? Or is that already
       | learned as part of the model?
        
       | Havoc wrote:
       | Tried it locally - can't get anywhere near the clone quality of
       | the clips on their site.
       | 
       | Not even close. Perhaps I'm doing something wrong...
        
       | pantsforbirds wrote:
       | I wonder if in < 5 years I can make a game with a local LLM + AI
       | TTS to create realistic NPCs. With enough of these tools I think
       | you could make a very cool world-sim type game.
        
         | rcarmo wrote:
         | I'm much more interested in the dismal possibility of using
         | this in politics. Nation state actors, too.
        
       | treprinum wrote:
       | Did this just obliterate ElevenLabs?
        
         | htrp wrote:
         | Eleven's advantage is being able to have consistent outputs
         | through high quality training data.
        
       | akashkahlon wrote:
       | So soon every novel becomes a movie made by the author themselves,
       | using Sora and audio bought from all the suitable actors
        
         | rcarmo wrote:
         | This can't really do a convincing Sean Connery yet.
        
         | _zoltan_ wrote:
         | just buy more NVDA. :-)
        
         | Multicomp wrote:
         | I hope so, then those of us who want to tell a story (writers,
         | whether comic or novelist or short story or screenplay or
         | teleplay or whatever) will be able to compete more and more on
         | quality and richness of the story copy and content to the
         | audience, not with the current comparative advantage of media
         | choices being made for most storytellers based on difficulty to
         | render.
         | 
         | Words on page are easier than still photos, which are easier
         | than animation, which are easier than live-action TV, which are
         | easier than IMAX movies etc.
         | 
         | If we move all of the rendering of the media into automation,
         | then it's just who can come up with the best story content, and
         | you can render it whatever way you like: book, audiobook,
         | animation, live action TV, web series, movie, miniseries,
         | whatever you like.
         | 
         | Granted - the AI will come for us writers too, it already is in
         | some cases. Then the Creator Economy itself will be consumed
         | with eventually becoming 'who can meme the fastest' on an
         | industrial scale for daily events on the one end, and who has
         | taken the time to paint / playact / do rendering out in the
         | real world on the other.
         | 
         | But I sure would love to be able to make a movie out of my
         | unpublished novel, and realistically today, that's impossible
         | in my lifetime. Do I want the entire movie-making industry to
         | die so I and others like me can have that power? No. But if the
         | industry is going to die / change drastically anyways due to
         | forces beyond my control, does that mean I'm not going to take
         | advantage of the ability? Still no.
         | 
         | IDK. I don't have all the answers to this.
         | 
         | But yes, this (amazingly accurate voice cloner after a tiny
         | clip?! wow) product is another step towards that brave new
         | world.
        
       | thorum wrote:
       | OpenVoice currently ranks second-to-last in the Huggingface TTS
       | arena leaderboard, well below alternatives like styletts2 and
       | xtts2:
       | 
       | https://huggingface.co/spaces/TTS-AGI/TTS-Arena
       | 
       | (Click the leaderboard tab at the top to see rankings)
        
         | carbocation wrote:
         | I would like to see the new VoiceCraft model on that list
         | eventually (weights released yesterday, discussion at [1]).
         | 
         | 1 = https://news.ycombinator.com/item?id=39865340
        
         | KennyBlanken wrote:
         | Having gone through almost ten rounds of the TTS Arena, XTTS2
         | has tons of artifacts that instantly make it sound non-human.
         | OpenVoice doesn't.
         | 
         | It wouldn't surprise me if people recognize different
         | algorithms and purposefully promote them over others, or alter
         | the page source with a userscript to see the algorithm before
         | listening and click the one they're trying to promote. Looking
         | at the leaderboard, it's obvious there's manipulation going on,
         | because Metavoice is highly ranked but generates absolutely
         | terrible speech with extremely unnatural pauses.
         | 
         | Elevenlabs was scarily natural sounding and high quality; the
         | best of the ones I listened to so far. Pheme's speech overall
         | sounds really natural, but has terrible sound quality, which is
         | probably why it's ranked so well. If Pheme could be higher
         | quality audio, it'd probably match Elevenlabs.
        
         | Jackson__ wrote:
         | As someone who has used the arena maybe ~3 times, the subpar
         | voice quality in the demo linked immediately stood out to me.
        
         | ckl1810 wrote:
         | Is there a benchmark for compute needed? Curious to see if
         | anyone is building / has built a Zoom filter, or Mobile app,
         | whereby I can speak English, and out comes Chinese to the
         | listener.
        
         | c0brac0bra wrote:
         | I'd like to see Deepgram Aura on here.
        
       | lacoolj wrote:
       | That season of 24 is coming true
        
       | yogorenapan wrote:
       | Note: the open source version is watered down compared to their
       | commercial offering. Tried both out and the quality doesn't come
       | close to
        
       | ckl1810 wrote:
       | OpenAI volleys back:
       | 
       | https://twitter.com/OpenAI/status/1773760852153299024
        
       | speedbird wrote:
       | Not convinced. The second reference has a slight Indian accent
       | that isn't carried over into the generated samples.
       | 
       | Training data bias?
        
         | opdahl wrote:
         | What are you talking about? I am not noticing it at all.
        
       | chenxi9649 wrote:
       | I am the most impressed by the cross-lingual voice cloning...
       | 
       | https://research.myshell.ai/open-voice/zero-shot-cross-lingu...
       | 
       | I can only speak on their Dutch -> Chinese voice cloning but it's
       | better than anything else I've tried. There is basically no
       | "English/Dutch accent" in the Chinese at all. Whereas the
       | ElevenLabs Chinese voice (cloning or not) is so much worse...
        
       ___________________________________________________________________
       (page generated 2024-03-29 23:01 UTC)