[HN Gopher] Voice Isolator: Strip background noise for film, pod...
       ___________________________________________________________________
        
       Voice Isolator: Strip background noise for film, podcast, interview
       production
        
       Author : davidbarker
       Score  : 104 points
       Date   : 2024-07-03 19:51 UTC (3 hours ago)
        
 (HTM) web link (elevenlabs.io)
 (TXT) w3m dump (elevenlabs.io)
        
       | IncreasePosts wrote:
       | What is the current SOTA for voice->text?
       | 
       | I have a recording I've been sitting on for 2 years (a guest
       | lecture which a friend recorded) which contains a very heavy
       | amount of background noise, where you can just barely make out
       | what is being said by the lecturer. I wonder if there is any hope
       | I will ever be able to read a transcript from it.
       | 
       | I can figure out what the lecturer is saying (maybe only because
       | I have some context about what he is talking about), but it is
       | too painful to sit through 2 hours of it and try to transcribe
       | it.
       | 
       | I tried uploading the audio file to this service, but basically
       | get nothing useful returned to me.
        
         | bequanna wrote:
         | I've had good results with Whisper. You can use the OpenAI API
         | and it is also open source:
         | 
         | https://github.com/openai/whisper
        
           | geerlingguy wrote:
           | And if you have a Mac, get MacWhisper. It's been a godsend
           | for transcribing almost anything. Usually pretty good if the
           | main voice is discernible at all--though in OP's case, if the
           | main voice is almost indistinguishable it might not do
           | amazing.
        
         | pants2 wrote:
         | Deepgram Nova 2 is among the best right now, more accurate than
         | Whisper in my testing.
        
           | xan_ps007 wrote:
           | We have built a dockerized open source stack for Whisper +
           | Llama3 + MeloTTS. Whisper and MeloTTS for now work fine for
           | our use cases. https://github.com/bolna-
           | ai/bolna/tree/master/examples/whisp...
        
           | Prompter9856 wrote:
           | You can upload your file and try out Deepgram for free, just
           | to see what the results look like for your audio. No harm in
           | trying: https://deepgram.com/free-transcription
           | 
           | Disclosure: I work for Deepgram
        
         | ProfessorLayton wrote:
         | Try giving Audacity a shot to clean up the audio, it has a
         | built-in noise reduction feature that's configurable. I've used
         | it to varying degrees of success, but works especially well
         | with the same sounds ANC headphones are good at blocking.
        
         | duped wrote:
         | fwiw, the commercial services that do this are called "audio
         | forensics" (unsurprisingly, they're usually hired by cops and
         | lawyers). You pay them to use their (often expensive) software
         | tools to clean up audio and provide a transcription.
         | 
         | I get the appeal of automating this task but the SOTA is not to
         | automate it at all.
        
         | willsmith72 wrote:
         | you can pay people online trivial amounts of money for this, it
         | will be far cheaper and quicker than waiting for the right AI
         | 
         | by the way, even once we get to a sufficient AI, how do you
         | verify it without listening to the whole thing anyway?
         | 
         | it's only 2 hours, if you're a fast typer at max it would take
         | you 1 work day to transcribe yourself, or <$200 by a
         | professional
        
           | IncreasePosts wrote:
           | It's not an issue of typing speed, it's that puzzling out
           | what was said takes a number of re-listens, at high gain
           | which hurts my ears after a while since the voice is just
           | barely above the noise floor.
        
         | dbspin wrote:
         | I don't know about SOTA, but 'Adobe Podcast Studio' a web app
         | that I believe is still free / in beta offers excellent sound
         | cleanup. So much so that many podcast / radio producers I know
         | no longer frequently use Izotope RX - one of the industry
         | standard tools. Adobe are obviously horrendous, but if it's for
         | a one time use I'd give it a go. The feature you want is the
         | 'enhance speech filter'.
         | 
         | https://podcast.adobe.com/enhance
        
           | wkcheng wrote:
           | This is really helpful, thanks! I have a bunch of audio that
           | I need to clean up and this looks like it could fit the bill.
           | 
           | Do you know if there are any license issues with this? I
           | don't see any license page--will they train/retain the
           | recording?
        
           | echelon wrote:
           | STT: Whisper
           | 
           | TTS: GPTSOVITS / StyleTTS2
           | 
           | VTV: RVCv2
           | 
           | Open source isn't really doing a great job at voice, music,
           | or video. It's managing to keep up in LLM and image spaces,
           | but it's falling far behind in the multimedia department.
        
       | dougdonohoe wrote:
       | From "How much does it cost?" in the FAQ: Voice Isolator costs 1000
       | characters for every minute of audio.
       | 
       | Since when are characters a currency?
        
         | icepat wrote:
         | This confused me as well, and made me lose interest.
         | Intentional non-answers like this are rather grating.
        
         | samspenc wrote:
         | This is par for the course for any text-to-speech or speech-to-
         | text service these days, check out ElevenLabs' other service
         | pricing and it is similar - they have monthly pricing but the
         | character usage is capped at each level.
         | 
         | Actually other major cloud providers, including AWS, Azure and
         | GCP, have similar character / token / word count based pricing
         | as well.
        
           | recursive wrote:
           | Text-to-speech and speech-to-text, both involve text, which
           | has characters. Notably, this does not.
        
             | chankstein38 wrote:
             | I'm not trying to argue with you but just want to point
             | out, this is why they made an equivalence between time and
             | characters. We know 1min of audio == 1000 characters so now
             | we know how to translate characters to time.
        
               | recursive wrote:
               | If that's really it, then why the obtuse intermediate
               | conversion unit? Just bill by the minute.
        
           | Rebelgecko wrote:
           | They charge in characters per time, not money per character
        
         | IvyMike wrote:
         | I agree with your annoyance here.
         | 
         | I did end up clicking thru to get the full story. On their
         | pricing page (https://elevenlabs.io/pricing) they are up front,
         | with several monthly tiers; their "most popular" $11/month tier
         | says "100k Characters/mo (~120 mins audio)"
        
         | recursive wrote:
         | Their pricing page is full of weird uses of "character" as a
         | unit too. Things like "monthly character limit", "additional
         | character pricing", and so on.
         | 
         | https://elevenlabs.io/pricing
         | 
         | Nowhere is this novel usage of "character" defined. I know
         | about text characters, and story characters. But this seems to
         | be different. It's hard to imagine why they didn't define what
         | they mean by "character" or just make the pricing model more
         | straight-forward.
        
           | saghm wrote:
           | I know this is a common kneejerk reaction nowadays, but it's
           | hard not to wonder if this is due to using AI to generate the
           | text on these pages.
        
             | slama wrote:
             | I assure you they didn't use an LLM to invent their pricing
             | strategy. Elevenlabs subscription levels are designed
             | around converting text-to-speech in the primary use-case.
             | They charge by the character when converting text to
             | speech, so characters are kind of the currency on their
             | site. 1000 characters per minute makes sense in that
             | context and I find it surprisingly expensive compared to
             | generation
        
               | saghm wrote:
               | I still don't think I understand the pricing model based
               | on your additional info. If characters are currency that
               | you buy with real money, what does "characters per
               | minute" mean? I guess if each character is $0.01 and then
               | you want 20 minutes of audio, you can say that's $10.00
               | per minute for 20 minutes, but that means that the number
               | of characters in the actual text wouldn't affect it at
               | all, so...why even make up a different currency then?
        
               | DidYaWipe wrote:
               | Not to mention: What if the vocal portion of an audio
               | clip doesn't translate to characters? What if it's all
               | "oooh" and "ahhh," as in a choral segment?
        
           | IanCal wrote:
           | Do you not know of character meaning, for example, a single
           | letter?
        
             | its_ethan wrote:
             | I believe that's exactly what he means with "text
             | characters".. as in a character (or letter) of some text.
        
             | JLCarveth wrote:
             | > text characters
        
             | TaylorAlexander wrote:
             | "Costs 1000 characters for every minute of audio" suggests
             | this is not about the text characters in the sample. It
             | makes it sound like a form of digital credit.
        
               | lcnPylGDnU4H9OF wrote:
               | Definitely a form of credit. Of course it seems to be
               | rather obtuse but it's still possible to make some sense
               | of it given the information on their pricing page.
               | 
               | The definitely-most-popular Creator price point is
               | 100,000 "characters" for $22, meaning, according to the
               | FAQ, 100 minutes of audio listening costs $22. Not sure
               | why they can't just say "100 listening minutes" or
               | whatever.
               | 
               | Though I just noticed they also claim that the 100k
               | characters is ~120 minutes of audio but 30k is 30 minutes
               | of audio. I'm not sure where they're getting their
               | numbers but it looks like they're either being dishonest
               | about the former or underselling the latter.
        
             | recursive wrote:
             | I know that meaning, as well as the meaning from stories.
             | What I don't know is what this has to do with removing
             | background noise from audio.
        
           | mgkimsal wrote:
           | "Audio generation consumes characters. 1k characters
           | approximate 1 minute of audio. Character counts reset each
           | billing cycle without rollover."
        
         | kps wrote:
         | This comment gets me 3.06 seconds of noise removal.
        
         | jonas21 wrote:
         | The company started out doing text-to-speech and created
         | different pricing tiers based on number of characters in the
         | input text. Now they're branching out into other things, but
         | want to keep the same pricing plans, so the unit is still
         | characters.
        
           | DidYaWipe wrote:
           | That's lame, akin to selling cars by using pears as a unit of
           | exchange.
        
         | chankstein38 wrote:
         | Knowing how elevenlabs works, that's also pretty crazily
         | expensive. Imagine being someone who has a 4hr podcast they
         | want to feed through this. Oh ok just need 240,000 characters!
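
A minimal sketch of the credit arithmetic discussed in this subthread, assuming the FAQ rate quoted above (1000 "characters" per minute of audio) and the $22 for 100,000 characters figure that one commenter cites. The function names are hypothetical and are not part of any ElevenLabs API.

```python
# Assumption: the rate quoted from the FAQ in this thread.
CHARACTERS_PER_MINUTE = 1000


def characters_for_audio(minutes):
    """Credits ("characters") consumed by `minutes` of isolated audio."""
    return round(minutes * CHARACTERS_PER_MINUTE)


def seconds_for_characters(characters):
    """Seconds of audio that a given number of credits would cover."""
    return characters / CHARACTERS_PER_MINUTE * 60


def dollars_per_minute(plan_price, plan_characters):
    """Effective price per audio minute for a plan (figures are assumptions)."""
    minutes_included = plan_characters / CHARACTERS_PER_MINUTE
    return plan_price / minutes_included


if __name__ == "__main__":
    print(characters_for_audio(4 * 60))       # credits for a 4-hour podcast
    print(seconds_for_characters(51))         # a 51-character comment
    print(dollars_per_minute(22, 100_000))    # plan figure quoted by a commenter
```

Under those assumptions, a 4-hour podcast comes to 240,000 characters, a 51-character comment covers about 3.06 seconds, and the quoted plan works out to roughly $0.22 per minute of audio.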
        
         | kiicia wrote:
         | It's like premium currency in games that don't want you to
         | realize how much real money you are spending and premium
         | currency packets are always constructed so that you overspend
         | because you cannot get exact amount needed
        
       | throwup238 wrote:
       | They also just announced licensed celebrity voices in their
       | Reader app this past week.
       | 
       | Judy Garland, Burt Reynolds, Laurence Olivier, and James Dean are
       | the first ones.
        
         | simonw wrote:
         | All four of whom are deceased. I guess they licensed from their
         | estates?
        
           | pfdietz wrote:
           | In California, such rights last 70 years, so Dean loses
           | protection next year, Garland in 2039, and the others much
           | later as they died fairly recently.
        
       | taraparo wrote:
       | I prefer https://product.supertone.ai/clear which is one time
       | payment and not subscription based.
        
       | IvyMike wrote:
       | The one thing I hate about this: There are so-called "first
       | amendment auditors", who professionally annoy people on the
       | street, trying to provoke a reaction. They monetize the resulting
       | video on youtube.
       | 
       | You _used_ to be able to pull out your phone and play Disney
       | soundtracks or Taylor Swift music which would result in the video
       | being non-monetizable. But improvements in audio isolation
       | techniques have now defeated this countermeasure. Being a
       | professional annoyance is once again a career choice.
       | 
       | Edit: this is one instance I've personally seen:
       | https://www.instagram.com/p/C7IEFxQSJQw/?hl=en&img_index=1
        
         | lannisterstark wrote:
         | >professionally annoy people on street
         | 
         | >Provoke a response
         | 
         | They mostly do it to cops and people in authority. It's their
         | right to do so, they should be able to. They expose so many
         | cops and authoritarians who blatantly do not respect citizens'
         | civil rights. Good for them.
         | 
         | The fact that "oh no you're annoying me I'm going to arrest you
         | because you're annoying" is even a talking point from you is
         | baffling.
        
           | IvyMike wrote:
           | > They mostly do it to cops and people in authority.
           | 
           | In Santa Barbara there is a group that targets random
           | businesses; random shops and restaurants with outdoor eating.
           | 
           | It sucks for the business, it sucks for their clients, it
           | sucks for random people walking by on the street.
           | 
           | I'm all for limiting the unchecked authority we give police,
           | we need to end qualified immunity, etc. But we should take
           | the problem on directly. And I'm all for filming cops who
           | abuse their privilege. But the reality I've seen in person is
           | this is sucky.
           | 
           | > The fact that "oh no you're annoying me I'm going to arrest
           | you because you're annoying" is even a talking point from you
           | is baffling.
           | 
           | Who are you replying to? What did I say that's even close to
           | this. Talk about baffling.
        
         | j-bos wrote:
         | Same with police and citizen videographers. Being a citizen
         | reporter is once again an option.
        
         | recursive wrote:
         | This technique relies on bizarrely over-powerful intellectual
         | property infringement counter-measures built into youtube, the
         | platform. Relying on it gives me serious XKCD spacebar-heater
         | vibes. It wasn't designed for that.
         | 
         | Yes, people who are assholes in public are annoying.
         | Shoplifting and bank robbing are probably also career choices.
         | Don't rely on a side effect of "big copyright" systems to save
         | us.
        
         | tallytarik wrote:
         | I know the sorts of "creators" you're talking about, but I've
         | never heard of this as a response before.
         | 
         | Are there really that many people who 1) are aware that this
         | could be effective, and 2) are quick witted enough to pull
         | their phone out and play music in response to being harassed?
        
           | thenewwazoo wrote:
           | Yes, police do it, in fact:
           | https://www.vice.com/en/article/z3n75x/police-disney-
           | music-c...
        
           | TechDebtDevin wrote:
           | While 1st amendment auditors are cringe, they only annoy
           | police. This comment is definitely pro LE coded and the irony
           | does not escape me with its criticism of people expressing
           | their 1st amendment (regardless of how annoying their
           | method).
           | 
           | Perhaps we should demonetize every form of journalism and
           | media that annoys this guy!
        
             | PierceJoy wrote:
             | I'm not sure what videos you're watching, but the majority
             | of the ones that pop up on my feeds are them annoying non-
             | LE government workers and regular people trying to use
             | government services like the post office, passport office,
             | etc. Yes, the police show up eventually, but only after
             | they've harassed people just trying to do their jobs and
             | live their life.
        
             | IvyMike wrote:
             | The ones I've seen _in person_ do not "only annoy police";
             | they may have been trying to provoke a police response, but
             | they were just harassing normies on the street as well.
             | Linking to the ones I've seen:
             | https://www.instagram.com/p/C7IEFxQSJQw/?hl=en&img_index=1
        
         | lacoolj wrote:
         | You mean like "Billy on the Street"? Or something closer to the
         | "comedy gang" from Viral Hit?
         | 
         | Can't really imagine why you would both give a response to an
         | interview-style question, while being recorded, and
         | simultaneously not want that response to be public. Or are they
         | doing it secretly?
        
           | IvyMike wrote:
           | I added this to the original post, but here's the incident
           | that I saw that made me aware of the whole scene:
           | https://www.instagram.com/p/C7IEFxQSJQw/?hl=en&img_index=1
        
         | leobg wrote:
         | Why not just sue the hell out of them? Would also break their
         | business model really fast.
        
           | office_drone wrote:
           | Because there's no legal basis to sue them.
        
             | mrtesthah wrote:
             | Maybe there is, maybe there isn't. But they'd be forced to
             | pay exorbitant fees to lawyers regardless.
        
               | office_drone wrote:
               | If a judge finds that the lawsuit was frivolous then
               | their exorbitant lawyer fees are now yours to pay.
        
         | TechDebtDevin wrote:
         | Cops on HN!? Just a heads up, you have a choice not to watch
         | and inadvertently reward people who annoy you on the internet.
        
         | aftbit wrote:
         | >You used to be able to pull out your phone and play Disney
         | soundtracks or Taylor Swift music which would result in the
         | video being non-monetizable. But improvements in audio
         | isolation techniques have now defeated this countermeasure.
         | 
         | In my opinion, this is a bug, not a feature. If you pull out
         | your phone and play Taylor Swift, you are in fact making a
         | public performance without permission. Even if you had
         | permission (as some cops allegedly do to use some bands' music
         | for this purpose), this is not the correct method to deal with
         | professional annoyances.
         | 
         | As a police officer, your job is to be the adult in the room.
         | Society is trusting you with a tremendous amount of power. If
         | you can't handle some annoying whiny YouTubers professionally
         | without using "countermeasures", you should hang up your badge
         | and get another job.
        
       | tomaskafka wrote:
       | Are there actual before/after samples? I'm sure as hell not
       | sending samples of my voice to AI voice cloning company.
        
         | ygjb wrote:
         | I mean, you don't have to?
         | 
         | Set up an audio source, for example, your phone, playing a
         | reasonable length of talking, for example a youtube video, or a
         | podcast on spotify. Then record from your computer or other
         | recording device, and test with that?
        
       | Murky3515 wrote:
       | Please think twice before sharing your personal voice samples
       | with a random online website just because they offer a cool demo.
        
         | Workaccount2 wrote:
         | I suspect in the near future people are going to report
         | randomly hearing themselves in advertisements.
        
           | recursive wrote:
           | Friends and loved ones seems like a better method.
        
             | Eisenstein wrote:
             | Most people either wouldn't recognize their own voice or
             | would hate hearing it.
        
               | latentsea wrote:
               | Imagine if they were able to make a model that's your
               | voice but the way that you hear it. That'd be so neat.
               | You could hear how other people hear their own voice and
               | have fun playing with it for an afternoon before moving
               | onto the next shiny new toy.
        
       | jdprgm wrote:
       | Elevenlabs has some pretty cool stuff but I really despise how
       | it's all cloud based. Wish there was an audio ai company
       | following a path similar to what topaz has been doing for
       | video/photo ai with desktop software. Open source has been
       | lagging more than I expected in this area too.
        
         | echelon wrote:
         | GPTSOVITS, StyleTTS2, and RVCv2 are still the open source SOTA
         | for TTS and voice conversion. These models are unfortunately
         | really far behind Elevenlabs' offerings. We're not much further
         | along than the Tacotron2 (2018) days.
         | 
         | Elevenlabs is the only model company I can think of that is
         | ahead of everyone else in their category. Video and LLMs are
         | hyper competitive, but voice is a one-company game. Elevenlabs
         | hired up everyone in the space and utterly dominates.
         | 
         | I'm hoping this changes. They've been in pole position for over
         | a year and a half now with nobody even coming close.
         | 
         | There's probably a reason why they're so research-oriented. The
         | minute an open source model is released that rivals Elevenlabs
         | in quality, they're in big trouble. There's absolutely zero
         | moat for their current products and there are fifty companies
         | nipping at their heels that want to be in the same spot.
         | Elevenlabs' current margins are juicy.
        
       | chmars wrote:
       | How is this different from Auphonic?
       | 
       | https://auphonic.com/features
        
         | SkyPuncher wrote:
         | Why does it have to be different?
        
       | andrewstuart wrote:
       | Tried it with several files.
       | 
       | It didn't seem to do much better than audio filters for ffmpeg
       | that have been tuned for removing background noise and enhancing
       | voice. Maybe I'm missing something or using the wrong source
       | data.
        
       | ec109685 wrote:
       | I had very loud background music playing, and while it could
       | completely eliminate that (impressive!), the voice was much more
       | garbled than when there wasn't any background noise playing.
        
       | almog wrote:
       | I'd like to have something else but for live calls: a process
       | that takes two audio inputs and "subtracts" the noise from one
       | input from the other. My use case would be to have two dynamic
       | microphones, one directed at the window and one that I'm using
       | for a conference call. I'm assuming having two inputs should make
       | the process easier for real time (20ms?) processing and might
       | require less compute.
       | 
       | If such a process can output a clear sound, I could chain it
       | with Blackhole and use the processed clear signal as an input
       | for the call.
        
         | chankstein38 wrote:
         | Assuming that this is set up so that the same sound is coming
         | through both microphones just one with your voice on top, you
         | could theoretically do this just by feeding it through
         | something that inverts the polarity of the "to be cancelled
         | out" sound and overlays the two sounds. I'm sure it wouldn't be
         | perfect but you might be able to tune it to properly do it.
         | This is how active noise cancelling works!
        
           | almog wrote:
           | Thank you for the idea. I've tried in the past to do
           | something similar but couldn't get it right. I did try to
           | rely on ideas from ANC but my domain knowledge is very
           | lacking. It's been over 2 years since so I might need to give
           | it another chance and see if any off the shelf
           | library/product has been released since then.
        
           | CPLX wrote:
           | That doesn't work though, the requirement for the timing to
           | align waveform by waveform is too high, and the speed of
           | sound is too slow. Also the frequency response isn't going to
           | match exactly.
           | 
           | To do it right you really want digital analysis.
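
The timing objection above can be illustrated with a small, hedged sketch. It assumes a 48 kHz sample rate, pure tones standing in for the voice (200 Hz) and the background noise (3 kHz), and a delay of whole samples between the two microphones. All names are hypothetical, and this is a toy model, not how any product in this thread works.

```python
import math

SAMPLE_RATE = 48_000  # Hz; an assumption for this sketch


def tone(freq_hz, n_samples, delay_samples=0):
    """Unit-amplitude sine wave, optionally delayed by whole samples."""
    return [
        math.sin(2 * math.pi * freq_hz * (i - delay_samples) / SAMPLE_RATE)
        for i in range(n_samples)
    ]


def rms(signal):
    """Root-mean-square level of a signal."""
    return math.sqrt(sum(x * x for x in signal) / len(signal))


def cancel(primary, reference):
    """Invert the reference and mix it with the primary (i.e. subtract it)."""
    return [p - r for p, r in zip(primary, reference)]


def residual_noise_rms(delay_samples, n_samples=4800):
    """Noise left in the voice mic after subtracting a delayed noise reference."""
    voice = tone(200, n_samples)
    noise_at_voice_mic = tone(3000, n_samples)
    noise_at_window_mic = tone(3000, n_samples, delay_samples)
    primary = [v + n for v, n in zip(voice, noise_at_voice_mic)]
    cleaned = cancel(primary, noise_at_window_mic)
    # Remove the known voice so that only the leftover noise is measured.
    return rms([c - v for c, v in zip(cleaned, voice)])


if __name__ == "__main__":
    for delay in (0, 1, 8):
        print(delay, residual_noise_rms(delay))
```

With perfect alignment the noise cancels almost entirely. A one-sample offset (roughly 7 mm of extra path at about 343 m/s) already leaves a residual of roughly 0.28 RMS against 0.71 uncancelled, and an eight-sample offset is half a period of the 3 kHz tone, so the "cancellation" doubles the noise instead of removing it.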
        
         | inhumantsar wrote:
         | DeepFilterNet doesn't use a second microphone but it does do an
         | absurdly good job of removing not-speech from inputs in
         | realtime. Check out the demo video linked in the README. iirc
         | they demonstrate removing guitar sounds and even a vacuum
         | cleaner.
         | 
         | It does take some technical elbow grease to integrate but I've
         | used it in calls and while gaming on Linux via Pipewire to
         | great effect.
         | 
         | https://github.com/Rikorose/DeepFilterNet
        
           | almog wrote:
           | The demo looks very impressive, I'll try it out, thank you!
        
         | tapoxi wrote:
         | Don't most smartphones these days have a second noise
         | cancelling microphone?
        
           | almog wrote:
           | Yes but I was looking for a way to run a process that does
           | that on your desktop because I'd like to do ANC using two
           | dynamic mics.
        
           | DidYaWipe wrote:
           | I think they do. I've also had huge problems with Zoom when
           | talking to my parents, because for some reason they are
           | aggressively muted for several seconds while nobody else on
           | the call is canceled like that. If anyone else on the call so
           | much as clears his throat, my parents are muted and we all
           | have to sit silently waiting for them to be able to talk.
           | Annoying as shit.
           | 
           | I suspect that this is noise cancellation that's failing
           | because they keep their phone far away from themselves, to
           | fit two people in the shot; and audio is bouncing off the
           | walls or otherwise suffering enough delay to mess it up.
        
         | conception wrote:
         | Krisp does a pretty great job of this currently without the two
         | mics.
        
           | almog wrote:
           | Thanks, haven't tried it yet. Are you using it with a dynamic
           | mic? Did you get the same crisp detailed sound as you get
           | when using the mic as a raw input?
        
         | DidYaWipe wrote:
         | You'd think every DAW would have something like this: Subtract
         | everything that's stereo (AKA keep only the sound that's
         | present in both channels).
         | 
         | I have old mono records that I wanted to clean up. In that
         | case, any stereo content is obviously scratches and surface
         | noise, so removing it would be most of the job. But nope... not
         | one DAW offered this filter, despite offering the opposite
         | (removing mono content and keeping the stereo).
         | 
         | And yes I did try removing the mono content and then
         | subtracting the result from the full source, but this didn't
         | work; I don't remember (or know) why.
        
           | efilife wrote:
           | maybe you just didn't know how to do it. You can do this
           | easily in FL studio (even free version) using stereo shaper
        
             | DidYaWipe wrote:
             | Never heard of "FL studio."
        
           | gtvwill wrote:
           | That's pretty interesting. I don't suppose you could do it
           | with some manual physics/electrical engineering wizardry like
           | Dave Rat uses in this video for canceling out audio for a
           | centre speaker?
           | 
           | https://youtu.be/AxZOv0baN2Y?si=fc51MQHRItT6nYKI
        
           | recursive wrote:
           | Voxengo MSED is a free VST that can set gain/level on mid and
           | side independently. https://www.voxengo.com/product/msed/
        
             | DidYaWipe wrote:
             | Cool, thanks. I'll check it out.
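
The mid/side split mentioned in this subthread can be sketched in a few lines. This is a minimal illustration that assumes the two channels are already sample-aligned lists of floats. The function names are hypothetical and do not reflect how MSED or any DAW implements the operation.

```python
def mid_side(left, right):
    """Split a stereo pair into mid (common) and side (difference) signals."""
    mid = [(l + r) / 2 for l, r in zip(left, right)]
    side = [(l - r) / 2 for l, r in zip(left, right)]
    return mid, side


def keep_common(left, right):
    """Keep only the mid signal, discarding the side (stereo-only) signal."""
    mid, _side = mid_side(left, right)
    return mid


if __name__ == "__main__":
    # Mono content [0.5, -0.25] plus an out-of-phase component [0.25, -0.125].
    left = [0.75, -0.375]
    right = [0.25, -0.125]
    print(mid_side(left, right))
```

Keeping only the mid signal removes content that is exactly out of phase between the channels, but a click present in only one channel is merely halved, not eliminated, which may be part of why this approach falls short on real records.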
        
       | dayjah wrote:
       | My test sample, me talking with my baby babbling in the
       | background, returned a silent audio track. I guess neither I nor
       | the baby is considered signal ~_~
        
         | CSSer wrote:
         | I'm sorry you had to find out this way, Deckard (Rachel?).
        
       | fbnspl wrote:
       | If you're looking for a fair- and non-confusing-priced web app
       | for creators, an API or real-time SDK for voice isolation, give
       | our solution a try: https://ai-coustics.com/
        
         | rexreed wrote:
         | Looks good to me! Is this for video only or can you also upload
         | m4a and mp3?
        
       | simshay wrote:
       | I have used ai|coustics previously and I think their output
       | quality is way better than Eleven Labs or Auphonic. They really
       | do a good job there.
        
       | gtvwill wrote:
       | Or I could just download virtual dj and run it for free on a
       | computer and just do this locally, right now, with zero fancy
       | hardware and arguably some of the best stems algorithms on the
       | market.
        
       | dc3k wrote:
       | i think i'll stick to nvidia broadcast for this
        
       ___________________________________________________________________
       (page generated 2024-07-03 23:00 UTC)