[HN Gopher] Voice Isolator: Strip background noise for film, pod...
___________________________________________________________________
Voice Isolator: Strip background noise for film, podcast, interview
production
Author : davidbarker
Score : 104 points
Date : 2024-07-03 19:51 UTC (3 hours ago)
(HTM) web link (elevenlabs.io)
(TXT) w3m dump (elevenlabs.io)
| IncreasePosts wrote:
| What is the current SOTA for voice->text?
|
| I have a recording I've been sitting on for 2 years (a guest
| lecture which a friend recorded) which contains a very heavy
| amount of background noise, where you can just barely make out
| what is being said by the lecturer. I wonder if there is any hope
| I will ever be able to read a transcript from it.
|
| I can figure out what the lecturer is saying (maybe only because
| I have some context about what he is talking about), but it is
| too painful to sit through 2 hours of it and try to transcribe
| it.
|
| I tried uploading the audio file to this service, but basically
| get nothing useful returned to me.
| bequanna wrote:
| I've had good results with Whisper. You can use the OpenAI API
| and it is also open source:
|
| https://github.com/openai/whisper
| geerlingguy wrote:
| And if you have a Mac, get MacWhisper. It's been a godsend
| for transcribing almost anything. Usually pretty good if the
| main voice is discernible at all--though in OP's case, if the
| main voice is almost indistinguishable it might not do
| amazing.
| pants2 wrote:
| Deepgram Nova 2 is among the best right now, more accurate than
| Whisper in my testing.
| xan_ps007 wrote:
| We have built a dockerized open source stack for Whisper +
| Llama3 + MeloTTS. Whisper and MeloTTS work fine for now for
| our use cases.
| https://github.com/bolna-ai/bolna/tree/master/examples/whisp...
| Prompter9856 wrote:
| You can upload your file and try out Deepgram for free, just
| to see what the results look like for your audio. No harm in
| trying: https://deepgram.com/free-transcription
|
| Disclosure: I work for Deepgram
| ProfessorLayton wrote:
| Try giving Audacity a shot to clean up the audio, it has a
| built-in noise reduction feature that's configurable. I've used
| it to varying degrees of success, but works especially well
| with the same sounds ANC headphones are good at blocking.
| duped wrote:
| fwiw, the commercial services that do this are called "audio
| forensics" (unsurprisingly, they're usually hired by cops and
| lawyers). You pay them to use their (often expensive) software
| tools to clean up audio and provide a transcription.
|
| I get the appeal of automating this task but the SOTA is not to
| automate it at all.
| willsmith72 wrote:
| you can pay people online trivial amounts of money for this, it
| will be far cheaper and quicker than waiting for the right AI
|
| by the way, even once we get to a sufficient AI, how do you
| verify it without listening to the whole thing anyway?
|
| it's only 2 hours, if you're a fast typer at max it would take
| you 1 work day to transcribe yourself, or <$200 by a
| professional
| IncreasePosts wrote:
| It's not an issue of typing speed, it's that puzzling out
| what was said takes a number of re-listens, at high gain
| which hurts my ears after a while since the voice is just
| barely above the noise floor.
| dbspin wrote:
| I don't know about SOTA, but 'Adobe Podcast Studio' a web app
| that I believe is still free / in beta offers excellent sound
| cleanup. So much so that many podcast / radio producers I know
| no longer frequently use Izotope RX - one of the industry
| standard tools. Adobe are obviously horrendous, but if it's for
| a one time use I'd give it a go. The feature you want is the
| 'enhance speech filter'.
|
| https://podcast.adobe.com/enhance
| wkcheng wrote:
| This is really helpful, thanks! I have a bunch of audio that
| I need to clean up and this looks like it could fit the bill.
|
| Do you know if there are any license issues with this? I
| don't see any license page--will they train/retain the
| recording?
| echelon wrote:
| STT: Whisper
|
| TTS: GPTSOVITS / StyleTTS2
|
| VTV: RVCv2
|
| Open source isn't really doing a great job at voice, music,
| or video. It's managing to keep up in LLM and image spaces,
| but it's falling far behind in the multimedia department.
| dougdonohoe wrote:
| From the FAQ, under "How much does it cost": Voice Isolator costs
| 1000 characters for every minute of audio.
|
| Since when are characters a currency?
| icepat wrote:
| This confused me as well, and made me lose interest.
| Intentional non-answers like this are rather grating.
| samspenc wrote:
| This is par for the course for any text-to-speech or speech-to-
| text service these days, check out ElevenLabs' other service
| pricing and it is similar - they have monthly pricing but the
| character usage is capped at each level.
|
| Actually other major cloud providers, including AWS, Azure and
| GCP, have similar character / token / word count based pricing
| as well.
| recursive wrote:
| Text-to-speech and speech-to-text, both involve text, which
| has characters. Notably, this does not.
| chankstein38 wrote:
| I'm not trying to argue with you but just want to point
| out, this is why they made an equivalence between time and
| characters. We know 1min of audio == 1000 characters so now
| we know how to translate characters to time.
| recursive wrote:
| If that's really it, then why the obtuse intermediate
| conversion unit? Just bill by the minute.
| Rebelgecko wrote:
| They charge in characters per time, not money per character
| IvyMike wrote:
| I agree with your annoyance here.
|
| I did end up clicking thru to get the full story. On their
| pricing page (https://elevenlabs.io/pricing) they are up front,
| with several monthly tiers; their "most popular" $11/month tier
| says "100k Characters/mo (~120 mins audio)"
| recursive wrote:
| Their pricing page is full of weird uses of "character" as a
| unit too. Things like "monthly character limit", "additional
| character pricing", and so on.
|
| https://elevenlabs.io/pricing
|
| Nowhere is this novel usage of "character" defined. I know
| about text characters, and story characters. But this seems to
| be different. It's hard to imagine why they didn't define what
| they mean by "character" or just make the pricing model more
| straight-forward.
| saghm wrote:
| I know this is a common kneejerk reaction nowadays, but it's
| hard not to wonder if this is due to using AI to generate the
| text on these pages.
| slama wrote:
| I assure you they didn't use an LLM to invent their pricing
| strategy. Elevenlabs subscription levels are designed
| around text-to-speech conversion as the primary use case.
| They charge by the character when converting text to
| speech, so characters are kind of the currency on their
| site. 1000 characters per minute makes sense in that
| context, though I find it surprisingly expensive compared to
| generation.
| saghm wrote:
| I still don't think I understand the pricing model based
| on your additional info. If characters are currency that
| you buy with real money, what does "characters per
| minute" mean? I guess if each character is $0.01 and then
| you want 20 minutes of audio, you can say that's $10.00
| per minute for 20 minutes, but that means that the number
| of characters in the actual text wouldn't affect it at
| all, so...why even make up a different currency then?
| DidYaWipe wrote:
| Not to mention: What if the vocal portion of an audio
| clip doesn't translate to characters? What if it's all
| "oooh" and "ahhh," as in a choral segment?
| IanCal wrote:
| Do you not know of character meaning, for example, a single
| letter?
| its_ethan wrote:
| I believe that's exactly what he means with "text
| characters".. as in a character (or letter) of some text.
| JLCarveth wrote:
| > text characters
| TaylorAlexander wrote:
| "Costs 1000 characters for every minute of audio" suggests
| this is not about the text characters in the sample. It
| makes it sound like a form of digital credit.
| lcnPylGDnU4H9OF wrote:
| Definitely a form of credit. Of course it seems to be
| rather obtuse but it's still possible to make some sense
| of it given the information on their pricing page.
|
| The definitely-most-popular Creator price point is
| 100,000 "characters" for $22, meaning, according to the
| FAQ, 100 minutes of audio listening costs $22. Not sure
| why they can't just say "100 listening minutes" or
| whatever.
|
| Though I just noticed they also claim that the 100k
| characters is ~120 minutes of audio but 30k is 30 minutes
| of audio. I'm not sure where they're getting their
| numbers but it looks like they're either being dishonest
| about the former or underselling the latter.
| recursive wrote:
| I know that meaning, as well as the meaning from stories.
| What I don't know is what this has to do with removing
| background noise from audio.
| mgkimsal wrote:
| "Audio generation consumes characters. 1k characters
| approximate 1 minute of audio. Character counts reset each
| billing cycle without rollover."
| kps wrote:
| This comment gets me 3.06 seconds of noise removal.
| jonas21 wrote:
| The company started out doing text-to-speech and created
| different pricing tiers based on number of characters in the
| input text. Now they're branching out into other things, but
| want to keep the same pricing plans, so the unit is still
| characters.
| DidYaWipe wrote:
| That's lame, akin to selling cars by using pears as a unit of
| exchange.
| chankstein38 wrote:
| Knowing how elevenlabs works, that's also pretty crazily
| expensive. Imagine being someone who has a 4hr podcast they
| want to feed through this. Oh ok just need 240,000 characters!
| kiicia wrote:
| It's like premium currency in games that don't want you to
| realize how much real money you are spending and premium
| currency packets are always constructed so that you overspend
| because you cannot get exact amount needed
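The character arithmetic debated in this subthread can be sketched in a few lines. This is a minimal sketch that assumes only the FAQ figure quoted above (1000 characters per minute of audio). The function names are illustrative and are not part of any ElevenLabs API, and the $22 for 100,000 characters figure is the one quoted by a commenter, not a verified price.

```python
CHARS_PER_MINUTE = 1000  # FAQ figure quoted in the thread


def seconds_for_characters(characters: int) -> float:
    """Seconds of Voice Isolator processing that a character budget buys."""
    return characters * 60 / CHARS_PER_MINUTE


def characters_for_minutes(minutes: float) -> int:
    """Characters consumed by a given number of minutes of audio."""
    return round(minutes * CHARS_PER_MINUTE)


def price_per_minute(plan_price: float, plan_characters: int) -> float:
    """Effective price per minute of audio for a plan (hypothetical helper)."""
    return plan_price / (plan_characters / CHARS_PER_MINUTE)


kps_comment = "This comment gets me 3.06 seconds of noise removal."
print(seconds_for_characters(len(kps_comment)))  # the comment's own length
print(characters_for_minutes(4 * 60))            # the 4-hour podcast case
print(price_per_minute(22, 100_000))             # plan figures quoted above
```

Under these assumptions the numbers in the thread check out: the 51-character comment buys 3.06 seconds, a 4-hour podcast needs 240,000 characters, and $22 for 100,000 characters works out to $0.22 per minute of audio.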
| throwup238 wrote:
| They also just announced licensed celebrity voices in their
| Reader app this past week.
|
| Judy Garland, Burt Reynolds, Laurence Olivier, and James Dean are
| the first ones.
| simonw wrote:
| All four of whom are deceased. I guess they licensed from their
| estates?
| pfdietz wrote:
| In California, such rights last 70 years, so Dean loses
| protection next year, Garland in 2039, and the others much
| later as they died fairly recently.
| taraparo wrote:
| I prefer https://product.supertone.ai/clear which is one time
| payment and not subscription based.
| IvyMike wrote:
| The one thing I hate about this: There are so-called "first
| amendment auditors", who professionally annoy people on the
| street, trying to provoke a reaction. They monetize the resulting
| video on youtube.
|
| You _used_ to be able to pull out your phone and play Disney
| soundtracks or Taylor Swift music which would result in the video
| being non-monetizable. But improvements in audio isolation
| techniques have now defeated this countermeasure. Being a
| professional annoyance is once again a career choice.
|
| Edit: this is one instance I've personally seen:
| https://www.instagram.com/p/C7IEFxQSJQw/?hl=en&img_index=1
| lannisterstark wrote:
| >professionally annoy people on street
|
| >Provoke a response
|
| They mostly do it to cops and people in authority. It's their
| right to do so, they should be able to. They expose so many
| cops and authoritarians who blatantly do not respect citizens'
| civil rights. Good for them.
|
| The fact that "oh no you're annoying me I'm going to arrest you
| because you're annoying" is even a talking point from you is
| baffling.
| IvyMike wrote:
| > They mostly do it to cops and people in authority.
|
| In Santa Barbara there is a group that targets random
| businesses; random shops and restaurants with outdoor eating.
|
| It sucks for the business, it sucks for their clients, it
| sucks for random people walking by on the street.
|
| I'm all for limiting the unchecked authority we give police,
| we need to end qualified immunity, etc. But we should take
| the problem on directly. And I'm all for filming cops who
| abuse their privilege. But the reality I've seen in person is
| this is sucky.
|
| > The fact that "oh no you're annoying me I'm going to arrest
| you because you're annoying" is even a talking point from you
| is baffling.
|
| Who are you replying to? What did I say that's even close to
| this. Talk about baffling.
| j-bos wrote:
| Same with police and citizen videographers. Being a citizen
| reporter is once again an option.
| recursive wrote:
| This technique relies on bizarrely over-powerful intellectual
| property infringement counter-measures built into youtube, the
| platform. Relying on it gives me serious XKCD spacebar-heater
| vibes. It wasn't designed for that.
|
| Yes, people who are assholes in public are annoying.
| Shoplifting and bank robbing are probably also career choices.
| Don't rely on a side effect of "big copyright" systems to save
| us.
| tallytarik wrote:
| I know the sorts of "creators" you're talking about, but I've
| never heard of this as a response before.
|
| Are there really that many people who 1) are aware that this
| could be effective, and 2) are quick witted enough to pull
| their phone out and play music in response to being harassed?
| thenewwazoo wrote:
| Yes, police do it, in fact:
| https://www.vice.com/en/article/z3n75x/police-disney-music-c...
| TechDebtDevin wrote:
| While 1st amendment auditors are cringe, they only annoy
| police. This comment is definitely pro LE coded and the irony
| does not escape me with its criticism of people expressing
| their 1st amendment (regardless of how annoying their
| method).
|
| Perhaps we should demonetize every form of journalism and
| media that annoys this guy!
| PierceJoy wrote:
| I'm not sure what videos you're watching, but the majority
| of the ones that pop up on my feeds are them annoying non-
| LE government workers and regular people trying to use
| government services like the post office, passport office,
| etc. Yes, the police show up eventually, but only after
| they've harassed people just trying to do their jobs and
| live their life.
| IvyMike wrote:
| The ones I've seen _in person_ do not "only annoy police";
| they may have been trying to provoke a police response, but
| they were just harassing normies on the street as well.
| Linking to the ones I've seen:
| https://www.instagram.com/p/C7IEFxQSJQw/?hl=en&img_index=1
| lacoolj wrote:
| You mean like "Billy on the Street"? Or something closer to the
| "comedy gang" from Viral Hit?
|
| Can't really imagine why you would both give a response to an
| interview-style question, while being recorded, and
| simultaneously not want that response to be public. Or are they
| doing it secretly?
| IvyMike wrote:
| I added this to the original post, but here's the incident
| that I saw that made me aware of the whole scene:
| https://www.instagram.com/p/C7IEFxQSJQw/?hl=en&img_index=1
| leobg wrote:
| Why not just sue the hell out of them? Would also break their
| business model really fast.
| office_drone wrote:
| Because there's no legal basis to sue them.
| mrtesthah wrote:
| Maybe there is, maybe there isn't. But they'd be forced to
| pay exorbitant fees to lawyers regardless.
| office_drone wrote:
| If a judge finds that the lawsuit was frivolous then
| their exorbitant lawyer fees are now yours to pay.
| TechDebtDevin wrote:
| Cops on HN!? Just a heads up, you have a choice not to watch
| and inadvertently reward people who annoy you on the internet.
| aftbit wrote:
| >You used to be able to pull out your phone and play Disney
| soundtracks or Taylor Swift music which would result in the
| video being non-monetizable. But improvements in audio
| isolation techniques have now defeated this countermeasure.
|
| In my opinion, this is a bug, not a feature. If you pull out
| your phone and play Taylor Swift, you are in fact making a
| public performance without permission. Even if you had
| permission (as some cops allegedly do to use some bands' music
| for this purpose), this is not the correct method to deal with
| professional annoyances.
|
| As a police officer, your job is to be the adult in the room.
| Society is trusting you with a tremendous amount of power. If
| you can't handle some annoying whiny YouTubers professionally
| without using "countermeasures", you should hang up your badge
| and get another job.
| tomaskafka wrote:
| Are there actual before/after samples? I'm sure as hell not
| sending samples of my voice to an AI voice cloning company.
| ygjb wrote:
| I mean, you don't have to?
|
| Set up an audio source, for example, your phone, playing a
| reasonable length of talking, for example a youtube video, or a
| podcast on spotify. Then record from your computer or other
| recording device, and test with that?
| Murky3515 wrote:
| Please think twice before sharing your personal voice samples
| with a random online website just because they offer a cool demo.
| Workaccount2 wrote:
| I suspect in the near future people are going to report
| randomly hearing themselves in advertisements.
| recursive wrote:
| Friends and loved ones seems like a better method.
| Eisenstein wrote:
| Most people either wouldn't recognize their own voice or
| would hate hearing it.
| latentsea wrote:
| Imagine if they were able to make a model that's your
| voice but the way that you hear it. That'd be so neat.
| You could hear how other people hear their own voice and
| have fun playing with it for an afternoon before moving
| onto the next shiny new toy.
| jdprgm wrote:
| Elevenlabs has some pretty cool stuff but I really despise how
| it's all cloud based. Wish there was an audio ai company
| following a path similar to what topaz has been doing for
| video/photo ai with desktop software. Open source has been
| lagging more than I expected in this area too.
| echelon wrote:
| GPTSOVITS, StyleTTS2, and RVCv2 are still the open source SOTA
| for TTS and voice conversion. These models are unfortunately
| really far behind Elevenlabs' offerings. We're not much further
| along than the Tacotron2 (2018) days.
|
| Elevenlabs is the only model company I can think of that is
| ahead of everyone else in their category. Video and LLMs are
| hyper competitive, but voice is a one-company game. Elevenlabs
| hired up everyone in the space and utterly dominates.
|
| I'm hoping this changes. They've been in pole position for over
| a year and a half now with nobody even coming close.
|
| There's probably a reason why they're so research-oriented. The
| minute an open source model is released that rivals Elevenlabs
| in quality, they're in big trouble. There's absolutely zero
| moat for their current products and there are fifty companies
| nipping at their heels that want to be in the same spot.
| Elevenlabs' current margins are juicy.
| chmars wrote:
| How is this different from Auphonic?
|
| https://auphonic.com/features
| SkyPuncher wrote:
| Why does it have to be different?
| andrewstuart wrote:
| Tried it with several files.
|
| It didn't seem to do much better than audio filters for ffmpeg
| that have been tuned for removing background noise and enhancing
| voice. Maybe I'm missing something or using the wrong source
| data.
| ec109685 wrote:
| I had very loud background music playing, and while it could
| completely eliminate that (impressive!), the voice was much more
| garbled than when there wasn't any background noise playing.
| almog wrote:
| I'd like to have something else but for live calls: a process
| that takes two audio inputs and "subtracts" the noise from one
| input from the other. My use case would be to have two dynamic
| microphones, one directed at the window and one that I'm using
| for a conference call. I'm assuming having two inputs should make
| the process easier for real time (20ms?) processing and might
| require less compute.
|
| If such a process can output a clear sound, I could chain it with
| Blackhole and use the processed clear signal as an input for the
| call.
| chankstein38 wrote:
| Assuming that this is set up so that the same sound is coming
| through both microphones just one with your voice on top, you
| could theoretically do this just by feeding it through
| something that inverts the polarity of the "to be cancelled
| out" sound and overlays the two sounds. I'm sure it wouldn't be
| perfect but you might be able to tune it to properly do it.
| This is how active noise cancelling works!
| almog wrote:
| Thank you for the idea. I've tried in the past to do
| something similar but couldn't get it right. I did try to
| rely on ideas from ANC but my domain knowledge is very
| lacking. It's been over 2 years since so I might need to give
| it another chance and see if any off the shelf
| library/product has been released since then.
| CPLX wrote:
| That doesn't work though, the requirement for the timing to
| align waveform by waveform is too high, and the speed of
| sound is too slow. Also the frequency response isn't going to
| match exactly.
|
| To do it right you really want digital analysis.
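The two positions above (invert the reference microphone and sum, versus "the timing requirement is too high") can be illustrated with a toy simulation. This is a sketch under idealized assumptions, not a working canceller: the "voice" and "noise" are pure sine tones, both microphones have identical frequency response, and the 24-sample delay is an arbitrary illustrative value (0.5 ms at 48 kHz, roughly 17 cm of extra acoustic path).

```python
import math

SAMPLE_RATE = 48_000  # samples per second


def tone(freq_hz, n_samples, delay_samples=0, amplitude=1.0):
    """Sine tone, optionally delayed by a whole number of samples."""
    return [
        amplitude * math.sin(2 * math.pi * freq_hz * (n - delay_samples) / SAMPLE_RATE)
        for n in range(n_samples)
    ]


def rms(signal):
    """Root-mean-square level of a signal."""
    return math.sqrt(sum(x * x for x in signal) / len(signal))


def cancel(primary, reference):
    """Invert the polarity of the reference mic and sum it with the primary mic."""
    return [p + (-r) for p, r in zip(primary, reference)]


n = 4800                                   # 0.1 s of audio
voice = tone(220.0, n, amplitude=0.5)      # stand-in for the speaker
noise = tone(1000.0, n)                    # stand-in for the window noise
primary = [v + x for v, x in zip(voice, noise)]  # call mic: voice + noise

# Case 1: the reference mic hears exactly the same noise at the same instant.
aligned_out = cancel(primary, noise)
error_aligned = rms([o - v for o, v in zip(aligned_out, voice)])

# Case 2: the reference mic hears the noise 24 samples late.
late_noise = tone(1000.0, n, delay_samples=24)
late_out = cancel(primary, late_noise)
error_late = rms([o - v for o, v in zip(late_out, voice)])

print("residual noise, aligned:", error_aligned)
print("residual noise, delayed:", error_late)
print("original noise level:   ", rms(noise))
```

In the aligned case the residual is at floating-point rounding level. In the delayed case the 1 kHz tone arrives half a period late, so the "cancelled" output carries more noise than the untreated signal, which is the timing problem raised in the reply.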
| inhumantsar wrote:
| DeepFilterNet doesn't use a second microphone but it does do an
| absurdly good job of removing not-speech from inputs in
| realtime. Check out the demo video linked in the README. iirc
| they demonstrate removing guitar sounds and even a vacuum
| cleaner.
|
| It does take some technical elbow grease to integrate but I've
| used it in calls and while gaming on Linux via Pipewire to
| great effect.
|
| https://github.com/Rikorose/DeepFilterNet
| almog wrote:
| The demo looks very impressive, I'll try it out, thank you!
| tapoxi wrote:
| Don't most smartphones these days have a second noise
| cancelling microphone?
| almog wrote:
| Yes but I was looking for a way to run a process that does
| that on your desktop because I'd like to do ANC using two
| dynamic mics.
| DidYaWipe wrote:
| I think they do. I've also had huge problems with Zoom when
| talking to my parents, because for some reason they are
| aggressively muted for several seconds while nobody else on
| the call is canceled like that. If anyone else on the call so
| much as clears his throat, my parents are muted and we all
| have to sit silently waiting for them to be able to talk.
| Annoying as shit.
|
| I suspect that this is noise cancellation that's failing
| because they keep their phone far away from themselves, to
| fit two people in the shot; and audio is bouncing off the
| walls or otherwise suffering enough delay to mess it up.
| conception wrote:
| Krisp does a pretty great job of this currently without the two
| mics.
| almog wrote:
| Thanks, haven't tried it yet. Are you using it with a dynamic
| mic? Did you get the same crisp detailed sound as you get
| when using the mic as a raw input?
| DidYaWipe wrote:
| You'd think every DAW would have something like this: Subtract
| everything that's stereo (AKA keep only the sound that's
| present in both channels).
|
| I have old mono records that I wanted to clean up. In that
| case, any stereo content is obviously scratches and surface
| noise, so removing it would be most of the job. But nope... not
| one DAW offered this filter, despite offering the opposite
| (removing mono content and keeping the stereo).
|
| And yes I did try removing the mono content and then
| subtracting the result from the full source, but this didn't
| work; I don't remember (or know) why.
| efilife wrote:
| maybe you just didn't know how to do it. You can do this
| easily in FL studio (even free version) using stereo shaper
| DidYaWipe wrote:
| Never heard of "FL studio."
| gtvwill wrote:
| That's pretty interesting. I don't suppose you could do it
| with some manual physics/electrical engineering wizardry like
| Dave Rat uses in this video for canceling out audio for a
| centre speaker?
|
| https://youtu.be/AxZOv0baN2Y?si=fc51MQHRItT6nYKI
| recursive wrote:
| Voxengo MSED is a free VST that can set gain/level on mid and
| side independently. https://www.voxengo.com/product/msed/
| DidYaWipe wrote:
| Cool, thanks. I'll check it out.
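The mid/side split that tools like MSED expose is commonly defined as mid = (L + R) / 2 and side = (L - R) / 2 (scaling conventions vary). A minimal sketch of "keep only what is common to both channels" follows, using made-up sample values. The click and rumble signals are illustrative assumptions, not measurements from a real record.

```python
def mid_side(left, right):
    """Split a stereo pair into mid (common) and side (difference) signals."""
    mid = [(l + r) / 2 for l, r in zip(left, right)]
    side = [(l - r) / 2 for l, r in zip(left, right)]
    return mid, side


# Mono programme material: identical in both channels.
music = [0.5, -0.25, 0.75, 0.0]
# Hypothetical defects: a click present only in the left channel, and a
# rumble that appears with opposite polarity in the two channels.
click = [0.0, 0.5, 0.0, 0.0]
rumble = [0.0, 0.0, 0.25, 0.0]

left = [m + c + r for m, c, r in zip(music, click, rumble)]
right = [m - r for m, r in zip(music, rumble)]

mid, side = mid_side(left, right)
print("music:", music)
print("mid:  ", mid)   # what "keep only the mono content" would output
print("side: ", side)  # what gets discarded
```

In this toy case the opposite-polarity rumble disappears entirely from the mid signal, while the one-channel click is only halved. That suggests discarding the side signal helps with a mono transfer but would not, on its own, remove noise that lands in a single channel.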
| dayjah wrote:
| My test sample, me talking with my baby babbling in the
| background, returned a silent audio track. I guess neither I nor
| the baby is considered signal ~_~
| CSSer wrote:
| I'm sorry you had to find out this way, Deckard (Rachel?).
| fbnspl wrote:
| If you're looking for a fair- and non-confusing-priced web app
| for creators, an API or real-time SDK for voice isolation, give
| our solution a try: https://ai-coustics.com/
| rexreed wrote:
| Looks good to me! Is this for video only or can you also upload
| m4a and mp3?
| simshay wrote:
| I have used ai|coustics previously and I think their output
| quality is way better than Eleven Labs or Auphonic. They really
| do a good job there.
| gtvwill wrote:
| Or I could just download virtual dj and run it for free on a
| computer and just do this locally, right now, with zero fancy
| hardware and arguably some of the best stems algorithms on the
| market.
| dc3k wrote:
| i think i'll stick to nvidia broadcast for this
___________________________________________________________________
(page generated 2024-07-03 23:00 UTC)