[HN Gopher] Show HN: AI dub tool I made to watch foreign languag...
       ___________________________________________________________________
        
       Show HN: AI dub tool I made to watch foreign language videos with
       my 7-year-old
        
       Hey HN!  I love watching YouTube with my 7-year-old daughter.
       Unfortunately, the best stuff is often in English (we're German).
       So I made an AI tool that translates videos directly, using the
       original voices. All other sounds, as well as background music, are
       preserved, too.  Turns out that it works for many other language
       pairs, too. So far, it can create dubs in English, Mandarin
       Chinese, Spanish, Arabic, French, Russian, German, Italian, Korean,
       Polish and Dutch.  The main challenge in building this was to get
       the balance right between translating the original meaning and
       getting the timing right. Especially for language pairs like
       English -> German, where the target is often longer than the
       source ("bat" -> "Fle-der-maus", "speed" -> "Ge-schwin-dig-keit").
       Let me know what you think! :)
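
A minimal sketch of the length problem described above, not the tool's actual method. It assumes a crude syllable estimate (counting vowel groups) is a usable proxy for spoken duration. The function names and the 1.25 speed-up limit are hypothetical, chosen only for illustration.

```python
import re

# Crude proxy: one syllable per run of vowels (an assumption, not a phonetic model).
_VOWEL_GROUP = re.compile(r"[aeiouyäöü]+", re.IGNORECASE)


def rough_syllables(text: str) -> int:
    """Estimate syllables by counting vowel groups, at least one per word."""
    return sum(max(1, len(_VOWEL_GROUP.findall(word))) for word in text.split())


def stretch_factor(source: str, target: str) -> float:
    """How much longer the target is than the source, in estimated syllables."""
    return rough_syllables(target) / max(1, rough_syllables(source))


def plan(source: str, target: str, max_speedup: float = 1.25) -> str:
    """Pick a strategy for fitting the target into the source's time slot."""
    factor = stretch_factor(source, target)
    if factor <= 1.0:
        return "fits"
    if factor <= max_speedup:
        return "speed_up"  # speak slightly faster
    return "shorten_translation"  # ask for a more compact wording


for src, tgt in [("bat", "Fledermaus"), ("speed", "Geschwindigkeit")]:
    print(src, "->", tgt, round(stretch_factor(src, tgt), 2), plan(src, tgt))
```

For the two word pairs quoted in the post, the estimate agrees with the hyphenation given there (one syllable versus three, and one versus four), so both land in the "shorten the translation" branch.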
        
       Author : leobg
       Score  : 331 points
       Date   : 2024-02-26 16:08 UTC (2 days ago)
        
 (HTM) web link (speakz.ai)
 (TXT) w3m dump (speakz.ai)
        
       | auct wrote:
       | Can you add ukrainian?
        
         | leobg wrote:
         | Hey! As a target/output language? As a source language, it's
         | already supported.
        
           | oneshtein wrote:
           | Yes, as target language. And remove Russian language,
           | please, unless you are Republican. You can add it back after
           | the war.
        
       | oldge wrote:
       | Very cool, any chance to see an on-device release so we can run
       | this locally? Topaz ai has a pretty good model for this if you
       | are looking to monetize.
        
       | cushpush wrote:
       | This is really amazing I'm very impressed and happy you released
       | this. Can you share more details about your development rhythm of
       | this helpful piece of software?
        
       | bufferoverflow wrote:
       | I speak Russian, and I gotta say, the Lex sample is incredible.
       | It sounds like real dubbing. Maybe not pro-level dubbing, but
       | it's very very good. The voices are also very close to Lex's and
       | Elon's.
       | 
       | Congrats! Very well done.
        
         | leobg wrote:
         | Thank you! Yeah, I also had a big grin on my face, hearing Lex
         | and Elon suddenly talk in another language. :)
         | 
         | Consistency isn't perfect yet, as we're building the voice from
         | scratch basically for each utterance. On the one hand, you
         | want that, because the utterance might be more upbeat, or lower
         | pitch, than the speaker's "average" voice. On the other hand,
         | it sometimes introduces variance that makes the listener's
         | brain go, "Uh.... is that another person speaking now?". If I
         | had to dub 200 videos of a single YouTube channel, I would be
         | able to fine-tune the voices of the main characters, and
         | reserve the ad-hoc cloning for guest characters.
        
       | artninja1988 wrote:
       | Why do you keep the original audio track on the dubbed version? I
       | think it sounds pretty distracting, although you do want to keep
       | sounds other than the original voice I guess
        
         | askhan wrote:
         | I think these little bits of the original sound really work
         | well, helping us hear the original as well and keeping it less
         | uncanny.
         | 
         | What an amazing project!
        
           | leobg wrote:
           | Thank you.
           | 
           | Exactly. I had the OG voice removed at first. But I added it
           | back in for exactly this reason. It also serves as a tool for
           | AI accountability: It lets you "see" that the cloned voice is
           | indeed saying the same thing as the original voice.
           | 
           | That being said, it would be trivial to turn the OG voice off
           | for anyone who wants to.
        
         | maxglute wrote:
         | I dig it, it's like amateur underground over dubs of bootleg
         | hollywood movies. Don't want to conflate amateur AI voice with
         | actual personality on screen. The separation is part of the
         | experience.
        
         | gardenhedge wrote:
         | News channels do that for interviews
        
       | simple10 wrote:
       | This looks amazing! Thanks for sharing. I signed up for the
       | private beta.
        
         | leobg wrote:
         | Thank you so much!
        
       | lxe wrote:
       | What does this use behind the scenes? This type of stuff can get
       | pretty expensive if you're relying on elevenlabs or heygen.
        
         | brody_hamer wrote:
         | Yea my guess would be elevenlabs, which just recently announced
         | this exact featureset.
        
         | leobg wrote:
         | Let's just say this is NOT a wrapper around Elevenlabs or
         | Heygen. I've looked at commercial voice cloning before. But, as
         | you said, the prices seemed ridiculous.
         | 
         | Before this, I made audiobooks for my daughter. Old, out-of-
         | print books, turned into speech. If I remember correctly, with
         | Elevenlabs a single book would have cost me > $100. At that
         | price level, I can read the damn thing myself. What good is
         | computer generated voice if it isn't at least 10x cheaper than
         | doing it yourself?
         | 
         | I'm just one guy. With me, it's just my time, one or two
         | commercial licenses, and other than that just the raw price of
         | running those GPUs.
        
       | scrollaway wrote:
       | I know this is HN so I don't want to distract from the technical
       | achievement and how genuinely useful this can be.
       | 
       | I also don't want to tell you how to raise your kid. You do you,
       | it's not my family. But I want to share how _important_ it is to
       | watch foreign spoken language movies and TV, especially as a kid,
       | to be able to speak multiple languages later in life. You'll
       | notice that in every country where TV and movies are regularly
       | dubbed in the local language, the English levels go to shit.
       | Dubbing is partially responsible for this because kids are not
       | exposed to a different language on a regular basis.
       | 
       | I remember wanting to watch a dubbed movie with my mom as a kid,
       | and she told me "We will watch the original instead, dubbed
       | movies don't have a soul". It stuck with me. She was absolutely
       | right. Today I am working on my sixth spoken language. Causation
       | not guaranteed, merely implied.
        
         | true_religion wrote:
         | Counter point. My parents didn't let me watch dubbed shows, and
         | didn't speak our native language because they wanted me
         | to speak unaffected English.
         | 
         | I can't speak any languages, but in school my English was
         | insanely good. To the point of perfect scores in the college
         | scholastic exams, and when I was in uni for engineering, I took
         | on an English major for fun with essentially no impact on my
         | work load.
         | 
         | You can generalize but you can also specialize.
        
         | Baeocystin wrote:
         | I can't say I agree with dubbed movies having no soul. Greater
         | accessibility to a wider audience is not something to deride,
         | or hold in contempt.
         | 
         | That being said, I do agree that listening to other languages
         | is a great thing. My father was a linguist, and when we would
         | watch subtitled media, we'd play a game where we'd try and hear
         | the cognates, pick out the most common words, figure out the
         | basics of the grammar as we went along. It was a lot of fun!
        
           | leobg wrote:
           | One of my favorite movies was "Scent Of A Woman". But when I
           | watched it in the German-dubbed version, I was appalled. It
           | made the whole movie suddenly seem like a comedy. To me, the
           | translation had killed its "soul", for lack of a better term.
           | 
           | I still want my kids to learn English. And ideally also one
           | or two other languages, like Chinese.
           | 
           | As Nietzsche said:
           | 
           | "So you have mounted on horseback? And now ride briskly up to
           | your goal? Well, my friend - but your lame foot is also
           | riding with you!"
        
         | jeroenhd wrote:
         | >We will watch the original instead, dubbed movies don't have a
         | soul
         | 
         | I disagree. It's all about the quality of the voice actors and
         | the effort put into localisation.
         | 
         | Having grown up on Dutch dubs of many cartoons, I honestly find
         | the Dutch voice actor of Spongebob better than the original.
         | I'm missing the extra energy that the Dutch VA seems to have
         | put into the voice when I hear the original, even if the
         | original is very good. Though text on screen isn't translated,
         | puns and references are, sometimes overhauled completely.
         | 
         | The talent pool for Dutch voice actors isn't as big as I
         | would've liked (you often hear the same five VAs in every show
         | on a given channel), but some of them really put in the work.
         | Many of them only do kids TV and commercials (really freaked me
         | out to hear Ash Ketchum try to sell me soap one day) and not
         | every VA is as good/paid enough/gets decent scripts, but there
         | are some real gems to be found in dubs.
         | 
         | Last year I found out how Ukrainian dubs work and I was
         | astounded by how weird the experience was. I'm used to dubs
         | having only the voice track swapped out, but the Ukrainian
         | shows seemed to just have the actors talk over the original
         | show, like this AI tool does, and I honestly can't imagine ever
         | getting into a show that's dubbed like that. I assume people
         | get used to this, but I found it rather annoying.
         | 
         | Blanket statements like "dubs have no soul" serve nobody. There
         | are good dubs and there are bad dubs, and the ratio will
         | probably differ depending on the language you're talking about.
         | Dismissing all dubs ignores the real heart and soul some
         | dubbing teams have put into their works. That doesn't mean I
         | disagree with the idea of exposing kids to more languages, but
         | I wouldn't expect kids to learn much from just TV shows and
         | movies in the first place.
        
           | imp0cat wrote:
           | I think the main point is that small kids get the basic
           | building blocks for learning languages from anything they
           | hear (even if they don't understand it yet), so listening to
           | as many languages as possible when they are young will make
           | learning languages easier for them later in their lives.
        
             | jeroenhd wrote:
             | I've heard this argument before, mostly from companies
             | trying to sell language courses for kids. As far as I know
             | it's true that kids pick up on languages much easier when
             | they're young, but I'm not so sure those skills will stick
             | if all they can converse with is the TV. That's quite
             | different from having a speaking partner such as a teacher
             | or a bilingual parent. I suspect this is why shows like
             | Dora the Explorer are set up like an interactive game.
             | 
             | I myself have been exposed to subtitled English shows and
             | movies all my life (not every show or movie was dubbed, and
             | there were some German shows that made it through as well)
             | but I don't think I actually started speaking any English
             | until I needed it to interact with strangers in Runescape,
             | while at the same time I stopped watching any dubbed shows.
             | Almost all of the content I consumed became English
             | language content.
             | 
             | Almost passively learning a language by enveloping oneself
             | in it works (though actual study will help you advance
             | quicker), but you need more than TV. I can't find the
             | actual paper I read on this once (thanks, SEO spam!) but as
             | I recall, the biggest advantage kids have is to learn
             | pronunciation without an accent; picking up vocabulary and
             | grammar don't seem to be too affected by age from what I
             | recall.
        
           | jamager wrote:
           | Voices in dubbed movies don't have any depth, for instance.
           | 
           | That doesn't have anything to do with the quality of the
           | voice actors. Everything sounds flat because that is just how
           | they record it.
           | 
           | Dubbing is a useful convenience, an accessibility feature
           | (even if it wasn't born that way). But they have way less
           | soul.
        
             | jeroenhd wrote:
             | I guess we just disagree, or maybe you're used to worse
             | dubs than I am. There's nothing inherently flat about
             | dubbing at all. In fact, in many (older) movies and shows,
             | actors would dub over themselves to get better audio.
        
               | jamager wrote:
               | In a movie, if a character is far away, their voice comes
               | from far away. Voice actors always have the mic in front
               | of them, so their voices always come from the same place,
               | not relative to the scene. That's what I meant.
               | 
               | I also think it is beautiful to hear the sound of the
               | original language, particularly if it is one I am not
               | used to. It's part of the charm.
               | 
               | I have grown up with dubs, though, so I understand you. But
               | once one gets used to no dubs, there is no way back. It's
               | like removing sugar from the coffee.
        
         | Freak_NL wrote:
         | Mostly I agree with this, but for animated works dubs can be an
         | integral part of the product when done right, and some are even
         | tweaked for different languages (although I strongly reject
         | adjusting the actual cultural content for different locales).
         | The dubs have to be made in concert with the original though.
         | There is also a lot of plain crap out there.
         | 
         | But absolutely; for anything featuring live action, dubs just
         | damage the original.
         | 
         | I watch a German man building his massive Lego city on Youtube
         | (narrated and recorded quite professionally) with my five year
         | old son for a few minutes before bed. He is now at the point
         | where he is trying to give this weird language (to him) a place
         | in his head. Some words are familiar (being Dutch), some are
         | foreign, and you can see the feedback loop happening when words
         | do land; he wants to know what that man is saying. I don't
         | expect him to pick up any German at this point, but the basics of
         | immersion in another language are there.
        
           | scrollaway wrote:
           | Yes I agree with you. Actually, good-quality dubbed animated
           | movies (= Disney) are what I often use to help learn a new
           | language.
        
         | leobg wrote:
         | FYI, I agree with you in all points.
         | 
         | As I said in another comment, I wouldn't want to live in a
         | world where everything was dubbed into my language.
         | 
         | Any translation takes something away from the original. And
         | dubbing even more so.
         | 
         | I also believe that being exposed to a foreign language long
         | before you ever make a conscious attempt to learn it is
         | important. I wouldn't think I'd succeed in teaching my toddler
         | to say "Daddy" if he hadn't been listening to the rest of us
         | speaking for many months before.
         | 
         | I can see how this headline can make me seem like a buffoon of a
         | dad. But I think I'm really not. :) When I watch The Anatomy
         | Lab with my daughter, that's a time when I want our
         | conversation to focus on how digestion works. Not on what the
         | guy on the screen was saying just now. But of course there will
         | also be times where I'll want our conversation to be about
         | exactly that: What a foreign speaker just said. How those words
         | come together. How they may have the same root as the words we
         | use in German. Also, while AI has its place, I prefer to have
         | these conversations with her myself.
        
       | sorenjan wrote:
       | I know Germany dubs most video, but wouldn't a seven year old be
       | able to read subtitles? It's a great way for her to learn
       | English, it's how most Swedes learn it before starting school. I
       | think there's a pretty strong correlation between countries'
       | average English proficiency and how common dubbing is.
       | 
       | https://haonowshaokao.com/2013/05/18/does-dubbing-tv-harm-la...
       | 
       | Edit: I forgot to mention that the samples on the website are
       | impressive and well made. How do you do the speaker diarization
       | and voice cloning?
        
         | supafastcoder wrote:
         | I think it's a cultural difference. I'm also from a non-dubbing
         | country (Netherlands) and I can't stand dubbed content either.
         | On the other hand people tell me they can't stand subtitles
         | because it "reveals" what they're going to say before they say
         | it.
        
           | lukan wrote:
           | "people tell me they can't stand subtitles because it
           | "reveals" what they're going to say before they say it."
           | 
           | I love watching movies in the original language, but this is
           | something I hate as well, but something that can be avoided.
           | 
           | Some movies get it right, though. The timing, just the words
           | that are spoken and even different colors for different
           | persons speaking (very rare, cannot even remember where I
           | have seen it). That should be standard, but with most movies
           | you can be lucky if the subs even match the plot and do not
           | reveal too much.
        
             | crtasm wrote:
             | >different colors for different persons speaking
             | 
             | BBC iPlayer does this for some content, I don't know if
             | it's ever on movies though.
        
               | masfuerte wrote:
               | It is. The iPlayer subtitles for Citizen Kane use colour
               | to distinguish speakers.
        
             | jeroenhd wrote:
             | Some of the best subtitles I've ever seen were on Tom
             | Scott's YouTube channel. They use different colours,
             | indicators for jokes and sarcasm, while also staying
             | relatively close to what's actually been said. They're
             | better than many big-budget movies and TV shows I've seen.
             | 
             | He talked about subtitling at some point, and I was
             | surprised how cheap subtitling services are. I think he
             | went beyond the price he mentioned, but it really made me
             | question why big, profitable YouTube channels aren't
             | spending the small change to do at least native language
             | subtitles that Google can translate, instead of relying on
             | YouTube's terrible algorithm
             | 
             | That said, Whisper seems to generate quite good subtitles
             | that take short pauses for timing into account, but they're
             | obviously never going to be as good as a human that actually
             | understands the context of what's being said.
        
               | thylacine222 wrote:
               | Whisper can also generate timings at the word level,
               | which you could use to make better-timed subtitles
        
               | leobg wrote:
               | Yes. But Whisper's word-level timings are actually quite
               | inaccurate out of the box. There are some Python
               | libraries that mitigate that. I tested several of them.
               | whisper-timestamped seems to be the best one. [0]
               | 
               | [0] https://github.com/linto-ai/whisper-timestamped
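
A hedged sketch of consuming word-level timings once a transcription exists. It assumes a result dictionary in which each segment carries a `words` list with `text`, `start`, `end`, and `confidence` keys, which is the shape the whisper-timestamped README shows. The transcription call itself is left out, the sample data is hand-written rather than real model output, and the `iter_words` helper and the 0.5 threshold are hypothetical.

```python
from typing import Iterator


def iter_words(result: dict, min_confidence: float = 0.0) -> Iterator[dict]:
    """Yield word dicts (text, start, end) in spoken order.

    Words below min_confidence are skipped, since their timings are the
    ones most likely to be off.
    """
    for segment in result.get("segments", []):
        for word in segment.get("words", []):
            if word.get("confidence", 1.0) >= min_confidence:
                yield {
                    "text": word["text"],
                    "start": word["start"],
                    "end": word["end"],
                }


# Hand-written sample in the assumed shape; not real model output.
sample = {
    "segments": [
        {
            "words": [
                {"text": "Hello", "start": 0.10, "end": 0.45, "confidence": 0.93},
                {"text": "uh", "start": 0.45, "end": 0.50, "confidence": 0.21},
                {"text": "world", "start": 0.60, "end": 1.05, "confidence": 0.88},
            ]
        }
    ]
}

words = list(iter_words(sample, min_confidence=0.5))
print([w["text"] for w in words])
```

Dropping low-confidence words is only one possible policy; an alternative would be to keep them and interpolate their timings from the neighbouring words.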
        
             | leobg wrote:
             | That's a great use case for LLMs, actually. Translate the
             | sentence only up to what has been said so far. Basically, a
             | balance between translating word-for-word (perfect timing,
             | but terrible grammar) and translating the whole sentence
             | and/or thought (perfect grammar and meaning, but
             | potentially terrible timing).
             | 
             | With the SRT file format for subtitles, I think, there's no
             | reason why one couldn't make groups of words appear as they
             | are spoken.
             | 
             | Actually, I have to do the same thing when generating the
             | dubbed voices. Otherwise it feels as though the AI voice is
             | saying something different than the person in the video,
             | especially when the AI finishes speaking and you still hear
             | some of the last words from the original speaker.
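
The idea of making groups of words appear as they are spoken can be sketched as follows. This is an illustration under stated assumptions, not the author's pipeline: the input is a list of word dictionaries with `text`, `start`, and `end` in seconds, and the function names, the three-word group size, and the 0.6 second pause threshold are all hypothetical.

```python
def srt_timestamp(seconds: float) -> str:
    """Format seconds as an SRT timestamp, HH:MM:SS,mmm."""
    ms = round(seconds * 1000)
    hours, ms = divmod(ms, 3_600_000)
    minutes, ms = divmod(ms, 60_000)
    secs, ms = divmod(ms, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def words_to_srt(words: list, max_words: int = 3, max_gap: float = 0.6) -> str:
    """Group timed words into short SRT cues.

    A new cue starts when the current one is full or when the speaker
    pauses for longer than max_gap seconds.
    """
    groups, current = [], []
    for word in words:
        too_long = len(current) >= max_words
        paused = bool(current) and word["start"] - current[-1]["end"] > max_gap
        if current and (too_long or paused):
            groups.append(current)
            current = []
        current.append(word)
    if current:
        groups.append(current)

    cues = []
    for index, group in enumerate(groups, start=1):
        start = srt_timestamp(group[0]["start"])
        end = srt_timestamp(group[-1]["end"])
        text = " ".join(w["text"] for w in group)
        cues.append(f"{index}\n{start} --> {end}\n{text}\n")
    return "\n".join(cues)


timed_words = [
    {"text": "I", "start": 0.0, "end": 0.2},
    {"text": "will", "start": 0.2, "end": 0.5},
    {"text": "go", "start": 1.5, "end": 1.7},
    {"text": "to", "start": 1.7, "end": 1.8},
    {"text": "the", "start": 1.8, "end": 1.9},
    {"text": "cinema", "start": 1.9, "end": 2.4},
]
srt_text = words_to_srt(timed_words)
print(srt_text)
```

Each cue here replaces the previous one on screen. A variant that accumulates the sentence so far would repeat the earlier words in each cue's text and keep the same timestamps.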
        
               | postexitus wrote:
               | Unfortunately not all languages follow the same sentence
               | structure, so translating "up to what has been said so
               | far" is not possible.
               | 
               | Assume 2 dramatic stops in an English sentence, and
               | observe Turkish version. You can "I will.. go to.... the
               | cinema" "Ben... sinemaya... gidecegim" (I .. to the
               | cinema.. go)
               | 
               | I am sure there are smarter examples.
        
           | alexdbird wrote:
           | I prefer subs over dubbing for foreign languages, but I
           | cannot stand closed captions (for people who can't hear at
           | all) because having your eye drawn to the bottom of the
           | screen for a description of something I don't need to know
           | about is horrible!
        
             | vidarh wrote:
             | Sometimes it's hilarious when they're trying to describe
             | the dramatic tension from sounds or music, and "reveal" all
             | the cliches, though. "Music swells to a tear-jerking
             | crescendo"
        
           | matsemann wrote:
           | I think you get used to it. Like a punchline I've read, but I
           | don't "register" it until the proper thing happens on the
           | screen.
        
           | vidarh wrote:
           | I'm Norwegian, and Norway used to be near-universally non-
           | dubbing other than for TV for the _very_ youngest children,
           | and even then almost exclusively cartoons or stop motion etc.
           | where it wasn't so jarring. But the target age of material
           | being dubbed has crept up as it has become relatively-
           | speaking cheaper to do compared to revenues generated in what
           | is a tiny market.
           | 
           | The thing that annoys me the most about it is that it often
           | alters the feel of the material. E.g. I watched Valiant
           | (2005) with my son in Norwegian first, because he got it on
           | DVD from his grandparents. He doesn't understand much
           | Norwegian, but when he first got the DVD he was so little
           | that it didn't matter. A few years later we watched the
           | English language version.
           | 
           | It comes across as _much darker_ in the English version. The
           | voice acting is much more somber than the relatively cheerful
           | way the Norwegian dub was done, and while it's still a
           | comedy, in comparison it feels like the Norwegian version
           | obscures a lot of the tension, and it makes it feel almost
           | like a different movie.
           | 
           | I guess that could go both ways, but it does often feel like
           | the people dubbing something are likely to have less time and
           | opportunity to get direction on how to play the part, and you
           | can often hear the consequences.
        
         | wodenokoto wrote:
         | > but wouldn't a seven year old be able to read subtitles?
         | 
         | No, they wouldn't.
         | 
         | I don't believe that most swedes learn English by reading
         | subtitles before starting school.
         | 
         | > I think there's a pretty strong correlation between
         | countries' average English proficiency and how common dubbing
         | is.
         | 
         | That I agree with.
        
           | NicoJuicy wrote:
           | Most people in Belgium learn English through that before
           | school.
           | 
           | Why wouldn't swedes?
        
             | wodenokoto wrote:
             | You are saying that _most_ kids in Belgium can read
             | subtitles before they start school?
             | 
             | It took me several years of school before being able to
             | read fast enough to follow along subtitles, and the same
             | for everyone I know.
        
               | Freak_NL wrote:
               | They probably meant before they start learning English in
               | primary school, not before they start school.
               | 
               | This used to be the case in the Netherlands too; I picked
               | up a significant body of English from British TV series
               | watched with subtitles as kid. Nowadays this advantage
               | will probably be missed by most children, because the
               | streaming services offer a lot of dubbed content, and you
               | get to pick what you watch unless someone guides you.
               | Subtitles can be avoided for longer.
        
               | NicoJuicy wrote:
               | I think you're native English and have the associated
               | bias concerning how it works in practice?
               | 
               | Since kids first learn their native language ( write,
               | read and speak) in school and only years after then (
               | mostly), learn foreign languages.
               | 
               | When they learned to do it in their native language, they
               | hear English spoken on tv with eg. Dutch subtitles and
               | pick it up. Sometimes before they have English lessons.
               | 
               | Most kids, as such, know a fair amount of English before
               | they have it ( = English) in school.
               | 
               | The Dutch subtitles aren't always a requirement though.
               | Kids will pick it up in some shows, eg. Pokemon would be
               | a good example if English spoken.
        
             | duckmysick wrote:
             | Are you saying that kids of age six can understand and
             | speak English at a basic level - say half way to A1?
             | 
             | Or is it just a basic familiarity (like a couple of most
             | common words) and awareness that English exists?
             | 
             | EDIT: I see from a reply below by Freak_NL that it probably
             | means before the kids _start learning English at school_.
             | That makes more sense, as they would be older at that
             | point.
        
               | kwhitefoot wrote:
               | > as they would be older at that point.
               | 
               | I don't know about The Netherlands but here in Norway
               | children start learning English as soon as they start
               | school at the age of five or six. But quite likely many
               | of them will have at least some English already because
               | of English language television, computer games, etc.
        
           | input_sh wrote:
           | > I don't believe that most swedes learn English by reading
           | subtitles before starting school.
           | 
           | It's not about learning the language per se, it's about
           | familiarizing yourself with the sound of the language, which
           | then makes formal learning feel much more intuitive. English
           | becomes an easy subject because you always feel a little
           | ahead of the material. When faced with a "fill in the blank"
           | type of questions, you're able to answer them by what _feels_
           | right, even when you can't quite explain _why_ it feels
           | right.
           | 
           | It's why #1 rule of language learning at any stage in life is
           | always gonna be immersing yourself with the language you want
           | to learn, and by far the most effective way to immerse
           | yourself (excluding moving to another country) is to consume
           | content in your target language.
        
           | anhner wrote:
           | >> but wouldn't a seven year old be able to read subtitles?
           | 
           | > No, they wouldn't.
           | 
           | hard disagree
        
             | voidpointer wrote:
             | Reading speed at that age will vary greatly. Reading
             | subtitles while also having to follow the picture takes
             | away focus and that makes it much harder for an
             | inexperienced reader. My daughter, who picked up reading
             | very naturally would have been able to follow sub-titles at
             | age 7 without much trouble. My younger, 7-yo son on the
             | other hand, who is more average in reading ability wouldn't
             | be able to keep up with subtitles yet. Average reading
             | speeds at age 7 seem to be 60-100 words per minute where
             | subtitles are more at the 100-150 words per minute range.
             | So for above-average readers, it will be possible but for
             | the average, they won't be able to keep up consistently.
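
The arithmetic behind that comparison, as a small sketch. It uses the ranges quoted in the comment (60-100 words per minute for readers, 100-150 for subtitles). The function name is hypothetical, and the model assumes the viewer spends the whole display time reading, which is optimistic.

```python
def share_of_subtitles_read(reader_wpm: float, subtitle_wpm: float) -> float:
    """Fraction of subtitle words a reader can get through, capped at 1.0."""
    if reader_wpm <= 0 or subtitle_wpm <= 0:
        raise ValueError("rates must be positive")
    return min(1.0, reader_wpm / subtitle_wpm)


# Corner cases of the ranges quoted in the comment above.
for reader in (60, 100):
    for subs in (100, 150):
        share = share_of_subtitles_read(reader, subs)
        print(f"reader {reader} wpm, subtitles {subs} wpm: {share:.0%} read")
```

Only the fastest reader paired with the slowest subtitles reaches full coverage, which matches the comment's conclusion that above-average readers can keep up while average ones cannot do so consistently.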
        
           | konschubert wrote:
           | My 6 year old has been watching 20 minutes of cartoons every
           | night for the past two years. This is the only exposure to
           | the English language that she has ever had.
           | 
           | She has learned to understand what is said in the cartoons.
           | Of course she misses some things, but it's surprising how
           | much she gets.
           | 
           | Like, when I ask her "what did Bluey just say?", she can
           | explain it.
           | 
           | Children's brains are awesome.
           | 
           | But actually, grown-ups can also pick up quite a lot if they
           | actually immerse themselves.
        
             | mysterydip wrote:
             | Bluey is an excellent cartoon to do that with. Kudos!
        
               | konschubert wrote:
               | I just wish there was a way to buy the Australian
               | original version as a download.
        
           | vidarh wrote:
           | Subtitles in a foreign language? Probably not. _Subtitles
           | translated into their original language_? I think it's
           | probably an exaggeration that people have learnt it before
           | starting school because it implies a lot about what learning
           | it means, but picking up a number of words, sure.
        
           | ivanhoe wrote:
           | Young kids don't even need subtitles, their brains are wired
           | to figure out spoken languages, after all that's how we all
           | learn our mother tongue initially. Last summer my then 3.5
           | years old, to my huge surprise, started talking in (simple,
           | but correct) English with some tourist kids she met in the
           | park. We never spoke English at home with her before, so I
           | presume she picked it up from youtube and her older brother,
           | but I had no idea she can form full sentences - including
           | conditionals and past tense. And at first she was a bit slow
           | to express herself, but after a few hours of play with those
           | kids she sounded totally relaxed and fluent.
        
         | darkwater wrote:
         | A 7yo can barely keep up with subtitles in their mother tongue,
         | depending on the speed. And that's probably true for a p90
         | reader. There is no way a p50 reader can follow subs and
         | understand what they say. Now, this being a video, they might
         | be able to interpolate from what they see, so it might be a nice
         | challenge. But doing this with subtitles in a foreign language
         | is only for a few, privileged minds.
         | 
         | Source: father of an 8yo with VERY good reading skills (already
         | reading books in 2 languages targeted at tweens)
        
         | leobg wrote:
         | Yeah, this isn't really helpful for her to learn English. This
         | is more when we watch The Anatomy Lab, or BBC's "The Incredible
         | Human Journey". She'll already be asking me a lot of questions
         | about the content. So if I had to translate on top of that, it
         | would be tedious.
         | 
         | Subtitles - those are actually being generated as well. I've
         | generated SRT files during development. Color coded by speaker,
         | and on a per-word basis, for me to get the timing right.
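         | 
         | A minimal sketch of what such a per-word, color-coded SRT file
         | could look like. The speaker-to-color mapping and the input tuple
         | layout are assumptions for illustration, and the <font color>
         | tag is a common SRT extension that not every player honors.

```python
def srt_timestamp(seconds: float) -> str:
    """Format seconds as an SRT timestamp: HH:MM:SS,mmm."""
    ms = round(seconds * 1000)
    h, ms = divmod(ms, 3_600_000)
    m, ms = divmod(ms, 60_000)
    s, ms = divmod(ms, 1000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"


# Hypothetical mapping from diarized speaker label to a display color.
SPEAKER_COLORS = {"A": "#ffcc00", "B": "#66ccff"}


def words_to_srt(words):
    """Build one SRT cue per word.

    words: list of (speaker, text, start_sec, end_sec) tuples.
    """
    cues = []
    for i, (speaker, text, start, end) in enumerate(words, start=1):
        color = SPEAKER_COLORS.get(speaker, "#ffffff")  # white for unknown speakers
        cues.append(
            f"{i}\n{srt_timestamp(start)} --> {srt_timestamp(end)}\n"
            f'<font color="{color}">{text}</font>\n'
        )
    # Cues are separated by a blank line.
    return "\n".join(cues)
```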
         | 
         | Basically, if you have a YouTube channel, you can take any
         | video from your channel, run it through Speakz.ai, and you'll
         | get 15+ additional audio tracks in different languages, plus
         | 15+ subtitle files (SRT).
         | 
         | Voice cloning and speaker diarization were a bit of a challenge.
         | On the one hand, I want to duplicate the voice that is being
         | spoken right now. On the other hand, sometimes "right now" is
         | just a short "Yeah" (like in the Elon interview) which doesn't
         | give you a lot of "meat" to work with in terms of embedding the
         | voice's characteristics.
         | 
         | Right now, I'm using a mix of signals:
         | 
         | - Is the utterance padded by pauses before/after?
         | - Is the utterance a complete sentence?
         | - Does the voice of the utterance sound significantly different
         |   from the voice of the previous utterance?
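         | 
         | One way the three signals above could be combined is a simple
         | two-out-of-three vote. This is only a sketch: the vote itself,
         | the thresholds, and the use of cosine distance on speaker
         | embeddings are assumptions, not a description of the actual
         | pipeline.

```python
import math


def cosine_distance(a, b):
    """1 - cosine similarity of two non-zero embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm


def speaker_changed(prev_emb, cur_emb, pause_before, pause_after, text,
                    min_pause=0.3, max_distance=0.4):
    """Guess whether the current utterance starts a new speaker turn.

    min_pause (seconds) and max_distance are illustrative thresholds.
    """
    padded = pause_before >= min_pause and pause_after >= min_pause
    complete = text.rstrip().endswith((".", "!", "?"))
    different = cosine_distance(prev_emb, cur_emb) > max_distance
    # Assumed rule: at least two of the three signals must agree.
    return sum([padded, complete, different]) >= 2
```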
         | 
         | It's a deep, deep rabbit hole. I was tempted to go much deeper.
         | But I thought I better check if anybody besides myself actually
         | cares before I do that... :)
        
         | ChemSpider wrote:
         | Dubbing in Germany is horrible and pervasive. Even in the news
         | and interviews. Subtitles are cheaper and better.
         | 
         | As others have said, it is better to expose kids (that can
         | read) to the original language plus subtitles.
         | 
         | So in other words, your solution while technically great is
         | pedagogically not wise. A typical geek approach to a problem ;)
        
           | rob74 wrote:
           | The worst thing about dubbing is that it's more important for
           | the translations to have roughly the same length and
           | correspondence to the original mouth movements than to be
           | accurate. So the original meaning is often altered, and you
           | don't even know it because of course you have no easy access
           | to the original most of the time. But unfortunately Germans
           | are so used to dubbing that subtitles don't really stand a
           | chance. There are a few cinemas here and there that show
           | original-language movies with subtitles, and on TV there was
           | one experiment that I'm aware of a few years ago (on Pro
           | Sieben Maxx) to show TV series with subtitles, but it was
           | cancelled after some time. AFAIK it's also more expensive to
           | secure the rights to show English-language content compared
           | to dubbed content.
        
         | poulsbohemian wrote:
         | >I know Germany dub most video, but wouldn't a seven year old
         | be able to read subtitles?
         | 
         | I gotta say... while sometimes it is a necessary evil, I
         | would so rather not have to read subtitles. I often want to
         | listen to a show so that I can also continue working on
         | catching up on email, etc., i.e. I can't read two things at once,
         | but I can listen to one thing and continue working on something
         | else.
        
       | 2099miles wrote:
       | Talked about this idea last month since astrobiology still isn't
       | all dubbed to English. Thank you for actually making the tool,
       | it's awesome, huge Kudos.
        
       | sss111 wrote:
       | Can you add Hindi as an output language? Been meaning to build
       | something like this for my parents. You saved me some work haha!
        
       | jianshen wrote:
       | Wow this is amazing. If there was a locally running version
       | available, I would gladly pay money for it.
        
         | leobg wrote:
         | Thanks!
         | 
         | Well, to make local happen I'd have to learn more about local
         | app development.
         | 
         | I'd also be worried about having to support a bunch of
         | different platforms, and being beholden to ever changing rules
         | made by App Stores and OS makers. I actually work on a 2015 Mac
         | with a 2019 operating system. There are many great looking AI
         | apps that I'd love to run but can't.
         | 
         | Besides, it seems to me that making this centralized makes
         | economic sense. I can just keep the GPU busy with lots of
         | videos from many customers. I'm sure that's what most people
         | think who build something: "The world would be so much better
         | if everyone just came here and used this." :)
        
       | solardev wrote:
       | This is really impressive! Can't wait to see this more fleshed
       | out. I'd gladly pay for something like this (by the video,
       | ideally).
       | 
       | Some page feedback though: It seems to me that the video just
       | keeps playing, with no way to restart it or scrub through the
       | timeline. Each time I click a language, it changes the spoken
       | audio but just keeps playing where it left off. That makes it
       | hard to compare the same passage across different languages.
       | 
       | Separately, I think there are also some errors in translation.
       | For Sample 3 (about the vines), the original in Mandarin Chinese
       | says something like "if this tree gets grabbed, the weed will
       | climb up and wrap around it, and the tree won't be able to
       | photosynthesize and will die". But the English mistranslation
       | says "If it gets scared by people, it gets pulled off and messed
       | with. It can't function. The evil effects? It just dies."
       | 
       | There are also timing issues where the translations don't match
       | up with the original subtitles or dialogue, and certain parts of
       | the original audio just seem to be altogether ignored and not
       | translated.
       | 
       | Maybe displaying the translated subtitles, along with a way for
       | users to report errors, would help...?
        
         | leobg wrote:
         | Thank you very much!
         | 
         | Yes. You cannot control the video playback on the demo page. I
         | made it so because I wanted a way to showcase how you can
         | switch between languages. You can go from Elon speaking English
         | to German, Russian and Chinese, each with just one click.
         | Activating the player controls would have made the UI more
         | complex and distracting. And it would have also made it harder
         | for me to sync the timing between languages.
         | 
         | Of course, the real output would be a proper player, with all
         | of the controls. Or, for creators, raw files (video and/or
         | audio, plus SRT subtitles).
         | 
         | I also noticed problems in the translation of the Chinese
         | video. I put it up there anyway, because I figured most people
         | coming to my site would be English speakers, and being able to
         | understand a Chinese video might be another interesting aspect,
         | in addition to the idea of being able to turn your own English
         | content into languages you don't speak.
         | 
         | If this had been a pitch deck, I would have cherry picked the
         | samples. But I wanted to share where the project is right now
         | and see if anyone was interested. Premature optimization is the
         | root of all evil in programming. I think Knuth said that. And it's
         | a trap I regularly fall into. So I tried to be disciplined this
         | time.
         | 
         | But if any Chinese YouTuber would ask me to dub their work
         | today, I'd make darn sure that the translations were close to
         | perfect. Meaning I'd allow the system to make changes to the
         | way things are phrased if that's necessary for the purpose of
         | timing or cultural context. But I wouldn't allow it to skip a
         | thought from the original video, or say something different.
         | 
         | I've translated books by hand in the past. So this is something
         | I care about. If the demo isn't perfect in this regard, it's
         | because I didn't know if anyone was going to even look at my
         | project. When I first posted this yesterday, my submissions
         | didn't go beyond one comment for several hours. I already
         | thought I had built another solution looking for a problem. :)
         | 
         | If you're seeing dropped phrases, that's most likely because my
         | arranging function failed. Basically, the translation ran
         | longer than the original. The algorithm tried to speed it up
         | and fit it in. But it failed and dropped it. Better handling of
         | these overruns is on my to-do list. Neither drops nor speedups
         | should be tolerated.
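         | 
         | The speed-up-or-drop behavior described above can be sketched
         | roughly like this. The function name and the 1.15x speed-up cap
         | are assumptions for illustration only.

```python
def fit_utterance(duration, slot, max_speedup=1.15):
    """Return the playback rate needed to fit a dubbed clip into its slot.

    duration: length of the translated audio in seconds.
    slot: time available in the original video in seconds (must be > 0).
    Returns 1.0 if the clip already fits, a rate > 1.0 if it can be
    sped up enough, or None if even max_speedup is not enough (the
    case in which the clip ends up being dropped).
    """
    if duration <= slot:
        return 1.0
    rate = duration / slot
    return rate if rate <= max_speedup else None
```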
         | 
         | In terms of self-correction, I plan to feed the translated
         | audio back into the transcription engine. Then, an LLM can
         | compare the translation with the original transcript. If
         | anything is missing, the pipeline will be forced to run again
         | with slightly different parameters. There shouldn't be a human
         | necessary in the loop. Translation is what Transformers are
         | best at.
        
           | solardev wrote:
           | Gotcha, thanks for the great walk-through and in-depth
           | explanations! Excited to see how this thing progresses.
           | 
           | I'd totally pay to have something like this as a Chrome
           | plugin for YouTube, for example.
        
       | gagabity wrote:
       | This is great, I tried to do a similar thing once but my language
       | is one of those that AI doesn't do well.
       | 
       | I think you can look into muting the original voice in the video,
       | I remember I saw there is some AI/tech that can separate audio
       | into voice and nonvoice.
       | 
       | Yandex browser does this in the browser, you open a YT video and
       | it offers to translate, a few seconds later the voices are all in
       | Russian, it's probably the most interesting production use of AI
       | I have seen and for free. It's to Russian only unfortunately.
        
       | maxglute wrote:
       | Very passable. Waiting for something local like this for foreign
       | language PLEX and podcasts. As someone who views/listens to
       | things at 2x/3x speed 10-15 bucks an hour is cost prohibitive.
        
         | leobg wrote:
         | Perhaps you could ask some of your favorite podcast hosts to
         | make a deal with me. Running a training on their voice once and
         | then just re-using that will be much cheaper. Also, customers
         | who buy in bulk will help me focus on this full time. There is
         | huge potential for making this faster and cheaper.
         | 
         | (Even using OpenAI is silly. Technically, I neither need
         | GPT-4's knowledge nor instruction tuning. Both add unnecessary
         | cost and latency. But it helped me get the demo out.)
         | 
         | Basically, the deal for Podcasters / YouTubers would be:
         | 
         | - Get all their episodes converted into 15+ languages
         | - Increase their reach today, while the novelty is high and the
         |   market is still uncrowded
         | - They get to tell their sponsors that they now have reach across
         |   the language boundary
        
           | maxglute wrote:
           | I don't know the state of the podcast ecosystem, but I think
           | you should reach out to listennotes.com, which is also a
           | one-man job that seems to elevate discovery and looks like it
           | has reasonable reach for producers. Or go hit up some popular
           | western podcasts; you've definitely got something here and the
           | execution is good enough.
        
       | theogravity wrote:
       | Wonder how accuracy of the translation is measured (if at all).
        
         | leobg wrote:
         | One idea I have is to use back-translation. After generating
         | the new language audio, feed it back into the transcription,
         | and then have an LLM compare it to the transcript from the
         | original. Penalty for any thought/detail that is missing. If
         | too bad, start from scratch.
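         | 
         | A rough sketch of that back-translation check. Plain word
         | overlap stands in for the LLM comparison here, and it assumes
         | the re-transcribed dub has already been translated back into
         | the source language; the 0.6 threshold is arbitrary.

```python
import re


def content_words(text):
    """Lowercased set of word tokens in the text."""
    return set(re.findall(r"\w+", text.lower()))


def coverage(original, round_trip):
    """Share of the original's words that survive the round trip."""
    orig = content_words(original)
    if not orig:
        return 1.0
    return len(orig & content_words(round_trip)) / len(orig)


def needs_retry(original, round_trip, threshold=0.6):
    """True if too much of the original is missing and the pipeline
    should run again with different parameters."""
    return coverage(original, round_trip) < threshold
```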
        
       | daremon wrote:
       | This is really amazing! Well done.
       | 
       | I already joined the beta but I want to point out another use
       | case here as well:
       | 
       | In many countries (e.g. Greece, where I'm from) movies and TV shows
       | never get dubbed. We rely on subtitles. This means that if you
       | can't see well (disability or age-related eye problems) and if
       | your English is not excellent, then you are doomed to only watch
       | locally produced movies & shows.
       | 
       | This can be a real life-changer.
        
         | leobg wrote:
         | Thank you!
         | 
         | With movies, I think I could get into legally challenging
         | territory. I guess all AI apps are, in a way. But with movies,
         | there's an entire industry behind enforcing copyright. So I
         | must tread carefully on that front.
         | 
         | I made the jump from the courtroom into VS Code years ago. I
         | really don't want to go backwards.
        
           | daremon wrote:
           | I honestly don't see how movies are different from any other
           | content, e.g. YouTube videos. I am pretty sure MrBeast etc. have
           | the same lawyers as any big studio.
           | 
           | Could this run locally? I would certainly pay for that and
           | you're off the hook on how anyone uses it.
        
       | waldrews wrote:
       | Impressively done! It sounds like you're
       | 
       | 1) doing voice recognition with voice time clues, which Whisper
       | and the like provide, breaking it up into sentence (or similar)
       | units; you don't need to time match individual words, but you
       | need to time match at coarser grain.
       | 
       | 2) using a translation engine that allows for multiple
       | alternative translations
       | 
       | 3) cloning the original voice, regardless of language
       | 
       | 4) choosing the translation that has the best time match
       | (possibly by syllable counting, or by actually rendering and
       | timing the translations). If there isn't a close translation,
       | maybe you're asking ChatGPT to forcibly rephrase?
       | 
       | 5) Maybe some modest pitch-corrected rate control to pick the
       | path that gets you closest to the timing?
       | 
       | Did I get any of that right?
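       | 
       | For step 4, picking the candidate by syllable count could look
       | roughly like this. The vowel-group heuristic is a crude stand-in
       | for a real syllable counter, and both function names are made up
       | for illustration.

```python
import re


def estimate_syllables(text):
    """Rough syllable estimate: count groups of consecutive vowels."""
    return len(re.findall(r"[aeiouyäöü]+", text.lower()))


def best_time_match(source, candidates):
    """Pick the candidate translation whose estimated syllable count
    is closest to that of the source sentence."""
    target = estimate_syllables(source)
    return min(candidates, key=lambda c: abs(estimate_syllables(c) - target))
```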
        
         | euazOn wrote:
         | I also noticed that the third sample with Chinese sounds
         | slightly sped up in the first English segment, so there may be
         | also an element of postprocessing the dub (speeding it
         | up/slowing it down).
        
           | leobg wrote:
           | Yes. Though I don't like this solution. It breaks the flow.
           | And it also doesn't really fully solve the problem. Overruns
           | still accumulate if they happen too frequently. One second
           | here, one second there... the further you get into the video,
           | the worse it gets.
           | 
           | I think it would be better to either slow down the underlying
           | video or solve the overrun issue on the translation level. A
           | good professional dubber will find translations that will
           | even out in terms of timing. That's something an AI should be
           | able to do better instead of worse.
        
           | odiroot wrote:
           | The last sample from BBC is really hilarious when translated
           | to Polish. Something definitely went wrong and the voice
           | speaks like a drunkard.
        
         | waldrews wrote:
         | Ooh and you're probably doing a split into voice and non-voice
         | tracks of the original, and keeping non-voice at original
         | volume, but lowering the voice track.
        
         | davidzweig wrote:
         | I think it's a speech to speech model, I know about
         | seamlessm4t:
         | https://www.google.com/amp/s/about.fb.com/news/2023/08/seaml...
         | 
         | Interesting, but what inference engine supports it to run at
         | decent speeds?
        
         | leobg wrote:
         | Very good!
         | 
         | Yes, that's basically how it works.
         | 
         | I don't do any pitch-correction. But I do check the TTS output
         | for length, and I re-generate if it doesn't match my time
         | constraints.
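         | 
         | A sketch of such a check-and-regenerate loop. The synthesize
         | callable is a hypothetical stand-in for whatever TTS engine is
         | used, and the tolerance and retry count are assumptions.

```python
def synthesize_within(text, target_sec, synthesize, tolerance=0.1, max_tries=3):
    """Call a TTS function until its output length is close to target_sec.

    synthesize(text, attempt) -> (audio, duration_sec) is hypothetical.
    Returns the (audio, duration_sec) of the closest attempt.
    """
    best = None
    for attempt in range(max_tries):
        audio, duration = synthesize(text, attempt)
        error = abs(duration - target_sec) / target_sec
        if best is None or error < best[2]:
            best = (audio, duration, error)
        if error <= tolerance:
            break  # close enough, stop regenerating
    return best[0], best[1]
```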
         | 
         | I also have an arranger that tries to figure out when to play
         | an utterance early (i.e. earlier than in the original) in order
         | to make up for the translated version being longer.
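         | 
         | The early-start idea can be sketched as a greedy pass over the
         | utterances. This is an assumed simplification, not the actual
         | arranger: each clip starts early by exactly its overrun, but
         | never before the previous dubbed clip has finished.

```python
def arrange(utterances):
    """Place dubbed clips on the timeline.

    utterances: list of (orig_start, orig_end, dubbed_duration) in
    seconds, sorted by orig_start.
    Returns a list of (start, end) for the dubbed clips.
    """
    placed = []
    prev_end = 0.0
    for orig_start, orig_end, dubbed in utterances:
        overrun = max(0.0, dubbed - (orig_end - orig_start))
        # Start early by the overrun, but never overlap the previous clip.
        start = max(prev_end, orig_start - overrun)
        end = start + dubbed
        placed.append((start, end))
        prev_end = end
    return placed
```

         | When the preceding pause is too short, the clip still ends
         | late, which is how overruns accumulate over a long video.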
         | 
         | I try to make the translations match the speaker's character,
         | as well as the context. So ideally, Alex (Sample 2) will still
         | say "Salut" even in German (instead of translating that
         | greeting, too).
         | 
         | And I need to monitor for speaker changes. This is because I
         | can't clone the voice unless I have a decent amount of sample
         | data. If Elon just says "Yes", cloning the voice based on just
         | that one syllable will make it sound like a robot. But I also
         | can't just blindly grab any voice around it, since that might
         | be somebody else's voice.
        
       | lIIllIIllIIllII wrote:
       | This might be a game-changer for preserving declining languages.
        
         | leobg wrote:
         | I'm not sure.
         | 
         | Making this work is limited by the availability of reliable
         | transcription models. Which, in turn, are limited by the
         | availability of large training corpora. Those don't exist for
         | rare languages.
         | 
         | Also, if people choose to listen to a Nepali speaker through an
         | AI translator, that does give speakers of this language "a
         | voice" - but it doesn't really preserve that language. You
         | might argue, on the contrary, that it may remove any remaining
         | incentives to learn that language.
        
       | azamba wrote:
       | What would be required to add a new output language? E.g.
       | Portuguese? I know it's supported as input.
        
       | crtasm wrote:
       | What potential issues with copyright are there from offering
       | (paid) access to this tool to run on sources including Youtube,
       | and with the output containing the source audio?
        
         | Freak_NL wrote:
         | You're creating a derived work, so you would be violating
         | someone's IP unless the licence is permissive, and making money
         | off it. That usually attracts the attention of whoever owns the
         | IP.
        
           | dns_snek wrote:
           | Who's "you"? I'm obviously not a lawyer, but instinct tells
           | me that end users are the ones creating a derived work by
           | uploading a video they may or may not have the right to
           | distribute. Linked website is just a tool, perhaps a cloud-
           | based one, but still just a tool.
        
             | crtasm wrote:
             | Not just files a user uploads: "You can select videos from
             | YouTube"
        
               | dns_snek wrote:
               | Thanks, I missed that. I can see how that would
               | complicate things.
        
         | leobg wrote:
         | You could have asked the same question when Google started
         | building their index. Or OpenAI trained their models.
         | 
         | I'm in Germany. I'm a licensed lawyer. I see the dangers.
         | 
         | The safest path will be to simply offer the production of multi
         | language translations to content owners themselves. Which is
         | also going to be more efficient - translating the thing at the
         | source, rather than having consumers each create their own
         | translation.
         | 
         | But the original intent for this has been to have my computer
         | translate a video I want to watch with my kids in my private
         | home. Technically, it's not "my computer" in the sense of being
         | just the device that's physically in my home. There's stuff
         | that happens in the cloud. Technically, copies are being made.
         | So one could argue the point.
         | 
         | For most people today, getting your content seen and consumed
         | is the highest you can achieve. To sue someone from another
         | country who cares enough to pay someone else to translate it
         | for him would seem bonkers. But I'm sure there are lawyers who
         | are desperate for work. Who cannot code, and can't be bothered
         | to learn, but still want to do something "in AI". I'll at least
         | give them a hard time. And dare they use ChatGPT hallucinated
         | references on me! :)
        
           | crtasm wrote:
           | I was more thinking Youtube and record labels might take
           | issue with the service, e.g. how they go after stream ripping
           | sites.
           | 
           | Having the creator put a code in their channel description to
           | verify ownership could be a good approach. Thanks for sharing
           | the project!
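           | 
           | A sketch of how such a verification code could be issued
           | and checked. The "speakz-verify-" prefix, the secret
           | handling, and the function names are all hypothetical;
           | fetching the channel description is left out.

```python
import hashlib
import hmac


def verification_code(secret: bytes, channel_id: str) -> str:
    """Derive a short, channel-specific code from a server-side secret."""
    digest = hmac.new(secret, channel_id.encode("utf-8"), hashlib.sha256).hexdigest()
    return "speakz-verify-" + digest[:16]


def description_verifies(secret: bytes, channel_id: str, description: str) -> bool:
    """True if the channel description contains the expected code."""
    return verification_code(secret, channel_id) in description
```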
        
       | hbarka wrote:
       | Wow, this is impressive. Is there anything like it for live
       | translation?
        
         | leobg wrote:
         | Not on my to do list currently. EzDubs say they do live [0].
         | Also, a friend of mine mentioned some Samsung / Android app
         | that does this?
         | 
         | [0] https://ezdubs.ai
        
       | patrickhogan1 wrote:
       | Awesome! Large % of foreign streams have no proper dub.
        
       | godzillabrennus wrote:
       | Would love this for Plex.
        
         | leobg wrote:
         | Would love to. Can you broker a deal? :)
        
       | jeroenhd wrote:
       | I've only ever experienced Dutch dubs in kids' TV but I feel like
       | these examples show that your Dutch model may need some work. I
       | can't judge other languages well, but I found the Taiwanese
       | documentary dub especially hard to follow. I wouldn't have
       | expected Dutch to be in there for how little the language is
       | spoken and how often Dutch speakers will understand English,
       | though!
       | 
       | /offtopic It seems to do a pretty interesting thing where the
       | first male voice has a bit of a Flemish/southern accent while the
       | second male voice has an accent much closer to "Netherlands TV"
       | Dutch. Reminded me a bit of the Lion King dub where the dub
       | studio used Flemish voice actors to do the jungle animals (and
       | Dutch voice actors for the savannah animals) to underline the
       | "different world" Simba arrived in.
        
         | leobg wrote:
         | Yes, that issue is also present in the German translation.
         | 
         | I'm planning to monitor the output quality. Basically, feeding
         | the translated audio back into the transcriber. Then compare it
         | to the original transcript. Like a higher level loss function.
         | I'll need this anyway because I don't speak all of these
         | languages myself. But I can also use it to make the pipeline
         | self-regulate and generate a new, better version if the last
         | one scored too poorly.
        
           | jeroenhd wrote:
           | Interesting, I can see how that approach would catch the
           | weird voice lines.
           | 
           | Just the different ways the languages get picked up and
           | processed by the AI system could be interesting. If you find
           | anything cool, I'd love to read a blog post about it!
        
       | cheriot wrote:
       | Great tool!
       | 
       | Curious if you've made more progress on diarization than what's
       | described in this article?
       | 
       | https://aipressroom.com/streamline-diarization-using-ai-as-a...
        
         | leobg wrote:
         | They use pyannote/speaker-diarization. I tried that, but it
         | wasn't accurate enough for my purposes. Made a confusion matrix
         | with voice samples from The Simpsons characters. It looked...
         | well, confused.
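         | 
         | For reference, a confusion matrix like that can be built from
         | true and predicted speaker labels in a few lines. The label
         | lists are whatever the diarizer under test produces; nothing
         | here is specific to pyannote.

```python
from collections import Counter


def confusion_matrix(true_labels, predicted_labels):
    """Return (speakers, matrix) where matrix[i][j] counts samples of
    true speaker i that were labeled as speaker j."""
    counts = Counter(zip(true_labels, predicted_labels))
    speakers = sorted(set(true_labels) | set(predicted_labels))
    matrix = [[counts[(t, p)] for p in speakers] for t in speakers]
    return speakers, matrix
```

         | A good diarizer puts almost all counts on the diagonal.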
         | 
         | Am using a mix now of speaker embeddings and other signals (end
         | of sentence, pause before/after, etc.). As you can see in the
         | demos, it already works well for interview situations. It's
         | when there are 3+ speakers and they talk over each other that
         | the system gets confused.
        
       | changoplatanero wrote:
       | Why doesn't YouTube have something like this built in?
        
         | madduci wrote:
         | I guess it's computationally expensive? OP states on the
         | website that their solution takes 1 hour of processing for a
         | 30-minute video.
         | 
         | Now imagine offering this for all the YouTube videos available:
         | 
         | - either it's done on their servers (hard to believe due to
         |   high costs)
         | - or it's done on the client side (which is also difficult, due
         |   to lack of processing power)
        
           | leobg wrote:
           | Well, I'm also shamefully unoptimized at the moment.
           | 
           | YouTube added auto-captions years ago. Long before there was
           | Whisper, let alone things like Whisper.cpp. I imagine what
           | I'm doing now is computationally no more expensive than what
           | they did back then.
        
         | Freak_NL wrote:
         | And automatically forced on every user depending on whatever
         | their Google account is set to, just like the video titles
         | which now get auto-translated without any way to turn this off.
         | We're sliding back into a monolingual world.
         | 
         | No thanks.
        
           | leobg wrote:
           | As a native German, I also resent it when
           | Google/Amazon/whoever tries to force a translation on me when
           | I prefer the original language. So I wouldn't want to live in
           | a world where everything would be dubbed into German for me.
           | Not even if they used my tool :)
           | 
           | Regarding YouTube:
           | 
           | AFAIK, YouTube allows you to add multi-lingual voice tracks
           | to your videos. Then, if the viewer has a preferred language
           | set, the video will play in that language. Else in the
           | language inferred by his browser/OS. But the user can also
           | switch back to the original language, or any other language,
           | right in the player.
        
             | Freak_NL wrote:
             | You can't switch to original titles though, so I'm not
             | really confident Google is going to be offering this option
             | for long.
        
       | Gunnerhead wrote:
       | Amazing! I love this for dubbing, but was wondering if anyone
       | knows of an AI powered subtitle generator for YouTube videos? I
       | know YouTube has closed caption, but it's terrible.
        
         | leobg wrote:
         | Speakz actually generates subtitles as a byproduct. The idea is
         | that you put in a video, you select the target languages. And
         | then you get out, for each target language, an audio track and
         | an SRT subtitles file.
         | 
         | Someone else here asked about generating only subtitles, with
         | no audio, as a cheaper option. So I'll probably add that as an
         | option.
        
           | Gunnerhead wrote:
           | I would love that to help learn another language!
        
       | leke wrote:
       | I'm looking for a tool that will take a foreign language and
       | automatically generate subtitles in my language. Anyone know of
       | such a tool?
        
       | ocolegro wrote:
       | So when is the crunchyroll integration rolling out?
        
       | exitb wrote:
       | I suppose you're paying a lot for the voice cloning, so do
       | consider that in many countries voice-overs are done with a
       | single generic voice. Would you consider a lower price point
       | service doing just that?
       | 
       | I'm doing something similar and using GPT-4 for translation.
       | What's unique about it, is that you can specifically prompt it to
       | avoid long translations by rephrasing things, so you can buy
       | yourself some time for the "Fledermaus".
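
A rough sketch of how the "ask for a shorter translation" trick could be
paired with a timing check. Everything here is an assumption for
illustration: the 15 characters per second speaking rate is a made-up
default, the function names are hypothetical, and no model is actually
called. The prompt string would be sent to whatever LLM you use.

```python
CHARS_PER_SECOND = 15.0  # assumed average speaking rate; tune per language and voice


def estimated_duration(text: str, chars_per_second: float = CHARS_PER_SECOND) -> float:
    """Rough speaking time in seconds, counting non-space characters."""
    spoken_chars = sum(1 for ch in text if not ch.isspace())
    return spoken_chars / chars_per_second


def fits_slot(text: str, slot_seconds: float, tolerance: float = 1.1) -> bool:
    """True if the text can plausibly be spoken within the slot (plus tolerance)."""
    return estimated_duration(text) <= slot_seconds * tolerance


def build_prompt(source: str, target_lang: str, slot_seconds: float) -> str:
    """Prompt text asking a translation model to stay within a character budget."""
    budget = int(slot_seconds * CHARS_PER_SECOND)
    return (
        f"Translate into {target_lang}. Keep the meaning, but rephrase so the "
        f"result has at most {budget} non-space characters.\n\nText: {source}"
    )


if __name__ == "__main__":
    print(fits_slot("bat", 0.5), fits_slot("Fledermaus", 0.5))
    print(build_prompt("speed", "German", 2.0))
```

If a translation fails fits_slot, one option is to re-prompt with the budget
and keep the shortest result that still passes.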
        
         | leobg wrote:
         | Using a single speaker makes sense when you're paying for human
         | voice talents. But if you're using a computer to generate the
         | voice, why not generate a voice that sounds like the original
         | speaker? Much more fun to hear Elon speak Chinese :)
        
           | exitb wrote:
           | To be blunt, because of the price. I'm running the whole
           | pipeline of my toy project for much less than $5 per hour. If
           | the voice cloning is the long pole in the tent, I'd just
           | consider dropping it.
           | 
           | Moreover, it's a cultural custom. I'm from Poland and here
           | the voice-over narrator is supposed to be generic and bland,
           | so your brain learns to tune him out and take the emotional
           | cues from the original voice.
        
             | leobg wrote:
             | Yes, in Germany we have generic voice-over narrators, too.
             | In documentaries, etc. They usually match the gender of
             | the speaker, but that's it.
             | 
             | Personally, I read most of my books with the iOS app Voice
             | Dream Reader. That app still uses old TTS voices. They
             | sounded great 3 years ago, but now sound robotic when
             | compared to Elevenlabs or WaveNet. But, as you say, you
             | learn to tune out the voice. I can read entire novels like
             | this, and I still "hear" the different voices and
             | personalities. It just all happens in my imagination.
             | 
             | How much I'd need to charge to make the project worthwhile
             | depends on many factors. And I didn't want to name a price
             | now and then backpedal a month from now and say it'll
             | actually cost more.
             | 
             | My pipeline right now is super unoptimized, to the point of
             | being embarrassing. This can all be made to run much faster
             | and cheaper.
             | 
             | I agree with you that if the voice cloning part of my
             | pipeline accounts for a significant chunk of the cost that
             | the end user pays for the service, I should offer the
             | option of using a "bland" voice instead for a lower price.
        
       | mannycalavera42 wrote:
       | It's a lovely parenting story. Let me tell you there is also a
       | huge opportunity for the opposite use case. My elderly parents
       | speak only one (non-English) language. I would love to have a
       | (cheaper) way to provide my parents with translated videos with
       | the addition of (translated) subtitles. Subs are important
       | because the elderly can have hearing issues. Great work,
       | inclusivity is love.
        
         | felixarba wrote:
         | I second this! I was recently looking into a way to build
         | something like this for my grandfather, but wasn't even sure
         | where to start from the hardware side.
         | 
         | I wanted to have hardware plug into TV receiver, generate
         | subtitles for live TV program and then play it back on TV.
         | Delay would likely be less than a minute but even a few minutes
         | is not a problem really.
         | 
         | Many people with a hearing problem would benefit from this and
         | with AI getting so good at Speech-to-text, this can be done for
         | quite a large population.
         | 
         | If anyone has a recommendation on where to start with this, I'd
         | appreciate it! Was thinking of using Whisper for subtitle
         | generation, but not sure about hardware that can take in and
         | output HDMI and run this software.
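
On the software side, a delayed live-caption loop could look roughly like
this. It is a sketch under assumptions: audio arrives as a stream of samples,
and `transcribe` is a placeholder callable standing in for a real
speech-to-text call (Whisper or anything else). HDMI capture and overlay are
not covered.

```python
from typing import Callable, Iterable, Iterator, List, Tuple


def windows(
    samples: Iterable[float], sample_rate: int, window_s: float
) -> Iterator[Tuple[float, List[float]]]:
    """Yield (start_time_s, chunk) pairs of window_s seconds; the last may be shorter."""
    size = int(sample_rate * window_s)
    buffer: List[float] = []
    emitted = 0  # number of samples already handed out
    for sample in samples:
        buffer.append(sample)
        if len(buffer) == size:
            yield emitted / sample_rate, buffer
            emitted += size
            buffer = []
    if buffer:
        yield emitted / sample_rate, buffer


def caption_stream(
    samples: Iterable[float],
    sample_rate: int,
    window_s: float,
    transcribe: Callable[[List[float]], str],
) -> Iterator[Tuple[float, float, str]]:
    """Yield (start_s, end_s, text) captions, one per non-empty transcribed window."""
    for start, chunk in windows(samples, sample_rate, window_s):
        text = transcribe(chunk)  # placeholder for the real speech-to-text call
        if text:
            yield start, start + len(chunk) / sample_rate, text


if __name__ == "__main__":
    fake_audio = [0.0] * 25
    for caption in caption_stream(fake_audio, 10, 1.0, lambda c: f"{len(c)} samples"):
        print(caption)
```

The window length sets the minimum delay: a caption can only appear after its
whole window has been captured and transcribed, which fits the "less than a
minute" tolerance mentioned above.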
        
           | leobg wrote:
           | I keep thinking about something similar. Also hardware. Also
           | for my grandparents.
           | 
           | My grandma is 95. Her vision is bad. Even using the phone
           | (I'm talking old school landline) is getting hit and miss,
           | because she can't see the buttons.
           | 
           | Years ago, I set her up with an Echo Show. That works well
           | enough for her to say "Call Leo". But Alexa is dumb.
           | Sometimes, she'll mishear something and start playing music.
           | Or start a monthly subscription... :)
           | 
           | So what I'd like:
           | 
           | - box
           | - screen
           | - far-field mic array
           | - AI backend
           | 
           | You could do a number of things with it:
           | 
           | - manage a grocery shopping list (AI will notice duplicates
           |   and other oddities and ask)
           | - communicate with the outside world (initiate calls, send
           |   emails and faxes, including to local businesses)
           | - optional human oversight and/or permission settings
           |   (preventing the AI, say, from ordering groceries for more
           |   than $50 a week without a family member approving the
           |   order)
           | 
           | Something like your "subtitle mode" could also work:
           | 
           | "Listen to what is currently being spoken in the room
           | (including the TV), and display it on the screen".
           | 
           | My grandma has her TV running all day. So maybe one could
           | ditch the screen and make it a "set top box". Add an IR port
           | to it, so it can also control the TV itself. Something like
           | that might work.
        
         | leobg wrote:
         | I'm generating translated subtitles internally, before
         | generating the voice-over. Also, generating those subtitles is
         | way, way cheaper. If someone just wants the subtitles, I could
         | offer them.
         | 
         | Bigger question is: What device are your parents using, and
         | what content sources? Because I'd need to be able to download
         | the audio, and inject the subtitles. With a regular TV, I
         | wouldn't know how to do that.
        
           | mannycalavera42 wrote:
           | Android device (either phone or tablet) that they can then
           | send over to a chromecast
           | 
           | the chromecast could be a nice to have, not super necessary.
           | they can put the tablet on a table close to them
        
       | Timwi wrote:
       | Would have been nice to get a download link for a program I can
       | just run locally.
        
         | bbitmaster wrote:
         | Alas! Didn't get the memo? We're in the era where AI tools are
         | all "software as a service," and you must pay for individual
         | inferences from the model. How could they charge for inferences
         | if they gave you the model to download?
        
           | Timwi wrote:
           | Amen, bro.
        
           | leobg wrote:
           | I don't have access to any special model that I'm holding
           | back from you in order to rent-seek.
           | 
           | Anyone can learn to build something like this. The parts are
           | all available out there. There's Whisper. There's Mistral-7b.
           | There's Tortoise, Coqui, SV2TTS. There's Python.
           | 
           | The bigger question is:
           | 
           | Would you want to?
           | 
           | I've been building web apps for several years now. I've sunk
           | thousands of dollars into those projects. And literally years
           | of my time. If I calculated my hourly wage, I'd be below a
           | teenager mowing lawns. In Rwanda. And by a factor of 10x,
           | probably.
           | 
           | The real ROI here is the learning. And that's not something
           | I'm "taking" from anyone.
        
             | rubymamis wrote:
             | You are awesome, man! Keep it going and all the best of
             | luck!
        
       | welder wrote:
       | I do the opposite and seek out videos in other languages for my
       | kids to watch. Now they're learning German, Spanish, Chinese, and
       | Japanese.
        
       | android521 wrote:
       | any opensource available?
        
       | ipsum2 wrote:
       | Does this use SeamlessM4T or similar projects?
        
       | pavelboyko wrote:
       | Please consider adding Simplified English as an output language
       | option, preferably with a level, e.g., A2, B1, etc. This way, I
       | can adjust the language complexity to my kids' level and then
       | gradually remove the crutches as they improve in English.
        
         | leobg wrote:
         | Yes! I love this!
         | 
         | So you'd be translating English to Simplified English? Or are
         | you talking from another source language?
         | 
         | I've already been playing with this concept w.r.t. books:
         | 
         | I take a non-fiction book. I'll have an LLM translate it with a
         | specific audience in mind (say, a 7 year old girl with a
         | certain background), explaining concepts and words that are
         | likely unknown to that audience. And then converting the whole
         | thing into an audiobook. Optional parental controls built in
         | ("exclude violence", etc.). Nowhere near showtime, though.
         | 
         | Another thing I'd love to work on is filtering existing
         | content. There are millions of videos on YouTube. Right now,
         | finding quality stuff that's fun to watch with my kid depends a
         | lot on dumb luck. But what if I could filter by topic (semantic
         | whitelist/blacklist, i.e. not keyword dependent), personality
         | traits (OCEAN, MBTI), values (e.g. "curiosity") and language
         | (reading level, vocabulary, words per minute, etc.)? I'd love
         | that.
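
The "words per minute" part of such a filter is easy to sketch if a timed
transcript is available. This assumes segments shaped as
(start_seconds, end_seconds, text); the function names and thresholds are
hypothetical, and the semantic filters (topic, personality, values) are not
covered here.

```python
def words_per_minute(segments) -> float:
    """Speaking rate over (start_s, end_s, text) segments, ignoring pauses between them."""
    words = sum(len(text.split()) for _, _, text in segments)
    spoken_seconds = sum(end - start for start, end, _ in segments)
    if spoken_seconds <= 0:
        return 0.0
    return words * 60.0 / spoken_seconds


def is_slow_enough(segments, max_wpm: float) -> bool:
    """True if the video's speaking rate does not exceed the chosen limit."""
    return words_per_minute(segments) <= max_wpm


if __name__ == "__main__":
    transcript = [
        (0.0, 30.0, " ".join(["word"] * 50)),
        (40.0, 70.0, " ".join(["word"] * 40)),
    ]
    print(words_per_minute(transcript), is_slow_enough(transcript, 100))
```

Reading level and vocabulary checks could be added as further predicates over
the same segment list.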
        
           | vidarh wrote:
           | I'd love what they suggested as well, for other languages.
           | I'm working on improving my French (and occasionally German),
           | and I'm at a stage where I can follow along _some_ French
           | shows reasonably well if they're not speaking too fast (one
           | of the first French phrases my French teacher in school
           | taught us was "plus lentement, s'il vous plait" -
           | "slower/slowly please", for a _reason_), and if they're not
           | speaking any particularly difficult accents, and not too much
           | slang, but it's limiting and I'm often forced to keep English
           | subtitles on as a consequence and it's sometimes too much of
           | a crutch. It doesn't help that my hearing isn't what it was.
           | 
           | Being able to "step down" the difficulty so that I can either
           | turn off subtitles entirely or rely on French subtitles, or
           | even match "difficult speech" and "simple subtitles" or vice
           | versa seems like it'd be very useful in getting over that
           | hump faster.
        
       | stuckkeys wrote:
       | Cool project. What is the tech stack behind it? I can already
       | guess a few. 11labs, OpenAI...dreamtalk. There are so many similar
       | services to what you are doing. What sets you apart? You should
       | partner with local media or outside. Good luck!
       | 
       | Check out https://www.flawlessai.com they've been around since 2018.
       | Interesting stuff. When I first saw it in 2021 I was blown away.
        
       | YKreator wrote:
       | Congratulations on this project! We spent 6 years developing the
       | best solution for generating perfect subtitles automatically
       | (https://www.Checksub.com). 2 years ago, with the arrival of new
       | generations of AI, we decided to go a step further and add the
       | possibility of automatically dubbing videos. But automatic
       | dubbing requires manual adjustments for a comfortable result for
       | the audience. For example, https://www.HeyGen.com generates a
       | video automatically, but offers very few editing options. That's
       | why we focus on two things:
       | 
       | 1 - to provide the best possible automatic quality
       | 2 - to offer an advanced editor that lets you fine-tune your
       |     dubbing without having to go back to editing software.
       | 
       | In any case, I'm delighted to see people working on this problem.
       | I hope it will help develop this sector.
        
         | leobg wrote:
         | Great website! You're based in France? You should put a demo on
         | your website. If there is one, I don't see it (or rather:
         | hear). If you're interested in collaborating, my email is in my
         | bio.
        
         | luxpir wrote:
         | Thanks for Checksub, another happy user here.
         | 
         | We take multilingual, AI-cloned audio from you guys (split from
         | background noise, of course), after we've processed the subs
         | professionally at our end, then we align everything in your
         | tool and send off to a third service for lip sync. The result
         | has blown away a few clients now. The CEO speaking 5 languages
         | with perfect lip sync in their own tone of voice is quite
         | convincing.
         | 
         | Hopefully we can get it all in one tool soon.
        
       | hombre_fatal wrote:
       | What a great application of AI. The samples were amazing.
        
         | leobg wrote:
         | Thank you!
        
       | k__ wrote:
       | Would be awesome if it could also deep-fake the voice.
        
       | iamjackg wrote:
       | Is this using XTTS? I recognize a funny/weird glitch with Italian
       | voices saying "punto" (full stop) at the end of every sentence.
        
       | poulsbohemian wrote:
       | This is interesting because I'm the opposite of you - a German
       | speaking American, who watches a lot of German language content
       | on YouTube. Are you specifically looking for children's content?
       | I ask because almost anything I would watch in English, I can
       | find an equivalent content producer in German.
        
       | poulsbohemian wrote:
       | You deserve a lot of credit for doing this... often when I am
       | watching German content that has English subtitles, I wonder if
       | the subtitles are already being produced by AI... I sometimes
       | find the subtitles more confusing (even though they are in my
       | native language) than the German, as though they almost had to
       | have been automatically produced rather than by an actual
       | translator using contextual clues, etc.
        
       | pell wrote:
       | This is very impressive. I think the way you are timing the audio
       | is clever. What kind of model are you using?
        
       ___________________________________________________________________
       (page generated 2024-02-28 23:01 UTC)