[HN Gopher] Ask HN: Real-time speech-to-speech translation
       ___________________________________________________________________
        
       Ask HN: Real-time speech-to-speech translation
        
       Has anyone had any luck with a free, offline, open-source, real-
       time speech-to-speech translation app on under-powered devices
       (i.e., older smart phones)?
        
       * https://github.com/ictnlp/StreamSpeech
       * https://github.com/k2-fsa/sherpa-onnx
       * https://github.com/openai/whisper
        
       I'm looking for a simple app that can listen for English,
       translate into Korean (and other languages), then perform speech
       synthesis on the translation. Basically, a Babelfish that doesn't
       stick in the ear. Although real-time would be great, a max
       5-second delay is manageable. RTranslator is awkward (couldn't get
       it to perform speech-to-speech using a single phone). 3PO sprouts
       errors like dandelions and requires an online connection. Any
       suggestions?
        
       Author : thangalin
       Score  : 131 points
       Date   : 2024-10-25 02:55 UTC (4 days ago)
        
       | ohlookcake wrote:
       | I've been looking for something like this (Not for Korean though)
       | and I'd even be happy to pay - though I'd prefer to pay by usage
       | rather than a standing subscription fee. So far, no luck, but
       | watching this thread!
        
         | thangalin wrote:
         | RTranslator is quite close. It needs a TTS for the target
         | language to be installed.
        
       | billylo wrote:
       | Author of 3PO here: check out our latest version 2.12. Many fixes
       | have been incorporated in the past two weeks. Cheers.
        
         | thangalin wrote:
         | > Many fixes have been incorporated in the past two weeks.
         | 
         | Thanks for the efforts! Still many fixes to go, though: I used
         | the version from two days ago, which had numerous issues. Also,
         | 3PO isn't offline, so I won't be pursuing it.
        
           | billylo wrote:
           | No problem! Some fixes were on the server side. We had a
           | server-side issue a couple of days ago for a few hours. You
           | may have been affected by it, giving you those errors. They
           | have been fixed too. Take care!
        
         | supermatt wrote:
         | It works really well for me, but I wish it supported more
         | input languages - I guess this is a limitation of the model
         | you are using? Do you mind giving some info about the tech
         | stack? I love how fast it is.
        
           | billylo wrote:
           | Thanks for the kind words! It is built on top of a fairly
           | straightforward infrastructure. Client side is C#, .NET,
           | MAUI. Server side is Firebase Cloud Functions/Firestore/Web
           | Hosting + Azure.
        
             | supermatt wrote:
             | Thanks - I was wondering more about the translation
             | pipeline - I am assuming something like Whisper and M2M100?
             | How did you manage to get the latency so low? Other
             | examples I have seen feel really sluggish in comparison.
        
               | billylo wrote:
               | Fairly basic there... chunking audio (websocket) and
               | sending the chunks for translation sequentially.
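               | 
               | A rough sketch of that chunk-and-send pattern in
               | Python (the endpoint URL and message format below
               | are made up for illustration, not the actual 3PO
               | API):
               | 
               |   import asyncio
               |   import websockets  # pip install websockets
               | 
               |   async def stream(chunks):
               |       # chunks: an iterable of raw audio bytes
               |       uri = "wss://example.test/translate"
               |       async with websockets.connect(uri) as ws:
               |           for chunk in chunks:
               |               await ws.send(chunk)    # audio out
               |               print(await ws.recv())  # text back
               | 
               |   # asyncio.run(stream(audio_chunks))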
        
       | yuryk wrote:
       | moonshine?
        
         | Fairburn wrote:
         | https://news.ycombinator.com/item?id=41960085
        
       | barrenko wrote:
       | Expecting a boom in the speech-to-speech market in the following
       | months. It's the next thing.
        
         | jillesvangurp wrote:
         | It's been the next thing ever since Bill Gates said so around
         | the time of the Windows 95 launch.
         | 
         | But it does feel like we are close to getting a proper
         | babelfish type solution. The models are good enough now.
         | Especially the bigger ones. It's all about UX and packaging it
         | up now.
        
       | v7n wrote:
       | It's not exactly what OP wants out of the box, but if anyone is
       | considering building one, I suggest taking a look at this.1 It
       | is really easy to tinker with and can run either on-device or in
       | a client-server model. It has the required speech-to-text and
       | text-to-speech endpoints, with multiple built-in options for
       | each. If you can get the LLM assistant part of the pipeline to
       | perform translation to a degree you're comfortable with, this
       | could be a solution.
       | 
       | 1 https://github.com/huggingface/speech-to-speech
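       | 
       | For a sense of the moving parts, here is a rough sketch of the
       | same STT -> translate -> TTS chain using plain transformers
       | pipelines (the model names are only illustrative choices, not
       | what the repo above ships with):
       | 
       |   from transformers import pipeline
       | 
       |   asr = pipeline("automatic-speech-recognition",
       |                  model="openai/whisper-small")
       |   mt = pipeline("translation",
       |                 model="facebook/nllb-200-distilled-600M",
       |                 src_lang="eng_Latn", tgt_lang="kor_Hang")
       |   tts = pipeline("text-to-speech", model="suno/bark-small")
       | 
       |   text = asr("input.wav")["text"]           # speech -> text
       |   korean = mt(text)[0]["translation_text"]  # -> Korean
       |   audio = tts(korean)  # dict: "audio", "sampling_rate"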
        
         | dmezzetti wrote:
         | A similar option exists with txtai
         | (https://github.com/neuml/txtai).
         | 
         | https://neuml.hashnode.dev/speech-to-speech-rag
         | 
         | https://www.youtube.com/watch?v=tH8QWwkVMKA
         | 
         | One would just need to remove the RAG piece and use a
         | Translation pipeline
         | (https://neuml.github.io/txtai/pipeline/text/translation/).
         | They'd also need to use a Korean TTS model.
         | 
         | Both this and the Hugging Face speech-to-speech projects are
         | Python though.
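         | 
         | A minimal sketch of that swap, assuming the pipeline names
         | from the txtai docs (the default TTS voice is English-only, so
         | a Korean-capable TTS model still has to be plugged in):
         | 
         |   from txtai.pipeline import Transcription, Translation
         |   from txtai.pipeline import TextToSpeech
         | 
         |   transcribe = Transcription()
         |   translate = Translation()
         |   tts = TextToSpeech()  # swap in a Korean voice model
         | 
         |   text = transcribe("input.wav")  # English speech -> text
         |   korean = translate(text, "ko")  # text -> Korean text
         |   speech = tts(korean)            # text -> audio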
        
           | authorfly wrote:
           | Your library is quite possibly the best example of effortful,
           | understandable and useful work I have ever seen - principally
           | evidenced by how you keep evolving with the times. For years
           | I've seen you keep it up to date, and even on the cutting
           | edge, through multiple NLP mini-revolutions (sentence
           | embeddings and new uses) and what must have been the annoying
           | release of LLMs, and still push on toward an explainable and
           | useful library.
           | 
           | Code from txtai just feels like exactly the right way to
           | express what I am usually trying to do in NLP.
           | 
           | My highest commendations. If you ever have time, please share
           | your experience/what led you to take this path with txtai.
           | For example, I see you started in earnest around August 2020
           | (maybe before) - I would love to know if, at that time, you
           | imagined LLMs becoming as prominent as they are now and
           | instruction-tuning working as well as it does. I know that
           | back then many PhD students I knew in NLP (and profs) felt
           | LLMs were far too unreliable and would not reach, e.g.,
           | consistent scores on MMLU/HellaSwag.
        
             | dmezzetti wrote:
             | I really appreciate that! Thank you.
             | 
             | It's been quite a ride from 2020. When I started txtai, the
             | first use case was RAG in a way. Except instead of an LLM,
             | it used an extractive QA model. But it was really the same
             | idea: get a relevant context, then find the useful
             | information in it. LLMs just made it much more "creative".
             | 
             | Right before ChatGPT, I was working on semantic graphs.
             | ChatGPT took the wind out of those sails for a while,
             | until GraphRAG came along. Adding the LLM framework into
             | txtai during 2023 was definitely a detour.
             | 
             | The next release will be a major release (8.0) with agent
             | support (https://github.com/neuml/txtai/issues/804). I've
             | been hesitant to buy into the "agentic" hype as it seems
             | quite convoluted and complicated at this point. But I
             | believe there are some wins available.
             | 
             | In 2024, it's hard to get noticed. There are tons of RAG
             | and Agent frameworks. Sometimes you see something trend and
             | surge past txtai in terms of stars in a matter of days.
             | txtai has 10% of the stars of LangChain but I feel it
             | competes with it quite well.
             | 
             | Nonetheless I keep chugging along because I believe in the
             | project and that it can solve real-world use cases better
             | than many other options.
        
               | okwhateverdude wrote:
               | I have a dozen or so tabs open at the moment to wrap my
               | head around txtai and its very broad feature set. The
               | plethora of examples is nice even if the python idioms
               | are dense. The semantic graph bits are of keen interest
               | for my use case, as are the pipelines and workflows. I
               | really appreciate you continuing to hack on this.
        
               | dmezzetti wrote:
               | You got it. Hopefully the project continues its slow
               | growth trajectory.
        
       | _gmax0 wrote:
       | Off topic, but what's the state of the art in speech recognition
       | models at the moment?
       | 
       | Are people still using DTW + HMMs?
        
         | IshKebab wrote:
         | HMMs haven't been state of the art in speech recognition for
         | decades (i.e. since it actually got good). It's all end-to-end
         | DNNs now. Basically raw input -> DNN -> ASCII.
         | 
         | Well, almost - last I checked they feed a Mel spectrogram into
         | the model rather than raw audio samples.
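         | 
         | For anyone curious, a minimal sketch of that front end with
         | torchaudio (the filter-bank settings below are typical
         | Whisper-style values for 16 kHz audio, not tied to any
         | particular model):
         | 
         |   import torch
         |   import torchaudio
         | 
         |   # Load audio (assumed 16 kHz) and build log-Mel features
         |   waveform, sr = torchaudio.load("speech.wav")
         |   mel = torchaudio.transforms.MelSpectrogram(
         |       sample_rate=sr, n_fft=400,
         |       hop_length=160, n_mels=80)(waveform)
         |   logmel = torch.log(mel + 1e-6)  # what the DNN sees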
        
           | pcwelder wrote:
           | > state of the art in speech recognition for decades
           | 
           | Decades doesn't sound right. Around 2019, the Jasper model
           | was SOTA among e2e models but was still slightly behind a
           | non-e2e model with an HMM component:
           | https://arxiv.org/pdf/1904.03288
        
       | autumnstwilight wrote:
       | Is this possible to do smoothly with languages that have an
       | extremely different grammar to English? If you need to wait until
       | the end of the sentence to get the verb, for instance, then that
       | could take more than five seconds, particularly if someone is
       | speaking off the cuff with hesitations and pauses (Or you could
       | translate clauses as they come in, but in some situations you'll
       | end up with a garbled translation because the end of the sentence
       | provides information that affects your earlier translation
       | choices).
       | 
       | AFAIK, humans who do simultaneous interpretation are provided
       | with at least an outline, if not full script, of what the speaker
       | intends to say, so they can predict what's coming next.
        
         | thangalin wrote:
         | > If you need to wait until the end of the sentence to get the
         | verb, for instance, then that could take more than five seconds
         | 
         | I meant a five-second delay after the speaker finishes talking
         | or the user taps a button to start the translation, not
         | necessarily a five-second rolling window.
        
         | Gathering6678 wrote:
         | > AFAIK, humans who do simultaneous interpretation are provided
         | with at least an outline, if not full script
         | 
         | They are usually provided with one, but it is by no means
         | necessary. SI is never truly simultaneous and will have a
         | delay, and the interpreter will also predict based on the
         | context. That makes certain languages a bit more difficult to
         | work with, e.g. Japanese, whose sentences I believe often
         | place the predicate after the object, rather than the usual
         | subject-predicate-object order, making the "prediction" part
         | harder.
        
       | jansan wrote:
       | I just realized I will actually see a real Babelfish hitting the
       | market in my lifetime. Amazing times indeed.
        
         | billylo wrote:
         | The tech is really here. This summer, I was fascinated by the
         | accuracy of the spoken-language auto-detection capabilities. It
         | really works, and it only needs 1-2 seconds to catch the nuance
         | of a specific language.
         | 
         | So, I ventured into building 3PO: https://3po.evergreen-labs.org
         | 
         | Would love to hear everyone's feedback here.
        
       | tkgally wrote:
       | It's not free, but I've had some success using ChatGPT's Advanced
       | Voice mode for sequential interpreting between English and
       | Japanese. I found I had to first explain the situation to the
       | model and tell it what I wanted it to do. For example: "I am
       | going to have a conversation with my friend Taro. I speak
       | English, and he speaks Japanese. Translate what I say into
       | Japanese and what he says into English. Only translate what we
       | say. Do not add any explanations or commentary."
       | 
       | We had to be careful not to talk over each other or the model,
       | and the interpreting didn't work well in a noisy environment. But
       | once we got things set up and had practiced a bit, the
       | conversations went smoothly. The accuracy of the translations was
       | very good.
       | 
       | Such interpreting should get even better once the models have
       | live visual input so that they can "see" the speakers' gestures
       | and facial expressions. Hosting on local devices, for less
       | latency, will help as well.
       | 
       | In business and government contexts, professional human
       | interpreters are usually provided with background information in
       | advance so that they understand what people are talking about and
       | know how to translate specialized vocabulary. LLMs will need
       | similar preparation for interpreting in serious contexts.
        
         | bemmu wrote:
         | I tried this a few times since my family speaks Finnish and my
         | son speaks Japanese, but the issue is that it keeps forgetting
         | the task.
         | 
         | It'll work at first for a sentence or two, then the other party
         | asks something and instead of translating the question, it will
         | attempt to answer the question. Even if you remind it of its
         | task it quickly forgets again.
        
           | tkgally wrote:
           | That happened once or twice for me, too. I wonder if an
           | interpreting-specific system prompt would prevent that
           | problem....
        
       | Terr_ wrote:
       | > * https://github.com/openai/whisper
       | 
       | I would be very concerned about any LLM model being used for
       | "transcription", since they may injecting things that nobody
       | said, as in this recent item:
       | 
       | https://news.ycombinator.com/item?id=41968191
        
         | ben_w wrote:
         | They list the error rate on the git repo directly; it was never
         | _good_, even when it was _the best_.
         | 
         | I saw mediocre results from the biggest model even when I gave
         | it a video of Tom Scott speaking at the Royal Institution where
         | I could be extremely confident about the quality of the
         | recording.
        
           | mmcwilliams wrote:
           | WER is a decent metric for comparing models, but there's a
           | difference between mistranscribing "effect" for "affect" and
           | the kind of hallucinations Whisper has. I've run thousands of
           | hours of audio through it for comparisons to other models,
           | and the kind of thing you see Whisper inventing out of whole
           | cloth is phrases like "please like and subscribe" in periods
           | of silence. To me it suggested that it's trained on a lot of
           | YouTube.
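           | 
           | As a toy illustration of why WER alone hides this (jiwer is
           | a common WER library; the strings are made up):
           | 
           |   import jiwer  # pip install jiwer
           | 
           |   ref = "the effect was small"
           |   hyp1 = "the affect was small"  # benign substitution
           |   hyp2 = ref + " please like and subscribe"
           |   print(jiwer.wer(ref, hyp1))  # 0.25: one substitution
           |   print(jiwer.wer(ref, hyp2))  # 1.0: four insertions
           |   # Both just register as errors; the score says nothing
           |   # about whether content was invented.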
        
             | ben_w wrote:
             | Interesting; that's certainly a bigger baseline than the
             | one hour or so that I tried, which wasn't big enough to
             | reveal that.
        
       | thrdbndndn wrote:
       | > free
       | 
       | > offline
       | 
       | > real-time
       | 
       | > speech-to-speech translation app
       | 
       | > on under-powered devices
       | 
       | I genuinely don't think the technology is there.
       | 
       | I can't even find a half-good real-time "speech to second
       | language text" tool, not even with "paid/online/on powerful
       | device" options.
        
         | maeil wrote:
         | > I genuinely don't think the technology is there.
         | 
         | Definitely true for OP's case, especially for non-trivial
         | language pairs. For the best case scenario, e.g.
         | English<>German, we can probably get close.
         | 
         | > I can't even find a half-good real-time "speech to second
         | language text" tool, not even with "paid/online/on powerful
         | device" options.
         | 
         | As in "you speak and it streams the translated text"?
         | translate.google.com with voice input and a more mobile-
         | friendly UI?
        
           | thrdbndndn wrote:
           | The problem with Google is its translation quality. Not sure
           | about Korean, but Japanese/English (either way) definitely
           | isn't there.
           | 
           | For Japanese to English, the transcription alone is already
           | pretty inaccurate (usable if you know some Japanese; but then
           | again you already know Japanese!)
        
             | sebzim4500 wrote:
             | Interesting, I was in Japan a few months ago and I found
             | Google Translate to be pretty good. Even when hotels etc.
             | provided information in English, I found it was better to
             | use Google Lens on the Japanese information.
             | 
             | I can't say much about the quality of English -> Japanese
             | translation, except that people were generally able to
             | understand whatever came out of it.
        
               | thrdbndndn wrote:
               | It's usable as a tool for quick communication or reading
               | instructional text.
               | 
               | But don't expect to be able to use it to read actual
               | literature or, back to the topic, to subtitle a TV series
               | or a YouTube video without misunderstanding.
        
       | sva_ wrote:
       | Samsung Interpreter might be the closest, but it is neither free
       | nor does it work on low-power devices.
        
       | alexisread wrote:
       | This phone has been around for ages, and does the job. It's well
       | weapon! https://www.neatoshop.com/product/The-Wasp-T12-Speechtool
        
       | nacnud wrote:
       | A friend recommends SayHi, which does near-realtime speech-to-
       | speech translation (https://play.google.com/store/apps/details?id
       | =com.sayhi.app&...). Unfortunately it's not offline though.
        
         | itake wrote:
         | Is that the same app? That seems like a social/dating app. This
         | reddit thread suggests the SayHi app was discontinued
         | 
         | https://www.reddit.com/r/language/comments/1elpv37/why_is_sa...
        
       | bbstats wrote:
       | Doesn't Google Translate do this?
        
         | itake wrote:
         | Maybe... Google doesn't support smaller languages for STT.
         | 
         | My other gripe with these tools is that if there is background
         | noise, they are pretty useless. You can't use them in a crowded
         | room.
        
           | squarefoot wrote:
           | In most noisy contexts, a throat microphone should work [1].
           | A small Bluetooth one could also connect to small earpieces
           | to make a wearable speech interface; its bigger battery
           | could be concealed in the mic and last much longer than
           | usual earbuds.
           | 
           | [1]: https://en.wikipedia.org/wiki/Throat_microphone
        
             | itake wrote:
             | I usually need a translation app when I am traveling. I
             | can't just ask a stranger to tape on a throat mic and wear
             | my headphones to have a conversation with me, though.
        
       | bool3max wrote:
       | FREE and OFFLINE and OPEN SOURCE and REAL-TIME on UNDER-POWERED
       | devices?
        
         | turnsout wrote:
         | This is why it would be tough to be an AI startup in 2024...
         | totally unrealistic customer expectations
        
           | egoisticalgoat wrote:
           | Which usually stem from AI startups making wildly unrealistic
           | promises; it's all a very unfun cat and mouse game.
        
         | userbinator wrote:
         | I think it's certainly possible if you compromise on accuracy.
         | https://en.wikipedia.org/wiki/Dragon_NaturallySpeaking has been
         | around since the late 90s; there are various (rather robotic-
         | sounding) free speech synths available which don't require much
         | processing power at all (look at the system requirements of
         | https://en.wikipedia.org/wiki/Software_Automatic_Mouth ), and
         | of course machine translation has been an active topic of
         | research since the mid-20th century.
         | 
         | IMHO it's unfortunate that everyone jumps to "use AI!" as the
         | default now, when very competitive approaches that have been
         | developed over the past few decades could provide decent
         | results but at a fraction of the computing resources, i.e. a
         | much higher efficiency.
        
         | ccozan wrote:
         | Yes, but why OFFLINE? Today's world is so super-connected that
         | I am wondering why OP is asking for this requirement.
        
           | thangalin wrote:
           | > why OFFLINE
           | 
           | Why online? Why would I want some third-party to (a) listen
           | to my conversations; (b) receive a copy of my voice that
           | hackers could download; (c) analyze my private conversations
           | for marketing purposes; (d) hobble my ability to translate
           | when their system goes down, or permanently offline; or (e)
           | require me to pay for a software service that's feasible to
           | run locally on a smart phone?
           | 
           | Why would I want to have my ability to translate tied to
           | internet connectivity? Routers can fail. Rural areas can be
           | spotty. Cell towers can be downed by hurricanes. Hikes can
           | take people out of cell tower range. People are not always
           | inside of a city.
        
           | itake wrote:
           | Hosting is annoyingly expensive. Ping latency between us-
           | east-1 and ap-southeast-1 is 230 ms, so you either set up
           | shop in one location or go multi-region (which adds up).
           | 
           | Also, there are many environments (especially when you
           | travel) where your phone is not readily connected.
        
       | lma21 wrote:
       | Real-time and under-powered, no way. All the available tools (and
       | models) today require non-negligible hardware.
        
       | NickC25 wrote:
       | >Although real-time would be great, a max 5-second delay is
       | manageable.
       | 
       | Humans can't even do this in immediate real-time, so what makes
       | you think a computer can? Some of the best real-time translators
       | work at the UN or for governments still have a short delay to be
       | able to correctly interpret and translate for accuracy and
       | context. Doing so in real-time actually impedes the translator
       | from working correctly - especially in languages that have
       | different grammatical structures. Even in languages that are
       | effectively congruent (think Latin derivatives), this is hard, if
       | not outright impossible to do in real time.
       | 
       | I worked in the field of language education and computer science.
       | The tech you're hoping would be free and able to run on older
       | devices is easily a decade away at the very best. As for it being
       | offline, yeah, no. Not going to happen, because accurate real-
       | time translation of even a database of the 20 most common
       | languages on earth is probably a few terabytes at the very
       | least.
        
       | ladidahh wrote:
       | Only seems to cover half of what you're asking for... Starred
       | this the other day and haven't gotten around to trying it out:
       | 
       | https://github.com/usefulsensors/moonshine
        
       | sahbasanai wrote:
       | It is impossible to accurately interpret with a max 5-second
       | delay. The structure of some languages requires the interpreter
       | to occasionally wait for the end of a statement before the start
       | of interpretation is possible.
        
       | Fairburn wrote:
       | 'Meanwhile, the poor Babel fish, by effectively removing all
       | barriers to communication between different races and cultures,
       | has caused more and bloodier wars than anything else in the
       | history of creation.'
        
       | EngineerDraft wrote:
       | I've developed a macOS app, BeMyEars, which can do real-time
       | speech-to-text translation. It first transcribes and then
       | translates between languages. All of this works on-device. If
       | you only want a smartphone app, you can also try YPlayer, which
       | also works on-device. Both can be downloaded from the App Store.
        
       ___________________________________________________________________
       (page generated 2024-10-29 23:00 UTC)