[HN Gopher] Ask HN: Real-time speech-to-speech translation
___________________________________________________________________
Ask HN: Real-time speech-to-speech translation
Has anyone had any luck with a free, offline, open-source, real-
time speech-to-speech translation app on under-powered devices
(i.e., older smart phones)?

* https://github.com/ictnlp/StreamSpeech
* https://github.com/k2-fsa/sherpa-onnx
* https://github.com/openai/whisper

I'm looking for a simple app that can listen for English,
translate into Korean (and other languages), then perform speech
synthesis on the translation. Basically, a Babelfish that doesn't
stick in the ear. Although real-time would be great, a max
5-second delay is manageable. RTranslator is awkward (couldn't get
it to perform speech-to-speech using a single phone). 3PO sprouts
errors like dandelions and requires an online connection. Any
suggestions?
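For reference, the individual offline pieces exist in Python. A
rough sketch of the first two stages (Whisper for recognition plus
an NLLB model for translation; the model choices are only
illustrative, and the Korean TTS stage is still missing):

    # pip install openai-whisper transformers sentencepiece
    import whisper
    from transformers import pipeline

    # Stage 1: offline English speech-to-text ("tiny" keeps the
    # footprint small on weak hardware, at some cost in accuracy).
    asr = whisper.load_model("tiny")
    english = asr.transcribe("input.wav", language="en")["text"]

    # Stage 2: offline English -> Korean text translation.
    translate = pipeline("translation",
                         model="facebook/nllb-200-distilled-600M",
                         src_lang="eng_Latn", tgt_lang="kor_Hang")
    korean = translate(english)[0]["translation_text"]
    print(korean)  # Stage 3 (Korean TTS) needs another model.

The hard part is packaging all three stages so they run acceptably
on an older phone.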
Author : thangalin
Score : 131 points
Date : 2024-10-25 02:55 UTC (4 days ago)
| ohlookcake wrote:
| I've been looking for something like this (Not for Korean though)
| and I'd even be happy to pay - though I'd prefer to pay by usage
| rather than a standing subscription fee. So far, no luck, but
| watching this thread!
| thangalin wrote:
| RTranslator is quite close. It needs a TTS for the target
| language to be installed.
| billylo wrote:
| Author of 3PO here: check out our latest version 2.12. Many fixes
| have been incorporated in the past two weeks. Cheers.
| thangalin wrote:
| > Many fixes have been incorporated in the past two weeks.
|
| Thanks for the efforts! Still many fixes to go, though: I used
| the version from two days ago, which had numerous issues. Also,
| 3PO isn't offline, so I won't be pursuing it.
| billylo wrote:
| No problem! Some fixes were on the server side. We had a
| server side issue a couple days ago for a few hours. You may
| have been affected by it, giving you those errors. They have
| been fixed too. Take care!
| supermatt wrote:
| It works really well for me, but I wish it supported more
| languages for the input - I guess this is a limitation of the
| model you are using? Do you mind giving some info about the
| tech stack? I love how fast it is.
| billylo wrote:
| thanks for the kind words!! It is built on top of a fairly
| straightforward infrastructure. Client side is C#, .NET,
| MAUI. Server side is Firebase Cloud Functions/Firestore/Web
| Hosting + Azure.
| supermatt wrote:
| Thanks - I was wondering more about the translation
| pipeline - I am assuming something like whisper and m2m100?
| How did you manage to get the latency so low? Other
| examples I have seen feel really sluggish in comparison.
| billylo wrote:
| Fairly basic there... chunking audio over a websocket and
| sending the chunks for translation sequentially.
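| A stripped-down sketch of the idea in Python (not our actual
| code, and the endpoint URL is made up), using the websockets
| and sounddevice packages:
|
|     # pip install websockets sounddevice
|     import asyncio
|     import sounddevice as sd
|     import websockets
|
|     RATE = 16000        # 16 kHz mono PCM
|     CHUNK_SECONDS = 2   # short chunks keep perceived latency low
|
|     async def stream(url="wss://example.org/translate"):
|         async with websockets.connect(url) as ws:
|             while True:
|                 # record one chunk of microphone audio
|                 chunk = sd.rec(CHUNK_SECONDS * RATE,
|                                samplerate=RATE, channels=1,
|                                dtype="int16")
|                 sd.wait()
|                 await ws.send(chunk.tobytes())  # audio out
|                 print(await ws.recv())          # translation back
|
|     asyncio.run(stream())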
| yuryk wrote:
| moonshine?
| Fairburn wrote:
| https://news.ycombinator.com/item?id=41960085
| barrenko wrote:
| Expecting a boom in the speech-to-speech market in the following
| months. It's the next thing.
| jillesvangurp wrote:
| It's been the next thing ever since Bill Gates said so around
| the time of the Windows 95 launch.
|
| But it does feel like we are close to getting a proper
| babelfish type solution. The models are good enough now.
| Especially the bigger ones. It's all about UX and packaging it
| up now.
| v7n wrote:
| It's not exactly what OP wants out-of-the-box, but if anyone is
| considering building one, I suggest taking a look at this [1].
| It is really easy to tinker with and can run either on-device or
| in a client-server model. It has the required speech-to-text and
| text-to-speech endpoints, with multiple built-in options for
| each. If you can get the LLM assistant part of the pipeline to
| perform translation to a degree you're comfortable with, this
| could be a solution.
|
| [1] https://github.com/huggingface/speech-to-speech
| dmezzetti wrote:
| A similar option exists with txtai
| (https://github.com/neuml/txtai).
|
| https://neuml.hashnode.dev/speech-to-speech-rag
|
| https://www.youtube.com/watch?v=tH8QWwkVMKA
|
| One would just need to remove the RAG piece and use a
| Translation pipeline
| (https://neuml.github.io/txtai/pipeline/text/translation/).
| They'd also need to use a Korean TTS model.
|
| Both this and the Hugging Face speech-to-speech projects are
| Python though.
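| Roughly like this (an untested sketch; the TTS model named
| below is txtai's English default and only a placeholder for a
| Korean-capable ONNX model):
|
|     # pip install txtai[pipeline]
|     from txtai.pipeline import (Transcription, Translation,
|                                 TextToSpeech)
|
|     # Speech -> English text
|     transcribe = Transcription()
|     text = transcribe("english.wav")
|
|     # English text -> Korean text
|     translate = Translation()
|     korean = translate(text, "ko")
|
|     # Korean text -> audio (depending on the txtai version, the
|     # call returns the waveform or a (waveform, rate) tuple)
|     tts = TextToSpeech("NeuML/ljspeech-jets-onnx")
|     audio = tts(korean)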
| authorfly wrote:
| Your library is quite possibly the best example of effortful,
| understandable and useful work I have ever seen - principally
| evidenced by how you keep evolving with the times. I've seen
| you keep it up to date, and even on the cutting edge, for
| years, through multiple NLP mini-revolutions (sentence
| embeddings and their new uses) and what must have been the
| annoying release of LLMs, and still push on to keep the
| library explainable and useful.
|
| Code from txtai just feels like exactly the right way to
| express what I am usually trying to do in NLP.
|
| My highest commendations. If you ever have time, please share
| your experience and what led you to take this path with
| txtai. For example, I see you started in earnest around
| August 2020 (maybe before) - at that time, I would love to
| know, did you imagine LLMs becoming as prominent as they are
| now, and instruction-tuning working as well as it does? I
| know at that time many PhD students I knew in NLP (and profs)
| felt LLMs were far too unreliable and would not reach, e.g.,
| consistent scores on MMLU/HellaSwag.
| dmezzetti wrote:
| I really appreciate that! Thank you.
|
| It's been quite a ride from 2020. When I started txtai, the
| first use case was RAG in a way. Except instead of an LLM,
| it used an extractive QA model. But it was really the same
| idea, get a relevant context then find the useful
| information in it. LLMs just made it much more "creative".
|
| Right before ChatGPT, I was working on semantic graphs.
| ChatGPT took the wind out of those sails for a while,
| until GraphRAG came along. Adding the LLM framework into
| txtai during 2023 was definitely a detour.
|
| The next release will be a major release (8.0) with agent
| support (https://github.com/neuml/txtai/issues/804). I've
| been hesitant to buy into the "agentic" hype as it seems
| quite convoluted and complicated at this point. But I
| believe there are some wins available.
|
| In 2024, it's hard to get noticed. There are tons of RAG
| and Agent frameworks. Sometimes you see something trend and
| surge past txtai in terms of stars in a matter of days.
| txtai has 10% of the stars of LangChain but I feel it
| competes with it quite well.
|
| Nonetheless I keep chugging along because I believe in the
| project and that it can solve real-world use cases better
| than many other options.
| okwhateverdude wrote:
| I have a dozen or so tabs open at the moment to wrap my
| head around txtai and its very broad feature set. The
| plethora of examples is nice even if the python idioms
| are dense. The semantic graph bits are of keen interest
| for my use case, as are the pipelines and workflows. I
| really appreciate you continuing to hack on this.
| dmezzetti wrote:
| You got it. Hopefully the project continues its slow
| growth trajectory.
| _gmax0 wrote:
| Off topic, but what's the state of the art in speech recognition
| models at the moment?
|
| Are people still using DTW + HMMs?
| IshKebab wrote:
| HMMs haven't been state of the art in speech recognition for
| decades (i.e., since it actually got good). It's all end-to-end
| DNNs now. Basically raw input -> DNN -> ASCII.
|
| Well almost anyway - last I checked they feed a Mel spectrogram
| into the model rather than raw audio samples.
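| For the curious, that front end is only a few lines (a sketch
| with librosa; the parameters are roughly what Whisper-style
| models use):
|
|     # pip install librosa
|     import librosa
|
|     # Load audio as 16 kHz mono, the usual ASR input rate.
|     audio, sr = librosa.load("speech.wav", sr=16000, mono=True)
|
|     # Log-Mel spectrogram: this 2D array, not the raw samples,
|     # is what gets fed into the acoustic model.
|     mel = librosa.feature.melspectrogram(y=audio, sr=sr,
|                                          n_fft=400,
|                                          hop_length=160,
|                                          n_mels=80)
|     log_mel = librosa.power_to_db(mel)
|     print(log_mel.shape)  # (80 mel bins, n time frames)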
| pcwelder wrote:
| > state of the art in speech recognition for decades
|
| Decades doesn't sound right. Around 2019, the Jasper model
| was SOTA among e2e models but was still slightly behind a non
| e2e model with an HMM component
| https://arxiv.org/pdf/1904.03288
| autumnstwilight wrote:
| Is this possible to do smoothly with languages that have an
| extremely different grammar to English? If you need to wait until
| the end of the sentence to get the verb, for instance, then that
| could take more than five seconds, particularly if someone is
| speaking off the cuff with hesitations and pauses (Or you could
| translate clauses as they come in, but in some situations you'll
| end up with a garbled translation because the end of the sentence
| provides information that affects your earlier translation
| choices).
|
| AFAIK, humans who do simultaneous interpretation are provided
| with at least an outline, if not full script, of what the speaker
| intends to say, so they can predict what's coming next.
| thangalin wrote:
| > If you need to wait until the end of the sentence to get the
| verb, for instance, then that could take more than five seconds
|
| I meant a five-second delay after the speaker finishes talking
| or the user taps a button to start the translation, not
| necessarily a five-second rolling window.
| Gathering6678 wrote:
| > AFAIK, humans who do simultaneous interpretation are provided
| with at least an outline, if not full script
|
| They are usually provided with one, but it is by no means
| necessary. SI is never truly simultaneous and will have a
| delay, and the interpreter will also predict based on the
| context. That makes certain languages a bit more difficult to
| work with, e.g. Japanese, whose sentences I believe often
| place the predicate after the object, rather than the usual
| subject-predicate-object order, making the "prediction" part
| harder.
| jansan wrote:
| I just realized I will actually see a real Babelfish hitting the
| market in my lifetime. Amazing times indeed.
| billylo wrote:
| The tech is really here. This summer, I was fascinated by the
| accuracy of the spoken-language auto-detection capabilities. It
| really works, and it only needs 1-2 seconds to catch the nuance
| of a specific language.
|
| So, I ventured into building 3PO: https://3po.evergreen-labs.org
|
| Would love to hear everyone's feedback here.
| tkgally wrote:
| It's not free, but I've had some success using ChatGPT's Advanced
| Voice mode for sequential interpreting between English and
| Japanese. I found I had to first explain the situation to the
| model and tell it what I wanted it to do. For example: "I am
| going to have a conversation with my friend Taro. I speak
| English, and he speaks Japanese. Translate what I say into
| Japanese and what he says into English. Only translate what we
| say. Do not add any explanations or commentary."
|
| We had to be careful not to talk over each other or the model,
| and the interpreting didn't work well in a noisy environment. But
| once we got things set up and had practiced a bit, the
| conversations went smoothly. The accuracy of the translations was
| very good.
|
| Such interpreting should get even better once the models have
| live visual input so that they can "see" the speakers' gestures
| and facial expressions. Hosting on local devices, for less
| latency, will help as well.
|
| In business and government contexts, professional human
| interpreters are usually provided with background information in
| advance so that they understand what people are talking about and
| know how to translate specialized vocabulary. LLMs will need
| similar preparation for interpreting in serious contexts.
| bemmu wrote:
| I tried this a few times since my family speaks Finnish and my
| son speaks Japanese, but the issue is that it keeps forgetting
| the task.
|
| It'll work at first for a sentence or two, then the other party
| asks something and instead of translating the question, it will
| attempt to answer the question. Even if you remind it of its
| task it quickly forgets again.
| tkgally wrote:
| That happened once or twice for me, too. I wonder if an
| interpreting-specific system prompt would prevent that
| problem....
| Terr_ wrote:
| > * https://github.com/openai/whisper
|
| I would be very concerned about any LLM being used for
| "transcription", since it may inject things that nobody
| said, as in this recent item:
|
| https://news.ycombinator.com/item?id=41968191
| ben_w wrote:
| They list the error rate directly on the git repo; it was never
| _good_, even when it was _the best_.
|
| I saw mediocre results from the biggest model even when I gave
| it a video of Tom Scott speaking at the Royal Institution where
| I could be extremely confident about the quality of the
| recording.
| mmcwilliams wrote:
| WER is a decent metric for comparing models, but there's a
| difference between mistranscribing "effect" for "affect" and
| the kind of hallucinations Whisper has. I've run thousands of
| hours of audio through it for comparisons to other models, and
| the kind of thing you see Whisper inventing out of whole
| cloth is phrases like "please like and subscribe" during
| periods of silence. To me that suggested it was trained on a
| lot of YouTube.
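| (For anyone unfamiliar, WER itself is simple to compute; a
| quick sketch with the jiwer package, one common choice:)
|
|     # pip install jiwer
|     from jiwer import wer
|
|     reference = "thanks for watching"
|     hypothesis = "thanks for watching please like and subscribe"
|
|     # WER = (substitutions + deletions + insertions) divided by
|     # the number of reference words: here 4 / 3 ~= 1.33
|     print(wer(reference, hypothesis))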
| ben_w wrote:
| Interesting; that's certainly a bigger baseline than the
| one hour or so that I tried, which wasn't big enough to
| reveal that.
| thrdbndndn wrote:
| > free
|
| > offline
|
| > real-time
|
| > speech-to-speech translation app
|
| > on under-powered devices
|
| I genuinely don't think the technology is there.
|
| I can't even find a half-good real-time "speech to second
| language text" tool, not even with "paid/online/on powerful
| device" options.
| maeil wrote:
| > I genuinely don't think the technology is there.
|
| Definitely true for OP's case, especially for non-trivial
| language pairs. For the best case scenario, e.g.
| English<>German, we can probably get close.
|
| > I can't even find a half-good real-time "speech to second
| language text" tool, not even with "paid/online/on powerful
| device" options.
|
| As in "you speak and it streams the translated text"?
| translate.google.com with voice input and a more mobile-
| friendly UI?
| thrdbndndn wrote:
| The problem with Google is its translation quality. Not sure
| about Korean, but Japanese/English (either way) definitely
| isn't there.
|
| For Japanese to English, the transcription alone is already
| pretty inaccurate (usable if you know some Japanese; but then
| again you already know Japanese!)
| sebzim4500 wrote:
| Interesting, I was in Japan a few months ago and I found
| Google Translate to be pretty good. Even when hotels etc.
| provided information in English, I found it was better to
| use Google Lens on the Japanese information.
|
| I can't say much about the quality of English -> Japanese
| translation, except that people were generally able to
| understand whatever came out of it.
| thrdbndndn wrote:
| It's usable as a tool for quick communication or reading
| instructional text.
|
| But don't expect to be able to use it to read actual
| literature or, back to the topic, to subtitle a TV series
| or a YouTube video without misunderstanding.
| sva_ wrote:
| Samsung Interpreter might be the closest, but it is not free, nor
| does it work on low-power devices.
| alexisread wrote:
| This phone has been around for ages, and does the job. It's well
| weapon! https://www.neatoshop.com/product/The-Wasp-T12-Speechtool
| nacnud wrote:
| A friend recommends SayHi, which does near-realtime speech-to-
| speech translation
| (https://play.google.com/store/apps/details?id=com.sayhi.app&...).
| Unfortunately it's not offline though.
| itake wrote:
| Is that the same app? That seems like a social/dating app. This
| reddit thread suggests the SayHi app was discontinued
|
| https://www.reddit.com/r/language/comments/1elpv37/why_is_sa...
| bbstats wrote:
| doesn't google translate do this?
| itake wrote:
| maybe... Google doesn't support smaller languages for STT.
|
| My other gripe with these tools is if there is background
| noise, they are pretty useless. You can't use them in a crowded
| room.
| squarefoot wrote:
| In most noisy contexts, a throat microphone should work [1].
| A small Bluetooth one could also connect to small earpieces
| to make a wearable speech interface; the bigger battery could
| be concealed in the mic and last much longer than typical
| earbuds.
|
| [1]: https://en.wikipedia.org/wiki/Throat_microphone
| itake wrote:
| I usually need a translation app when I am traveling. I
| can't just ask a stranger to tape on a throat mic and wear
| my headphones to have a conversation with me, though.
| bool3max wrote:
| FREE and OFFLINE and OPEN SOURCE and REAL-TIME on UNDER-POWERED
| devices?
| turnsout wrote:
| This is why it would be tough to be an AI startup in 2024...
| totally unrealistic customer expectations
| egoisticalgoat wrote:
| Which usually stem from AI startups making wildly unrealistic
| promises; it's all a very unfun cat and mouse game.
| userbinator wrote:
| I think it's certainly possible if you compromise on accuracy.
| https://en.wikipedia.org/wiki/Dragon_NaturallySpeaking has been
| around since the late 90s, there are various (rather robotic-
| sounding) free speech synths available which don't require much
| processing power at all (look at the system requirements of
| https://en.wikipedia.org/wiki/Software_Automatic_Mouth ), and
| of course machine translation has been an active topic of
| research since the mid-20th century.
|
| IMHO it's unfortunate that everyone jumps to "use AI!" as the
| default now, when very competitive approaches that have been
| developed over the past few decades could provide decent
| results but at a fraction of the computing resources, i.e. a
| much higher efficiency.
| ccozan wrote:
| Yes, but why OFFLINE? Today's world is so super connected that
| I am wondering why this is a requirement.
| thangalin wrote:
| > why OFFLINE
|
| Why online? Why would I want some third-party to (a) listen
| to my conversations; (b) receive a copy of my voice that
| hackers could download; (c) analyze my private conversations
| for marketing purposes; (d) hobble my ability to translate
| when their system goes down, or permanently offline; or (e)
| require me to pay for a software service that's feasible to
| run locally on a smart phone?
|
| Why would I want to have my ability to translate tied to
| internet connectivity? Routers can fail. Rural areas can be
| spotty. Cell towers can be downed by hurricanes. Hikes can
| take people out of cell tower range. People are not always
| inside of a city.
| itake wrote:
| Hosting is annoyingly expensive. ping latency between us-
| east-1 and ap-southeast-1 is 230ms. So you either set up shop
| in one location or go multi-region (which adds up).
|
| Also, there are many environments (especially when you
| travel) where your phone is not readily connected.
| lma21 wrote:
| Real-time and under-powered, no way. All the available tools (and
| models) today require non-negligible hardware.
| NickC25 wrote:
| >Although real-time would be great, a max 5-second delay is
| manageable.
|
| Humans can't even do this in immediate real-time, so what makes
| you think a computer can? Some of the best real-time translators
| who work at the UN or for governments still have a short delay
| to be able to correctly interpret and translate for accuracy and
| context. Doing so in real-time actually impedes the translator
| from working correctly - especially in languages that have
| different grammatical structures. Even in languages that are
| effectively congruent (think Latin derivatives), this is hard,
| if not outright impossible, to do in real time.
|
| I worked in the field of language education and computer science.
| The tech you're hoping would be free and able to run on older
| devices is easily a decade away at the very best. As for it being
| offline, yeah, no. Not going to happen, because accurate real-
| time translation of even a database of the 20 most common
| languages on earth is probably a few terabytes at the very
| least.
| ladidahh wrote:
| Only seems to cover half of what you're asking for... Starred
| this the other day and haven't gotten around to trying it out:
|
| https://github.com/usefulsensors/moonshine
| sahbasanai wrote:
| It is impossible to accurately interpret with a max 5-second
| delay. The structure of some languages requires the interpreter
| to occasionally wait for the end of a statement before
| interpretation is possible.
| Fairburn wrote:
| 'Meanwhile, the poor Babel fish, by effectively removing all
| barriers to communication between different races and cultures,
| has caused more and bloodier wars than anything else in the
| history of creation.'
| EngineerDraft wrote:
| I've developed a macOS app, BeMyEars, which can do real-time
| speech-to-text translation. It first transcribes and then
| translates between languages. All of this works on-device. If
| you only want a smartphone app, you can also try YPlayer, which
| also works on-device. Both can be downloaded from the App Store.
___________________________________________________________________
(page generated 2024-10-29 23:00 UTC)