[HN Gopher] Kokoro WebGPU: Real-time text-to-speech 100% locally...
___________________________________________________________________
Kokoro WebGPU: Real-time text-to-speech 100% locally in the browser
Author : xenova
Score : 163 points
Date : 2025-02-07 15:30 UTC (7 hours ago)
(HTM) web link (huggingface.co)
(TXT) w3m dump (huggingface.co)
| xenova wrote:
| It took some time, but we finally got Kokoro TTS (v1.0) running
| in-browser w/ WebGPU acceleration! This enables real-time text-
| to-speech without the need for a server. Looking forward to your
| feedback!
| deivid wrote:
| Amazing! I'm interested in models running locally and Kokoro
| seems amazing. Are you aware of similar models but for Speech
| to text?
| Ono-Sendai wrote:
| whisper
| xenova wrote:
| We have released a bunch of speech recognition demos (using
| whisper, moonshine, and others). For example:
|
| - https://huggingface.co/spaces/Xenova/whisper-web
|
| - https://huggingface.co/spaces/Xenova/whisper-webgpu
|
| - https://huggingface.co/spaces/Xenova/realtime-whisper-webgpu
|
| - https://huggingface.co/spaces/webml-community/moonshine-web
| amelius wrote:
| Now that's what I call "server-less" computing!
| sebastiennight wrote:
| This is brilliant. All we need now is for someone to code a
| frontend for it so we can input an article's URL and have this
| voice read it out loud... built-in local voices on macOS are
| not even close to this Kokoro model.
| waynenilsen wrote:
| Incredible work! I have listened to several TTS systems, and to
| have this be free and fully under the customer's control is
| absolutely incredible. This will unlock new use cases.
|
| I made https://app.readaloudto.me/ as a hobby thing and now it
| could be enhanced with a local tts option!
| rado wrote:
| Crashes the iPad Safari tab
| zamadatix wrote:
| Mobile Safari (includes iPad) does not like to dish out large
| amounts of memory.
| dindresto wrote:
| Same on macOS Safari (Sequoia, Safari 18.3, M3 Pro, 18 GB RAM)
| oliwary wrote:
| Worked on my Pixel 6a, albeit quite slowly (~30s for 4s audio).
| Still really impressed.
| darkwater wrote:
| Yep, same here. Pixel 6a and Firefox, it takes a while but it
| sounds pretty good
| reach-vb wrote:
| Brilliant job! Love how fast it is, I'm sure if the rapid pace of
| speech ML continues we'll have Speech to Speech models directly
| running in our browser!
| dust42 wrote:
| It's already there: Hibiki by Kyutai.org was released yesterday
| with speech-to-speech, French to English, on iPhone:
|
| https://x.com/neilzegh/status/1887498102455869775
|
| https://github.com/kyutai-labs/hibiki
| fallinditch wrote:
| Brave browser and Samsung Galaxy S22 ultra - gives horrible
| screeching noises
| Guillaume86 wrote:
| Same with Chrome on Zenfone 8
| shaneofalltrad wrote:
| Same in Chrome on Intel macOS.
| magicalhippo wrote:
| Firefox on Samsung S21, worked fine albeit slow, around 20-25s
| for the demo text.
|
| Quality sounded good compared to a lot of other small TTS
| models I've tried.
| nnadams wrote:
| Yeah this only worked with Firefox on my phone. All other
| browsers generated a screechy noise instead.
| djeastm wrote:
| Fyi I tried this on my Galaxy S21 with both Brave and Chrome
| browsers and just got screeching noises in the audio
| mewse-hn wrote:
| the mere idea of voice software's error mode being
| uncontrollable screeching is the most hilarious thing to me
| Asmod4n wrote:
| Sounds horrible in Chrome with an AMD GPU, why is that?
| mdaniel wrote:
| Are you somehow implying that everyone in the AI arms race
| believes that only CUDA exists?! /s
|
| But, in a more serious tone: the story that I hear about AMD
| GPUs is that they are, in fact, shittier because AMD themselves
| give fewer shits. GIGO
| CyberDildonics wrote:
| What is this comment saying? You think the results are
| different just because of AMD hardware? If there is a
| difference it would be a software bug.
| bentt wrote:
| Sounded perfect for me. Brave/Win11/3090
| SubiculumCode wrote:
| Kokoro gives pretty good voices and is quite light...making it
| useful despite its lack of voice cloning capability. However, I
| haven't figured out how to run it in the context of a tts server
| without homebrewing the server...which maybe is easy? IDK.
| phildougherty wrote:
| https://github.com/remsky/Kokoro-FastAPI
| yawnxyz wrote:
| holy cow, how did they get the OpenAI voices like Alloy and Echo,
| generated in-browser and sounding 99% the same?
|
| this is astounding
| butz wrote:
| Generating audio takes a bit, but wow, 92MB model for really
| decent sounding speech. Is there a way to plug this thing into
| speech dispatcher on Linux and use for accessibility?
| realsid wrote:
| Amazing! This is my first time witnessing a model of such
| prowess run in browser. Curious about quantization and WebML
| moralestapia wrote:
| This is great but far from real-time.
|
| (I get the joke that for some definition of real-time this is
| real-time).
|
| The reason why I use an API is because time to first byte is the
| most important metric in the apps I'm working on.
|
| That aside, kudos for the great work and I'm sure one day the
| latency on this will be super low as well.
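
One way to make the "real-time" question concrete is the real-time factor (RTF): synthesis time divided by the duration of the audio produced. The sketch below is only an illustration, using the rough figure reported earlier in the thread (about 30 s to generate about 4 s of audio on a Pixel 6a). The function name `real_time_factor` is hypothetical, and treating RTF below 1 as "keeps up with playback" is a common convention, not something stated by the model's authors.

```python
def real_time_factor(synthesis_seconds: float, audio_seconds: float) -> float:
    """Return synthesis time divided by audio duration (lower is faster)."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return synthesis_seconds / audio_seconds


# Rough figure reported in the thread: ~30 s to synthesize ~4 s of audio.
pixel_6a_rtf = real_time_factor(30.0, 4.0)
print(f"Pixel 6a RTF: {pixel_6a_rtf:.1f}")

# Assumed convention: RTF below 1.0 means synthesis outpaces playback.
print("keeps up with playback:", pixel_6a_rtf < 1.0)
```

RTF measures throughput only. It says nothing about time to first byte, which also depends on whether audio is delivered in chunks as it is generated.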
___________________________________________________________________
(page generated 2025-02-07 23:01 UTC)