[HN Gopher] TTS: Text-to-Speech for All
       ___________________________________________________________________
        
       TTS: Text-to-Speech for All
        
       Author : doener
       Score  : 173 points
       Date   : 2021-04-13 11:49 UTC (1 day ago)
        
 (HTM) web link (github.com)
 (TXT) w3m dump (github.com)
        
       | mileycyrusXOXO wrote:
       | The example is very impressive! Sounds very natural.
        
       | banana_giraffe wrote:
       | It's a cool demo. Though, to my ears it's still a bit far from my
       | dream of having something cheap or free I can feed more niche
       | books I like and use to create an audiobook version of them.
       | 
       | https://vocaroo.com/1oOjiLNCagur
        
         | Causality1 wrote:
         | That's still my personal TTS dream as well. Google's Read Aloud
         | voice blows everything else out of the water but I've found by
         | experimentation that it will only read the first three and a
         | half hours of web page text.
        
       | [deleted]
        
       | Abishek_Muthian wrote:
       | TTS tech is so accessible that product developers should consider
       | integrating it within their product for the sake of the visually
       | impaired and not leave the product to the mercy of the operating
       | system's accessibility features.
       | 
       | I feel the biggest casualty of online advertisement is
       | accessibility. Those with eyes (eyesight) are more valuable to
       | the mega corps than those without, and so the Internet is full of
       | rich graphics, making the lives of those without proper vision
       | miserable.
        
         | suyash wrote:
         | Always be 100% compatible with OS provided accessibility
         | guidelines before adding additional support. Most disabled
         | users are familiar with native accessibility tools and that
         | should come first.
        
       | mnemotronic wrote:
       | The acronym would be more fun if the product was Text-InTo-
       | Speech. Yea.... -1
        
       | ancarda wrote:
       | Will there be a wide choice of accents? The link in the README
       | <https://erogol.github.io/ddc-samples/> seemed to only have a
       | single voice
        
         | erogol wrote:
         | Yep. The aim is to solve TTS for all languages one at a time.
         | 
         | You can check out the released models page for the other models
         | and languages.
         | 
         | https://github.com/coqui-ai/TTS/releases
        
           | ftyers wrote:
           | When are you going to do Chuvash ? ;)
        
       | sandreas wrote:
       | Maybe interesting:
       | 
       | https://colab.research.google.com/drive/1SPl226SwzrfMZltrVag...
       | 
       | https://github.com/keithito/tacotron
       | 
       | https://www.youtube.com/watch?v=ijhZR43TOwc
       | 
       | https://heartbeat.fritz.ai/a-2019-guide-to-speech-synthesis-...
        
       | gxqoz wrote:
       | Hrmm the link to an example from Pocket leads me to hope that
       | these are coming to that app. The current TTS for listening to
       | saved articles is decent but certainly not state of the art.
        
       | Raed667 wrote:
       | Is there a way to get this working in Firefox?
        
       | Isn0gud wrote:
       | It seems like this is another dead Mozilla project now, given
       | that the people who worked on this started a new project:
       | https://github.com/coqui-ai
        
         | echelon wrote:
         | I work on TTS (created https://vo.codes) and my impression of
         | the Mozilla project was that it was incredibly understaffed.
         | Unrealistically so to ever lead to any kind of product or
         | platform.
         | 
         | Maybe this new organization can accomplish the goal of easy and
         | open trainable TTS. I'd really like to see it.
        
         | kdavis wrote:
         | You can see some Coqui[0] TTS examples here[1].
         | 
         | [0] https://coqui.ai/
         | 
         | [1] https://erogol.github.io/ddc-samples/
        
       | erogol wrote:
       | Check out Coqui TTS where we continue the work.
       | 
       | https://github.com/coqui-ai/TTS
       | 
       | Mozilla TTS is not maintained anymore (at least ATM).
       | 
       | Disclaimer: I've created both of the projects.
        
       | adkadskhj wrote:
       | In an example[1], it sounds decent but i noticed a fuzzy white
       | noise whenever the voice is talking. Is this the algorithm, or
       | compression? If it's the algorithm, why?
       | 
       | [1]: https://soundcloud.com/user-565970875/pocket-article-wavernn...
        
         | throwawaysea wrote:
         | I actually don't hear the fuzzy white noise, but maybe it's
         | because of my tinnitus. Is it during a certain part of the
         | recording? To my ears this sounds surprisingly high fidelity
         | and natural sounding.
        
           | adkadskhj wrote:
           | It's only while the .. "person" talks, which makes it
           | quite noticeable to me because it starts and stops. It is
           | rather faint, so i might not even notice it if it was
           | consistent.
        
         | erogol wrote:
         | It mainly reflects the quality of the trained dataset, the
         | earlier stages of the project and some experiments.
         | 
         | I suggest you check the latest uploads on SoundCloud.
        
         | xcodevn wrote:
         | This is a well-known problem. The noise is due to mu-law
         | compression. The 16-bit audio samples are compressed to 8, 9,
         | or 10 bits before being fed to the neural net. The reason is
         | that predicting a categorical distribution over 2^16 values
         | requires too many parameters. The noise was also in samples
         | from the famous WaveNet from DeepMind (they used 8-bit mu-law).
         | 
         | There are two ways to avoid this: 1. predict the 8 high (coarse)
         | bits and the 8 low (fine) bits separately, as in the original
         | WaveRNN paper. 2. use a mixture of logistic distributions as the
         | predictive output, as in the recent Lyra vocoder from Google.
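
A minimal sketch of the mu-law round trip described in the comment above,
assuming float samples in [-1, 1] and mu = 2^bits - 1 (a common convention,
not confirmed by the thread). The function names are hypothetical. It
compands a sine wave to 8, 9, and 10 bits, expands it again, and reports
the signal-to-noise ratio of the result, which is one way to see where the
audible hiss comes from.

```python
import math


def mu_law_encode(x, bits=8):
    """Compand one float sample in [-1, 1] to an integer class in 0..2**bits - 1."""
    mu = 2 ** bits - 1  # assumption: mu tied to the number of classes
    x = max(-1.0, min(1.0, x))
    y = math.copysign(math.log1p(mu * abs(x)) / math.log1p(mu), x)
    return int(round((y + 1.0) / 2.0 * mu))


def mu_law_decode(q, bits=8):
    """Expand an integer class back to an approximate float sample."""
    mu = 2 ** bits - 1
    y = 2.0 * q / mu - 1.0
    return math.copysign(math.expm1(abs(y) * math.log1p(mu)) / mu, y)


def roundtrip_snr_db(samples, bits):
    """Signal-to-noise ratio (dB) after encoding and decoding every sample."""
    signal_power = sum(s * s for s in samples)
    noise_power = sum(
        (s - mu_law_decode(mu_law_encode(s, bits), bits)) ** 2 for s in samples
    )
    return 10.0 * math.log10(signal_power / noise_power)


# Hypothetical test signal: one second of a 440 Hz sine at 16 kHz, amplitude 0.5.
SAMPLE_RATE = 16000
sine = [0.5 * math.sin(2.0 * math.pi * 440.0 * n / SAMPLE_RATE)
        for n in range(SAMPLE_RATE)]

if __name__ == "__main__":
    for bits in (8, 9, 10):
        print(f"{bits} bits: SNR = {roundtrip_snr_db(sine, bits):.1f} dB")
```

The quantization error exists only where there is signal to quantize,
which is consistent with the noise starting and stopping with the voice.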
        
           | Tade0 wrote:
           | How does the number of parameters scale with resolution?
           | 
           | Specifically, how much slower would this be if the audio was,
           | say, 10 bits?
           | 
           | I recall a lab exercise in college where we were supposed to
           | increase the resolution of a quantizer until we reached a
           | decent tone and 10 bits were the point at which we reached
           | satisfying quality.
        
             | xcodevn wrote:
             | It is a single matrix multiplication to predict
             | probabilities of all possible outputs. For example, with a
             | hidden state of 1024 dimensions, and 8 bits output, it is
             | 1024x256 parameters. 10 bits will need 1024x1024 params.
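
The arithmetic above can be sketched as follows, assuming a single dense
output layer with no bias term. The function name and the 16-bit and
coarse/fine comparisons are illustrative additions, and the real WaveRNN
layout is more involved than two plain 8-bit heads.

```python
def output_layer_params(hidden_dim, bits):
    """Weights in one hidden_dim x 2**bits projection (bias ignored)."""
    return hidden_dim * 2 ** bits


HIDDEN_DIM = 1024  # hidden state size used in the comment above

if __name__ == "__main__":
    for bits in (8, 9, 10, 16):
        print(f"{bits:2d} bits: {output_layer_params(HIDDEN_DIM, bits):,} weights")

    # Rough illustration of the coarse/fine idea: two 8-bit heads
    # instead of one 16-bit head.
    split = 2 * output_layer_params(HIDDEN_DIM, 8)
    print(f"two 8-bit heads: {split:,} weights")
```

Each extra bit doubles the size of the output layer, which is why a direct
16-bit softmax is so much more expensive than an 8-bit or 10-bit one.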
        
         | eddyg wrote:
         | I hear it as well, even when using the speaker on my phone and
         | not headphones (where it seems like it would be even more
         | noticeable).
        
       | marcodiego wrote:
       | FLOSS TTS and STT are badly needed right now. Being able to use
       | voice recognition and speech synthesis should not be restricted
       | to a small oligopoly.
        
         | synesthesiam wrote:
         | Shameless plug for Rhasspy:
         | https://rhasspy.readthedocs.io/en/latest/
        
       | monkeydust wrote:
       | One of my products involves providing a lot of dense data to
       | traders overlaid with performance measures based on proprietary
       | models.
       | 
       | We are working on automatically extracting some insights for the
       | user and using NLP to present them like news articles.
       | 
       | It wouldn't take a huge lift from that to use TTS to provide
       | another way for the user to digest the data.
       | 
       | Would make for a cool demo but wonder how sticky it would be.
        
       | cromwellian wrote:
       | NVidia pimped this at GTC21 as "state of the art TTS" which is
       | why I think it's getting renewed attention, but to my ears, it
       | doesn't sound anywhere near WaveNet (Google), Siri, or Alexa.
        
         | swiley wrote:
         | I'm personally very suspicious of any software coming from
         | NVidia at this point.
        
         | [deleted]
        
       | uniqueid wrote:
       | What's the plan with this? Is it to incorporate it into Firefox
       | to improve its Web Speech API implementation?
        
         | hjek wrote:
         | I hope so. The examples sound so much better than Espeak.
         | 
         | Edit: Oh, I see this project _uses_ Espeak. Interesting.
        
           | [deleted]
        
       | synesthesiam wrote:
       | Larynx TTS has a similar goal: https://rhasspy.github.io/larynx/
       | 
       | It was originally based on Mozilla TTS, but I've since moved to
       | exporting models to ONNX for speed.
        
       ___________________________________________________________________
       (page generated 2021-04-14 23:00 UTC)