[HN Gopher] Voder Speech Synthesizer
       ___________________________________________________________________
        
       Voder Speech Synthesizer
        
       Author : CyborgCabbage
       Score  : 217 points
       Date   : 2023-07-18 12:29 UTC (10 hours ago)
        
 (HTM) web link (griffin.moe)
 (TXT) w3m dump (griffin.moe)
        
       | mwcampbell wrote:
       | I first heard the Voder as the first sample on the Klatt Record
       | [1]. Unfortunately, there it's credited solely to Homer Dudley;
       | neither Bell Telephone Laboratories nor women like Helen Harper who
       | operated the machine were mentioned.
       | 
       | [1]: http://www.festvox.org/history/klatt.html
        
         | henjodottech wrote:
         | Odd indeed not to mention the artist playing the sample. Also
         | odd that OP article about the instrument does not mention the
         | inventor.
        
       | chaosprint wrote:
       | Interesting! thanks for sharing.
       | 
       | I have added this to my feature list for https://glicol.org
       | 
       | the source code looks fairly straightforward. very cool
       | 
       | ```js
       | function makeFormantNode(ctx, f1, f2) {
       |   const sinOsc = ctx.createOscillator();
       |   sinOsc.type = 'sawtooth';
       |   sinOsc.frequency.value = 110;
       |   sinOsc.start();
       | 
       |   const bandPass = ctx.createBiquadFilter();
       |   bandPass.type = 'bandpass';
       |   bandPass.frequency.value = (f1 + f2) / 2;
       |   bandPass.Q.value = ((f1 + f2) / 2) / (f2 - f1);
       | 
       |   const gainNode = ctx.createGain();
       |   gainNode.gain.value = 0.0;
       | 
       |   sinOsc.connect(bandPass);
       |   bandPass.connect(gainNode);
       |   gainNode.connect(ctx.destination);
       | 
       |   return {
       |     start() {
       |       gainNode.gain.setTargetAtTime(0.75, ctx.currentTime, 0.015);
       |     },
       |     stop() {
       |       gainNode.gain.setTargetAtTime(0.0, ctx.currentTime, 0.015);
       |     },
       |     panic() {
       |       gainNode.gain.cancelScheduledValues(ctx.currentTime);
       |       gainNode.gain.setTargetAtTime(0, ctx.currentTime, 0.015);
       |     },
       |   };
       | }
       | 
       | function makeSibilanceNode(ctx) {
       |   const buffer = ctx.createBuffer(1, NOISE_BUFFER_SIZE, ctx.sampleRate);
       |   const data = buffer.getChannelData(0);
       |   for (let i = 0; i < NOISE_BUFFER_SIZE; ++i) {
       |     data[i] = Math.random();
       |   }
       | 
       |   const noise = ctx.createBufferSource();
       |   noise.buffer = buffer;
       |   noise.loop = true;
       | 
       |   const noiseFilter = ctx.createBiquadFilter();
       |   noiseFilter.type = 'bandpass';
       |   noiseFilter.frequency.value = 5000;
       |   noiseFilter.Q.value = 0.5;
       | 
       |   const noiseGain = ctx.createGain();
       |   noiseGain.gain.value = 0.0;
       | 
       |   noise.connect(noiseFilter);
       |   noiseFilter.connect(noiseGain);
       |   noiseGain.connect(ctx.destination);
       |   noise.start();
       | 
       |   return {
       |     start() {
       |       noiseGain.gain.setTargetAtTime(0.75, ctx.currentTime, 0.015);
       |     },
       |     stop() {
       |       noiseGain.gain.setTargetAtTime(0.0, ctx.currentTime, 0.015);
       |     },
       |     panic() {
       |       noiseGain.gain.cancelScheduledValues(ctx.currentTime);
       |       noiseGain.gain.setTargetAtTime(0, ctx.currentTime, 0.015);
       |     },
       |   };
       | }
       | 
       | function initialize() {
       |   audioCtx = new (window.AudioContext || window.webkitAudioContext)();
       |   audioNodes['a'] = makeFormantNode(audioCtx, 0, 225);
       |   audioNodes['s'] = makeFormantNode(audioCtx, 225, 450);
       |   audioNodes['d'] = makeFormantNode(audioCtx, 450, 700);
       |   audioNodes['f'] = makeFormantNode(audioCtx, 700, 1000);
       |   audioNodes['v'] = makeFormantNode(audioCtx, 1000, 1400);
       |   audioNodes['b'] = makeFormantNode(audioCtx, 1400, 2000);
       |   audioNodes['h'] = makeFormantNode(audioCtx, 2000, 2700);
       |   audioNodes['j'] = makeFormantNode(audioCtx, 2700, 3800);
       |   audioNodes['k'] = makeFormantNode(audioCtx, 3800, 5400);
       |   audioNodes['l'] = makeFormantNode(audioCtx, 5400, 7500);
       |   audioNodes[' '] = makeSibilanceNode(audioCtx);
       | }
       | ```
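
A small sketch of the filter arithmetic in the snippet above, for anyone who wants to see the numbers without opening a browser. It only repeats the center-frequency and Q calculation from makeFormantNode, using the band edges from initialize(). The helper name bandToFilterParams and the BANDS table layout are made up for illustration and are not part of the original page.

```javascript
// Band edges in Hz, copied from the initialize() call quoted above.
const BANDS = [
  ['a', 0, 225], ['s', 225, 450], ['d', 450, 700], ['f', 700, 1000],
  ['v', 1000, 1400], ['b', 1400, 2000], ['h', 2000, 2700],
  ['j', 2700, 3800], ['k', 3800, 5400], ['l', 5400, 7500],
];

// Hypothetical helper: mirrors the arithmetic in makeFormantNode.
// The center is the midpoint of the band, and Q is the center
// frequency divided by the bandwidth.
function bandToFilterParams(lowHz, highHz) {
  const centerHz = (lowHz + highHz) / 2;
  const bandwidthHz = highHz - lowHz;
  return { centerHz, q: centerHz / bandwidthHz };
}

// Print one line per key so the ten filters can be compared.
for (const [key, lowHz, highHz] of BANDS) {
  const { centerHz, q } = bandToFilterParams(lowHz, highHz);
  console.log(`${key}: center ${centerHz} Hz, Q ${q.toFixed(2)}`);
}
```

With these edges the lowest band comes out with a Q of 0.5 and the others land at 1.5 or above, so every key is a fairly broad filter rather than a sharp resonance.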
        
         | hahamrfunnyguy wrote:
         | Just checked out GLICOL. It's quite cool! Is there MIDI support
         | or any plans to add it?
        
       | jcpst wrote:
       | I have wanted to try one of these - the playable soft-synth is
       | great
        
       | kypro wrote:
       | This is quite off topic, but it reminded me of something I have
       | been thinking about recently - perhaps at the limit all highly
       | capable narrow AI systems must become generally intelligent.
       | 
       | I was thinking about the complexity of expression in TTS voice
       | synthesizers recently and it struck me just how difficult a
       | problem that is.
       | 
       | To be as expressive as a human the AI model would need to fully
       | "understand" the context of what is being said. Consider how a
       | phrase like "I hate you" can be said in a loving way between
       | friends sharing a joke at each other's expense, vs being said with
       | anger or in sadness.
       | 
       | It got me wondering if all sufficiently complex problems require
       | models to be generally intelligent - at least in the sense that
       | they have deep, nuanced models of the world.
       | 
       | For example, perhaps for a self-driving car to be as "good" as a
       | human it actually needs to be generally intelligent in that it needs
       | to understand that it's appropriate to drive differently if it is
       | in an emergency situation vs a leisurely weekend drive through a
       | scenic part of town. When driving through my city after 8PM on
       | the weekend I tend to drive slower and more cautiously because I
       | know drunk people often walk out in front of my car - would a
       | good self-driving car not need to understand these nuances of the
       | world too?
       | 
       | This is interesting because it highlights just how important
       | human understanding is to accurately conveying expression in a
       | voice synthesizer. While I'd argue modern voice synthesizers
       | have been more intelligible than this for some time, the
       | expressiveness of this machine has probably only recently been
       | rivalled by state-of-the-art AI models.
        
         | mietek wrote:
         | Yes. In Iain M. Banks's Culture, even the guns are generally
         | intelligent.
        
         | progbits wrote:
         | Probably to some degree, but for your two examples I would
         | argue that isn't necessary:
         | 
         | For TTS, the "tone" is something you should encode in the input
         | rather than have TTS figure out. I can imagine ebook > LLM >
         | annotated text with speakers, emotions etc > TTS. So the TTS
         | can remain rather dumb.
         | 
         | For the self-driving car, it shouldn't know cultural norms and
         | be "more careful" sometimes. It should always know how much it
         | sees and what stopping distance it can get with max braking and
         | its reaction time and adjust accordingly.
         | 
         | Agreed on stuff like emergencies etc.
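
The stopping-distance rule described above can be written down as simple kinematics. This is a minimal sketch, assuming constant deceleration and a fixed reaction delay; the function names and the example figures (10 m/s, 0.5 s, 8 m/s^2) are illustrative assumptions, not values from any real vehicle or driving system.

```javascript
// Distance covered before the car is stationary: distance travelled
// during the reaction delay plus distance travelled while braking at
// constant deceleration (v^2 / 2a).
function stoppingDistance(speedMps, reactionTimeS, maxDecelMps2) {
  const reactionDistance = speedMps * reactionTimeS;
  const brakingDistance = (speedMps ** 2) / (2 * maxDecelMps2);
  return reactionDistance + brakingDistance;
}

// The inverse question: given how far the car can see, what is the
// fastest speed from which it can still stop in time?
// Solves v*t + v^2/(2a) = d for the positive root v.
function maxSafeSpeed(visibleDistanceM, reactionTimeS, maxDecelMps2) {
  const t = reactionTimeS;
  const a = maxDecelMps2;
  return a * (-t + Math.sqrt(t * t + (2 * visibleDistanceM) / a));
}

// Illustrative numbers only: 10 m/s, 0.5 s reaction, 8 m/s^2 braking.
console.log(stoppingDistance(10, 0.5, 8)); // 11.25 (metres)
console.log(maxSafeSpeed(11.25, 0.5, 8));  // 10 (m/s)
```

Note that nothing in this calculation knows whether the obstacle is a child on a bike or a rider in lycra; it only bounds speed by what is visible, which is the point being debated in the replies.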
        
           | kypro wrote:
           | > For the self-driving car, it shouldn't know cultural norms
           | and be "more careful" sometimes. It should always know how
           | much it sees and what stopping distance it can get with max
           | braking and its reaction time and adjust accordingly.
           | 
           | I used to live next to two schools. In the morning before
           | school the pavement and road outside my house was always full
           | of school kids on bikes. During this time I'd drive with the
           | assumption that at any moment a bike could drive out in front
           | of my car because those kids were nuts and often did.
           | 
           | But to assume this generally just to be safe would be
           | extremely inconvenient. In reality if I see a group of bikers
           | wearing lycra I will assume they're competent bikers. While
           | I'll still drive carefully, I won't assume they're about to
           | pull out in front of my car.
           | 
           | If self driving cars operate with the assumption that every
           | pedestrian is drunk and every bike on the road is ridden by a
           | 12-year-old schoolboy then no one will use them. Do self driving
           | cars try to do this currently? If I jaywalked in front of a
           | Tesla, is it designed to always be able to stop in time?
        
             | marcosdumay wrote:
             | I'd expect self driving cars to have much better sensors
             | and reaction times than we do, and as a consequence not
             | needing to choose between those risks and actually carrying
             | people from one point to another.
             | 
             | But they will probably be way slower than people on streets
             | that are just at the side of sidewalks and full of
             | pedestrians.
        
               | Gordonjcp wrote:
               | > I'd expect self driving cars to have much better
               | sensors and reaction times than we do
               | 
               | That is never going to happen.
        
         | BizarroLand wrote:
         | I think our current gen AI is only 1 piece of the puzzle.
         | 
         | This gen understands how to put words together to satisfy its
         | internal requirement to please the instruction it is given, but
         | it has no volition of its own and no drive it arrived at of its
         | own cognition.
         | 
         | I believe GAI will need to have multiple current gen systems
         | running simultaneously, (in unison if not in harmony) simply to
         | form a subconscious layer that a truly next gen AI would then
         | pick and choose from.
        
       | colanderman wrote:
       | The recently released Soma Terra synthesizer contains a key-per-
       | formant synthesis mode which operates like the Voder:
       | https://somasynths.com/terra/ (Ctrl+F "Voder" in the manual)
        
       | bsza wrote:
       | Wolfgang von Kempelen (creator of the fake chess automaton known
       | as the Turk) made a similar thing in the 18th century. [0] It had
       | multiple reeds tuned to the same frequency - conceptually similar
       | to the Voder. It might not be coincidence that Bell Labs
       | developed this, given that Bell himself had also made attempts to
       | improve the design, which is how he ended up inventing the
       | telephone.
       | 
       | [0]
       | https://en.wikipedia.org/wiki/Wolfgang_von_Kempelen%27s_spea...
        
       | bradrn wrote:
       | A short explanation as to how this works:
       | 
       | The voice can be modeled using two main components. The vocal
       | cords are a periodic source of sound, which is then filtered by
       | the mouth and tongue to produce vowel sounds [0]. The filter can
       | be modeled as a set of band-pass filters, each of which lets
       | through a specific band of frequencies -- these are called
       | 'formants' in acoustic phonetics. Different vowel sounds are
       | produced by combining formants at different pitches in a
       | systematic way [1]. You can hear this yourself by very slowly
       | moving your mouth from saying an 'eeeee' sound to an 'ooooo'
       | sound: if you listen carefully, you can hear one formant changing
       | pitch while the others stay the same. (I like [2] as an intro to
       | this kind of stuff.)
       | 
       | The 'voder' works by having one key for each possible frequency
       | band-pass filter. Pressing multiple keys adds the resulting
       | sounds, producing an output sound with distinct formants. If you
       | use the right formants, the resulting sound is very similar to
       | that produced by a human mouth saying a specific vowel! Software
       | such as the vowel editor in Praat [3] takes it further, by
       | allowing selection of formants from a standard vowel chart.
       | 
       | [0] Consonantal sounds are a bit more complicated, since they
       | tend to involve various different noise sources and transient
       | disturbances of the sound. For instance, /S/ (the 'sh' sound) is
       | noise of a lower frequency than /s/. I can't work out how Harper
       | produced the difference between those two sounds in the video --
       | it seems to be impossible to do this with the live demo. In fact,
       | any sort of pitch control seems to be impossible in the demo.
       | 
       | [1] This is how overtone singing and throat singing work!
       | Selectively amplifying one formant gives the impression that
       | you're singing that note at the same time as the 'base' pitch. In
       | fact, if you do that, your vocal cords are producing a pitch plus
       | all its overtones, while your mouth is enhancing one overtone
       | while filtering out all the others.
       | 
       | [2] https://newt.phys.unsw.edu.au/jw/voice.html
       | 
       | [3] https://www.fon.hum.uva.nl/praat/ -- probably also available
       | from your favourite Linux distro!
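
To make the source-plus-filter picture above concrete, here is a rough sketch that lists which harmonics of the buzz each key would let through. It assumes a 110 Hz sawtooth source and the band edges from the demo code quoted earlier in the thread, and it treats each band-pass filter as an ideal brick wall, which real filters are not. The names harmonicsInBand and passedFrequencies are hypothetical.

```javascript
// Assumption: a 110 Hz sawtooth buzz, matching the oscillator in the
// demo code quoted earlier in the thread.
const BUZZ_HZ = 110;

// Band edges in Hz for the first four keys, from the same code.
const KEY_BANDS = {
  a: [0, 225],
  s: [225, 450],
  d: [450, 700],
  f: [700, 1000],
};

// Harmonic numbers n whose frequency n * f0 lies in [lowHz, highHz).
function harmonicsInBand(f0Hz, lowHz, highHz) {
  const harmonics = [];
  for (let n = 1; n * f0Hz < highHz; n++) {
    if (n * f0Hz >= lowHz) harmonics.push(n);
  }
  return harmonics;
}

// Frequencies that survive when several keys are held at once,
// under the idealized brick-wall assumption.
function passedFrequencies(keys, f0Hz = BUZZ_HZ) {
  const freqs = new Set();
  for (const key of keys) {
    const [lowHz, highHz] = KEY_BANDS[key];
    for (const n of harmonicsInBand(f0Hz, lowHz, highHz)) {
      freqs.add(n * f0Hz);
    }
  }
  return [...freqs].sort((x, y) => x - y);
}

console.log(passedFrequencies(['a', 'd'])); // [ 110, 220, 550, 660 ]
```

Holding 'a' and 'd' together keeps two clusters of harmonics with a gap between them, which is roughly what two separate formants look like in a spectrum.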
        
         | stavros wrote:
         | Apparently the Voder had a pitch pedal:
         | 
         | https://imgz.org/i9TzhzWu/
        
           | bradrn wrote:
           | Ah, that would explain it. Thanks for finding that image!
        
             | stavros wrote:
             | No problem! It's from a video linked in a thread below (the
             | extended World's Fair presentation).
        
         | luguenth wrote:
         | There's also a very nice simulation, where you can play with
         | the very different parts of vocal cords:
         | 
         | https://imaginary.github.io/pink-trombone/
        
           | jmiskovic wrote:
           | I made a fork with a few more features, it might even work on
           | your phone browser:
           | 
           | https://jmiskovic.github.io/voicebox
        
             | civilitty wrote:
             | Thank you for this. I had a lot of fun scaring my cat in
             | bed and it inspired me to become a late middle aged opera
             | savant.
        
               | tomcam wrote:
               | I actually am a late middle-aged opera savant but sadly I
               | have no cat to scare
        
       | 0xFFOOFF wrote:
       | [dead]
        
       | JKCalhoun wrote:
       | I was skeptical that you could even _type_ an intelligible
       | phonetic "She saw me" with only two phonemes let alone give it
       | the rise and fall demonstrated.
       | 
       | I've played with the SP0256 speech synthesis IC and found
       | constructing intelligible words challenging even with all the
       | phonemes available on that silicon.
       | 
       | This extended video has me thinking it probably was legit though:
       | 
       | https://youtu.be/TsdOej_nC1M
        
       | slmnsmk wrote:
       | Re: the author of that -
       | 
       | Hey I know the person who made this!
       | 
       | Thanks for sharing, it really was a labor of love. I remember
       | Griffin being super excited about how it turned out. They are
       | really passionate about the World's Fair!
        
       | jvm___ wrote:
       | Someone was selling a vocoder on eBay, so they made a video of
       | the vocoder describing its own selling features.
       | 
       | https://www.youtube.com/watch?v=5kc-bhOOLxE
        
         | beingforthebene wrote:
         | I may be wrong, but a vocoder is entirely different than a
         | Voder.
        
       | lacrimacida wrote:
       | The intonation is very good in a way that modern speech
       | synthesizers don't get quite right.
        
         | p-e-w wrote:
         | What do you mean? Text-to-speech systems from the past few
         | years are indistinguishable from an actual human voice.
        
           | lacrimacida wrote:
           | I agree that they got pretty good but there's still something
           | that they get wrong: their intonation is a kind of passable
           | average. If you want to be able to distinguish them from
           | actual human speech, pay close attention to
           | intonation/inflection. They're still very usable, I'm not
           | claiming otherwise.
        
             | stavros wrote:
             | I can't hear much difference in the Studio voice:
             | 
             | https://cloud.google.com/text-to-speech/docs/wavenet
             | 
             | I'm fairly sure I couldn't tell Studio voices and real
             | people apart in a blind test.
        
               | id0ntw4ntit wrote:
               | It would be a good test. I don't think they are yet
               | indistinguishable, for what it's worth.
        
       | JoeDaDude wrote:
       | I've been interested in how these were actually played. If anyone
       | has access to the material used to train the operators, I'd love
       | to hear about it.
       | 
       | BTW, there was one fellow who built one, something I'd like to
       | try someday. See his recreation here:
       | 
       | https://www.youtube.com/watch?v=gv9m0Z7mhXY
        
       | joezydeco wrote:
       | The Voder was part of a much larger Bell Labs project, one that
       | eventually developed into one of the first unbreakable encrypted
       | telephony systems used in World War II.
       | 
       | https://99percentinvisible.org/episode/vox-ex-machina/
        
       | zzzeek wrote:
       | only a woman could operate the machine yet it was built to create
       | a man's voice
        
         | jlnho wrote:
         | [flagged]
        
       | lbriner wrote:
       | Something somebody told me was that it seems really amazing but
       | without the host prompting the listener as to the phrase, "she
       | saw me", most of the time you wouldn't know what it was saying.
       | 
       | I heard a sample of "Say, good afternoon radio audience", then
       | the Voder produces something very similar, but listen to it
       | without the prompt and you would have to guess what it meant.
       | 
       | A Derren Brown kind of trick :-)
        
       | userbinator wrote:
       | Another vocal-tract-model synth that showed up on HN a while ago:
       | https://news.ycombinator.com/item?id=18912628
        
       ___________________________________________________________________
       (page generated 2023-07-18 23:00 UTC)