[HN Gopher] Voder Speech Synthesizer
___________________________________________________________________
Voder Speech Synthesizer
Author : CyborgCabbage
Score : 217 points
Date : 2023-07-18 12:29 UTC (10 hours ago)
(HTM) web link (griffin.moe)
(TXT) w3m dump (griffin.moe)
| mwcampbell wrote:
| I first heard the Voder as the first sample on the Klatt Record
| [1]. Unfortunately, there it's credited solely to Homer Dudley;
| neither Bell Telephone Laboratories nor women like Helen Harper,
| who operated the machine, were mentioned.
|
| [1]: http://www.festvox.org/history/klatt.html
| henjodottech wrote:
| Odd indeed not to mention the artist playing the sample. Also
| odd that the OP article about the instrument does not mention
| the inventor.
| chaosprint wrote:
| Interesting! Thanks for sharing.
|
| I have added this to my feature list for https://glicol.org
|
| The source code looks fairly straightforward. Very cool.
|
| ```js
| function makeFormantNode(ctx, f1, f2) {
|   const sinOsc = ctx.createOscillator();
|   sinOsc.type = 'sawtooth';
|   sinOsc.frequency.value = 110;
|   sinOsc.start();
|
|   const bandPass = ctx.createBiquadFilter();
|   bandPass.type = 'bandpass';
|   bandPass.frequency.value = (f1 + f2) / 2;
|   bandPass.Q.value = ((f1 + f2) / 2) / (f2 - f1);
|
|   const gainNode = ctx.createGain();
|   gainNode.gain.value = 0.0;
|
|   sinOsc.connect(bandPass);
|   bandPass.connect(gainNode);
|   gainNode.connect(ctx.destination);
|
|   return {
|     start() {
|       gainNode.gain.setTargetAtTime(0.75, ctx.currentTime, 0.015);
|     },
|     stop() {
|       gainNode.gain.setTargetAtTime(0.0, ctx.currentTime, 0.015);
|     },
|     panic() {
|       gainNode.gain.cancelScheduledValues(ctx.currentTime);
|       gainNode.gain.setTargetAtTime(0, ctx.currentTime, 0.015);
|     },
|   };
| }
|
| function makeSibilanceNode(ctx) {
|   const buffer = ctx.createBuffer(1, NOISE_BUFFER_SIZE, ctx.sampleRate);
|   const data = buffer.getChannelData(0);
|   for (let i = 0; i < NOISE_BUFFER_SIZE; ++i) {
|     data[i] = Math.random();
|   }
|
|   const noise = ctx.createBufferSource();
|   noise.buffer = buffer;
|   noise.loop = true;
|
|   const noiseFilter = ctx.createBiquadFilter();
|   noiseFilter.type = 'bandpass';
|   noiseFilter.frequency.value = 5000;
|   noiseFilter.Q.value = 0.5;
|
|   const noiseGain = ctx.createGain();
|   noiseGain.gain.value = 0.0;
|
|   noise.connect(noiseFilter);
|   noiseFilter.connect(noiseGain);
|   noiseGain.connect(ctx.destination);
|   noise.start();
|
|   return {
|     start() {
|       noiseGain.gain.setTargetAtTime(0.75, ctx.currentTime, 0.015);
|     },
|     stop() {
|       noiseGain.gain.setTargetAtTime(0.0, ctx.currentTime, 0.015);
|     },
|     panic() {
|       noiseGain.gain.cancelScheduledValues(ctx.currentTime);
|       noiseGain.gain.setTargetAtTime(0, ctx.currentTime, 0.015);
|     },
|   };
| }
|
| function initialize() {
|   audioCtx = new (window.AudioContext || window.webkitAudioContext)();
|   audioNodes['a'] = makeFormantNode(audioCtx, 0, 225);
|   audioNodes['s'] = makeFormantNode(audioCtx, 225, 450);
|   audioNodes['d'] = makeFormantNode(audioCtx, 450, 700);
|   audioNodes['f'] = makeFormantNode(audioCtx, 700, 1000);
|   audioNodes['v'] = makeFormantNode(audioCtx, 1000, 1400);
|   audioNodes['b'] = makeFormantNode(audioCtx, 1400, 2000);
|   audioNodes['h'] = makeFormantNode(audioCtx, 2000, 2700);
|   audioNodes['j'] = makeFormantNode(audioCtx, 2700, 3800);
|   audioNodes['k'] = makeFormantNode(audioCtx, 3800, 5400);
|   audioNodes['l'] = makeFormantNode(audioCtx, 5400, 7500);
|   audioNodes[' '] = makeSibilanceNode(audioCtx);
| }
| ```
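As an aside on the snippet above: the Q assignment follows the standard band-pass relation Q = centerFrequency / bandwidth. A small sketch (plain JavaScript, no Web Audio needed; the key-to-band table is copied from the snippet) that prints the effective filter parameters per key:

```javascript
// Frequency bands assigned to the keys in the snippet above.
const bands = [
  ['a', 0, 225], ['s', 225, 450], ['d', 450, 700], ['f', 700, 1000],
  ['v', 1000, 1400], ['b', 1400, 2000], ['h', 2000, 2700],
  ['j', 2700, 3800], ['k', 3800, 5400], ['l', 5400, 7500],
];

// A band-pass filter covering [f1, f2] is specified by its center
// frequency and Q = centerFrequency / bandwidth, as in the snippet.
function bandParams(f1, f2) {
  const center = (f1 + f2) / 2;
  return { center, q: center / (f2 - f1) };
}

for (const [key, f1, f2] of bands) {
  const { center, q } = bandParams(f1, f2);
  console.log(`'${key}': center ${center} Hz, Q ${q.toFixed(2)}`);
}
```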
| hahamrfunnyguy wrote:
| Just checked out GLICOL. It's quite cool! Is there MIDI support
| or any plans to add it?
| jcpst wrote:
| I have wanted to try one of these - the playable soft-synth is
| great.
| kypro wrote:
| This is quite off topic, but it reminded me of something I have
| been thinking about recently - perhaps at the limit all highly
| capable narrow AI systems must become generally intelligent.
|
| I was thinking about the complexity of expression in TTS voice
| synthesizers recently and it struck me just how difficult a
| problem that is.
|
| To be as expressive as a human the AI model would need to fully
| "understand" the context of what is being said. Consider how a
| phrase like "I hate you" can be said in a loving way between
| friends sharing a joke at each other's expense, vs being said with
| anger or in sadness.
|
| It got me wondering if all sufficiently complex problems require
| models to be generally intelligent - at least in the sense that
| they have deep, nuanced models of the world.
|
| For example, perhaps for a self-driving car to be as "good" as a
| human it actually needs to be generally intelligent, in that it
| needs to understand that it's appropriate to drive differently in
| an emergency situation vs a leisurely weekend drive through a
| scenic part of town. When driving through my city after 8PM on
| the weekend I tend to drive slower and more cautiously because I
| know drunk people often walk out in front of my car - would a
| good self-driving car not need to understand these nuances of the
| world too?
|
| This is interesting because it highlights just how important
| human understanding is to accurately conveying expression in a
| voice synthesizer. While I'd argue modern voice synthesizers
| have been more intelligible than this for some time, the
| expressiveness of this machine has probably only recently been
| rivalled by state-of-the-art AI models.
| mietek wrote:
| Yes. In Iain M. Banks's Culture, even the guns are generally
| intelligent.
| progbits wrote:
| Probably to some degree, but for your two examples I would
| argue that isn't necessary:
|
| For TTS, the "tone" is something you should encode in the input
| rather than have TTS figure out. I can imagine ebook > LLM >
| annotated text with speakers, emotions etc > TTS. So the TTS
| can remain rather dumb.
|
| For the self-driving car, it shouldn't know cultural norms and
| be "more careful" sometimes. It should always know how much it
| sees and what stopping distance it can get with max braking and
| its reaction time, and adjust accordingly.
|
| Agreed on stuff like emergencies etc.
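The stopping-distance rule above is simple kinematics: reaction distance plus braking distance. A minimal sketch, with illustrative numbers (not any real vehicle's parameters):

```python
def stopping_distance(speed_mps: float, reaction_s: float, decel_mps2: float) -> float:
    """Total stopping distance: reaction distance v*t plus braking distance v^2 / (2a)."""
    return speed_mps * reaction_s + speed_mps ** 2 / (2 * decel_mps2)

# 50 km/h with a 0.2 s sensing/actuation delay and 8 m/s^2 of braking.
v = 50 / 3.6  # m/s
print(f"{stopping_distance(v, 0.2, 8.0):.1f} m")  # about 14.8 m
```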
| kypro wrote:
| > For the self-driving car, it shouldn't know cultural norms
| and be "more careful" sometimes. It should always know how
| much it sees and what stopping distance it can get with max
| braking and its reaction time and adjust accordingly.
|
| I used to live next to two schools. In the morning before
| school the pavement and road outside my house was always full
| of school kids on bikes. During this time I'd drive with the
| assumption that at any moment a bike could drive out in front
| of my car because those kids were nuts and often did.
|
| But to assume this generally just to be safe would be
| extremely inconvenient. In reality, if I see a group of bikers
| wearing lycra I will assume they're competent bikers. While
| I'll still drive carefully, I won't assume they're about to
| pull out in front of my car.
|
| If self-driving cars operate with the assumption that every
| pedestrian is drunk and every bike on the road is ridden by a
| 12-year-old schoolboy, then no one will use them. Do self-driving
| cars try to do this currently? If I jaywalked in front of a
| Tesla, is it designed to always be able to stop in time?
| marcosdumay wrote:
| I'd expect self-driving cars to have much better sensors and
| reaction times than we do, and as a consequence not to need to
| choose between those risks and actually carrying people from
| one point to another.
|
| But they will probably be way slower than people on streets
| that are just at the side of sidewalks and full of
| pedestrians.
| Gordonjcp wrote:
| > I'd expect self driving cars to have much better
| sensors and reaction times than we do
|
| That is never going to happen.
| BizarroLand wrote:
| I think our current gen AI is only 1 piece of the puzzle.
|
| This gen understands how to put words together to satisfy its
| internal requirement to please the instruction it is given, but
| it has no volition of its own and no drive it arrived at through
| its own cognition.
|
| I believe GAI will need to have multiple current gen systems
| running simultaneously, (in unison if not in harmony) simply to
| form a subconscious layer that a truly next gen AI would then
| pick and choose from.
| colanderman wrote:
| The recently released Soma Terra synthesizer contains a key-per-
| formant synthesis mode which operates like the Voder:
| https://somasynths.com/terra/ (Ctrl+F "Voder" in the manual)
| bsza wrote:
| Wolfgang von Kempelen (creator of the fake chess automaton known
| as the Turk) made a similar thing in the 18th century. [0] It had
| multiple reeds tuned to the same frequency - conceptually similar
| to the Voder. It might not be coincidence that Bell Labs
| developed this, given that Bell himself had also made attempts to
| improve the design, which is how he ended up inventing the
| telephone.
|
| [0]
| https://en.wikipedia.org/wiki/Wolfgang_von_Kempelen%27s_spea...
| bradrn wrote:
| A short explanation as to how this works:
|
| The voice can be modeled using two main components. The vocal
| cords are a periodic source of sound, which is then filtered by
| the mouth and tongue to produce vowel sounds [0]. The filter can
| be modeled as a set of band-pass filters, each of which let
| through a specific band of frequencies -- these are called
| 'formants' in acoustic phonetics. Different vowel sounds are
| produced by combining formants at different pitches in a
| systematic way [1]. You can hear this yourself by very slowly
| moving your mouth from saying an 'eeeee' sound to an 'ooooo'
| sound: if you listen carefully, you can hear one formant changing
| pitch while the others stay the same. (I like [2] as an intro to
| this kind of stuff.)
|
| The Voder works by having one key for each band-pass filter,
| i.e. for each frequency band. Pressing multiple keys adds the
| resulting sounds, producing an output sound with distinct
| formants. If you use the right formants, the resulting sound is
| very similar to that produced by a human mouth saying a specific
| vowel! Software such as the vowel editor in Praat [3] takes it
| further by allowing selection of formants from a standard vowel
| chart.
|
| [0] Consonantal sounds are a bit more complicated, since they
| tend to involve various different noise sources and transient
| disturbances of the sound. For instance, /S/ (the 'sh' sound) is
| noise of a lower frequency than /s/. I can't work out how Harper
| produced the difference between those two sounds in the video --
| it seems to be impossible to do this with the live demo. In fact,
| any sort of pitch control seems to be impossible in the demo.
|
| [1] This is how overtone singing and throat singing works!
| Selectively amplifying one formant gives the impression that
| you're singing that note at the same time as the 'base' pitch. In
| fact, if you do that, your vocal cords are producing a pitch plus
| all its overtones, while your mouth is enhancing one overtone
| while filtering out all the others.
|
| [2] https://newt.phys.unsw.edu.au/jw/voice.html
|
| [3] https://www.fon.hum.uva.nl/praat/ -- probably also available
| from your favourite Linux distro!
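The source-filter picture described above can be sketched in a few lines: a harmonic-rich periodic source, plus a bank of band-pass filters, one per formant. A minimal pure-Python sketch (the formant frequencies and Q values are illustrative guesses for an /a/-like vowel, not measured values; the biquad coefficients follow the RBJ audio EQ cookbook):

```python
import math

def bandpass_coeffs(fc, q, fs):
    """Biquad band-pass coefficients (RBJ audio EQ cookbook, 0 dB peak gain)."""
    w0 = 2 * math.pi * fc / fs
    alpha = math.sin(w0) / (2 * q)
    b0, b1, b2 = alpha, 0.0, -alpha
    a0, a1, a2 = 1 + alpha, -2 * math.cos(w0), 1 - alpha
    return (b0 / a0, b1 / a0, b2 / a0, a1 / a0, a2 / a0)

def biquad(signal, coeffs):
    """Run a direct-form-I biquad over a list of samples."""
    b0, b1, b2, a1, a2 = coeffs
    out, x1, x2, y1, y2 = [], 0.0, 0.0, 0.0, 0.0
    for x in signal:
        y = b0 * x + b1 * x1 + b2 * x2 - a1 * y1 - a2 * y2
        x2, x1, y2, y1 = x1, x, y1, y
        out.append(y)
    return out

fs = 16000
# Sawtooth "glottal source" at 110 Hz, one second: a pitch plus all its overtones.
source = [2.0 * ((110 * n / fs) % 1.0) - 1.0 for n in range(fs)]
# Two illustrative formant bands for an /a/-like vowel (assumed values).
formant1 = biquad(source, bandpass_coeffs(700.0, 5.0, fs))
formant2 = biquad(source, bandpass_coeffs(1100.0, 5.0, fs))
vowel = [a + b for a, b in zip(formant1, formant2)]
```

Pressing more Voder keys would correspond to summing more filtered copies of the same source.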
| stavros wrote:
| Apparently the Voder had a pitch pedal:
|
| https://imgz.org/i9TzhzWu/
| bradrn wrote:
| Ah, that would explain it. Thanks for finding that image!
| stavros wrote:
| No problem! It's from a video linked in a thread below (the
| extended World's Fair presentation).
| luguenth wrote:
| There's also a very nice simulation where you can play with
| the different parts of the vocal tract:
|
| https://imaginary.github.io/pink-trombone/
| jmiskovic wrote:
| I made a fork with few more features, it might even work on
| your phone browser:
|
| https://jmiskovic.github.io/voicebox
| civilitty wrote:
| Thank you for this. I had a lot of fun scaring my cat in
| bed and it inspired me to become a late middle aged opera
| savant.
| tomcam wrote:
| I actually am a late middle-aged opera savant but sadly I
| have no cat to scare
| 0xFFOOFF wrote:
| [dead]
| JKCalhoun wrote:
| I was skeptical that you could even _type_ an intelligible
| phonetic "She saw me" with only two phonemes let alone give it
| the rise and fall demonstrated.
|
| I've played with the SP0256 speech synthesis IC and found
| constructing intelligible words challenging even with all the
| phonemes available on that silicon.
|
| This extended video has me thinking it probably was legit though:
|
| https://youtu.be/TsdOej_nC1M
| slmnsmk wrote:
| Re: the author of that -
|
| Hey I know the person who made this!
|
| Thanks for sharing, it really was a labor of love. I remember
| Griffin being super excited about how it turned out. They are
| really passionate about the worlds fair!
| jvm___ wrote:
| Someone was selling a vocoder on eBay, so they made a video of
| the vocoder describing its own selling features.
|
| https://www.youtube.com/watch?v=5kc-bhOOLxE
| beingforthebene wrote:
| I may be wrong, but a vocoder is entirely different than a
| Voder.
| lacrimacida wrote:
| The intonation is very good in a way that modern speech
| synthesizers don't get quite right.
| p-e-w wrote:
| What do you mean? Text-to-speech systems from the past few
| years are indistinguishable from an actual human voice.
| lacrimacida wrote:
| I agree that they got pretty good but there's still something
| that they get wrong, their intonation is a kind of passable
| average. If you want to be able to distinguish them from
| actual human speech, pay close attention to
| intonation/inflection. They're still very usable; I'm not
| claiming otherwise.
| stavros wrote:
| I can't hear much difference in the Studio voice:
|
| https://cloud.google.com/text-to-speech/docs/wavenet
|
| I'm fairly sure I couldn't tell Studio voices and real
| people apart in a blind test.
| id0ntw4ntit wrote:
| It would be a good test. I don't think they are yet
| indistinguishable, for what it's worth.
| JoeDaDude wrote:
| I've been interested in how these were actually played. If anyone
| has access to the material used to train the operators, I'd love
| to hear about it.
|
| BTW, there was one fellow who built one, something I'd like to
| try someday. See his recreation here:
|
| https://www.youtube.com/watch?v=gv9m0Z7mhXY
| joezydeco wrote:
| The Voder was part of a much larger Bell Labs project, one that
| eventually developed into one of the first unbreakable encrypted
| telephony systems used in World War II.
|
| https://99percentinvisible.org/episode/vox-ex-machina/
| zzzeek wrote:
| only a woman could operate the machine yet it was built to create
| a man's voice
| jlnho wrote:
| [flagged]
| lbriner wrote:
| Something somebody told me was that it seems really amazing but
| without the host prompting the listener as to the phrase, "she
| saw me", most of the time you wouldn't know what it was saying.
|
| I heard a sample of "Say, good afternoon radio audience", then
| the Voder produced something very similar, but listening to it
| without the prompt you would have to guess what it meant.
|
| A Derren Brown kind of trick :-)
| userbinator wrote:
| Another vocal-tract-model synth that showed up on HN a while ago:
| https://news.ycombinator.com/item?id=18912628
___________________________________________________________________
(page generated 2023-07-18 23:00 UTC)