[HN Gopher] Hertz-dev, the first open-source base model for conv...
___________________________________________________________________
Hertz-dev, the first open-source base model for conversational
audio
Author : mnk47
Score : 269 points
Date : 2024-11-03 23:30 UTC (23 hours ago)
(HTM) web link (si.inc)
(TXT) w3m dump (si.inc)
| mnk47 wrote:
| Repo: https://github.com/Standard-Intelligence/hertz-dev
| wg0 wrote:
| So it's a kind of LLM, but an audio LLM where the prompt is
| audio and the generated output is audio too?
| ryukoposting wrote:
| Yes, as far as I can tell that's exactly what's happening.
| lordofgibbons wrote:
| Can it effectively be used as a TTS model?
| Tepix wrote:
| It doesn't know about text.
| BrandiATMuhkuh wrote:
| That's really cool. I'm currently exploring VUI (Voice User
| Interface) and this might come in handy.
|
| I might be a bit biased (did my PhD exploring how VUI can
| persuade humans), but I think VUI is "the future" of computer
| interaction. If it's not the future, then at least it adds a new
| group of people (kids + elderly people) as potential users.
| wwwlouishinofun wrote:
| Yes, there are blind people too.
| wwwlouishinofun wrote:
| Tesla's approach to pure vision-based autonomous driving--
| temporarily setting aside lidar and other sensors--seems designed
| to make this technology more accessible and scalable. By focusing
| on a vision-only model, they can accelerate adoption and gather
| large datasets for quicker iterations. Once the vision-based
| system reaches a mature stage, I imagine Tesla might reintegrate
| additional sensor data, like lidar or radar, to refine their
| autonomous driving suite, making it even more robust and closer
| to perfection.
|
| Additionally, I've been exploring an idea about voice interaction
| systems. Currently, most voice interactions are processed by
| converting voice input into text, generating a text-based
| response, and then turning this text back into audio. But what if
| we could train the system to respond directly in voice, without
| involving text at all? If developed to maturity, this model could
| produce responses that feel more natural and spontaneous,
| possibly diverging from traditional text-to-speech outputs.
| Natural speech has unique syntax and rhythm, not to mention
| dialect and tone variations, which could make a purely voice-
| trained system fascinating and more human-like.
|
| Could you let me know if your current voice interaction model
| follows the standard speech-to-text-to-speech process, or if
| there is exploration in voice-to-voice processing?
| nicholas-cc wrote:
| I'm one of the devs. Our model is fully voice-to-voice; no text
| was involved in the making of hertz-dev, for exactly this
| reason.
| oidar wrote:
| So essentially this is voice input to voice output? Can you
| change gender/age/accent? Does it track prosodic information?
| I've been waiting for something like this.
| nicholas-cc wrote:
| Hertz-dev is a base model, meaning it's just trained to
| predict the next token of audio. If your prompt is an old
| male voice with a British accent, the model will most
| likely continue speaking in an old male voice with a
| British accent. Being a base model, hertz-dev is easily
| finetunable for specific tasks - it would be a simple
| change to add manual configurations for the
| gender/age/accent.
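|
| As a rough illustration of one common way to add such controls
| (a hypothetical sketch, not actual hertz-dev code), a finetune
| could prepend a learned attribute embedding to the audio-token
| embeddings so the transformer can condition on it:
|
|     import torch
|     import torch.nn as nn
|
|     # Hypothetical conditioning module: maps a discrete attribute id
|     # (e.g. a gender/age/accent bucket) to one extra "control token"
|     # that is prepended to the sequence of audio-token embeddings.
|     class AttributeConditioner(nn.Module):
|         def __init__(self, n_attrs: int, d_model: int):
|             super().__init__()
|             self.table = nn.Embedding(n_attrs, d_model)
|
|         def forward(self, attr_ids, token_embs):
|             ctrl = self.table(attr_ids).unsqueeze(1)     # (B, 1, d_model)
|             return torch.cat([ctrl, token_embs], dim=1)  # prepend control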
| hunter2_ wrote:
| I assume this mirroring is due to symmetry being more
| typical than not among the training data, and if instead
| trained with contrived diversity (e.g., males only
| conversing with females) then the output of the base
| model would follow suit without pulling any levers?
|
| It's interesting to think about what complete diversity
| (i.e., no tendencies toward homogeneous conversation
| partners whatsoever among training data) would yield,
| given that it's trying to deliver whatever is most
| probable.
| modeless wrote:
| I'm interested to hear more detail about approaches to
| adding manual controls for speaker characteristics or
| emotion or other things you might want to vary. What
| techniques do you have in mind?
| vessenes wrote:
| I'll jump in here - as a former new englander, the
| cheerful helping tone of all modern voice llms infuriates
| me. And the slow speed. And the over explanations.
| ChatGPT Advanced can be induced to talk more quickly, less
| sycophantically, and, if I like, in a not-bad regional accent;
| essentially I want it to mirror my tone better.
| But those inducements don't stick between sessions.
|
| On the technical side, having some sort of continuation or
| summarization loop seems interesting to me as a product
| feature. It's not enough to build a company off of, though.
| But it would be nice.
| wwwlouishinofun wrote:
| Oh, you've built the project I was planning. Currently, do
| you think the difficulty in improving the model lies in voice
| data, computing power, or algorithm optimization? Personally,
| I think that to push this to its limit you don't need to
| remove the background sound from the original audio;
| outputting audio mixed with background sound as new audio
| could even yield background music.
|
| If you use completely unprocessed speech data (including
| YouTube speech with background music), I think the potential
| is higher, but the demands on computing power are much
| greater. If you can't afford the GPUs, just apply voice noise
| reduction first.
| vanviegen wrote:
| I think you're describing ChatGPT Advanced Voice Mode (or
| Realtime API) in your second paragraph.
| throwaway314155 wrote:
| They were so busy inventing they forgot to do a basic Google
| search to see if it had already been done.
| Dawny33 wrote:
| Congrats, team.
|
| Does Hertz support multi-lingual audio right now?
| nicholas-cc wrote:
| Yes
| wwwlouishinofun wrote:
| I'm going to try it in Chinese
| awinter-py wrote:
| What is up with the first sample? And/or am I having a stroke?
| spuz wrote:
| Pay attention to the given prompt length in the examples. The
| first 2 seconds of the first example are a real human speaking.
| Everything after that is generated by the model. It produces
| what almost sounds like real human speech, mimicking the voice
| of the input, but it's currently at a level of something like
| GPT-2 in terms of meaningful words.
| reissbaker wrote:
| This is really cool. FWIW, existing open-source TTS engines are
| _really_ bad in comparison to what you have here: I know this is
| voice-to-voice, but I think there'd be a lot of appetite to get
| this to also be multimodal and accept text (essentially making it
| a really good TTS model, in addition to a great voice-to-voice
| model).
|
| I suppose someone could hack their way around the problem by
| finetuning it to essentially replay Piper (or whatever) output,
| only with more natural prosody and intonation. And then have the
| text LLM pipe to Piper, and Piper pipe to Hertz-dev. But it would
| be pretty useful to have it accept text natively!
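|
| A rough sketch of that workaround (the hertz-dev side is a
| hypothetical finetuned "re-speak" checkpoint, not anything that
| exists today; the Piper flags follow its README, adjust for
| your install):
|
|     import subprocess
|
|     # 1. A text LLM produces the reply (details omitted).
|     reply = "Sure, I can help with that."
|
|     # 2. Piper renders it as flat but intelligible speech.
|     subprocess.run(
|         ["piper", "--model", "en_US-lessac-medium.onnx",
|          "--output_file", "flat.wav"],
|         input=reply.encode(), check=True,
|     )
|
|     # 3. A hertz-dev checkpoint finetuned to replay its input audio
|     #    would then regenerate it with more natural prosody.
|     natural_wav = hertz_respeak("flat.wav")   # hypothetical helper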
| netdevnet wrote:
| They are a team of 4. At that size, it's better for them to be
| focused on one thing than stretched out
| reissbaker wrote:
| Eh, that depends. A small model that's voice-and-text is
| probably more useful to most people than scaling up a voice-
| only model: the large voice-only model will have to compete
| on intelligence with e.g. Qwen and Llama, since it can't be
| used in conjunction with them; whereas a small voice+text
| model can be used as a cheap frontend hiding a larger,
| smarter, but more expensive text-only model behind it. This
| is an 8B model: running it is nearly free, and it can fit on a
| 4090 with room to spare.
|
| On the one hand, a small team focused on voice-to-voice could
| probably do a lot better at voice-to-voice than a small team
| focused on voice-to-voice+text. But a small team focused on
| making the most useful model would probably do better at that
| goal by focusing on voice+text rather than voice-only.
| netdevnet wrote:
| Their goal is not working on what's most useful for most
| people though. That's the domain of the big AI players.
| They are small and so specialising works best as that's
| where they can have an edge as a company.
|
| At the end of the day, the released product needs to be
| good and needs to be done in a reasonable amount of time. I
| highly doubt they can do a generic model as well as a more
| specialised one.
|
| But if you think you know better than them, you could try
| to contact them, even though it looks like they are crazy laser-
| focused (their public email addresses are either for
| investors or employee candidates).
| PenisBanana wrote:
| Yes, yes. This. Piper is already pretty good . . . and then
| this.
|
| It may not be _them_ doing it, though.
| blixt wrote:
| Gotta say I was confused for a second but yeah apparently si.inc
| and ssi.inc are the domains for two different AGI companies and I
| can only assume it's intentional?
| imjonse wrote:
| According to whois records si.inc was registered 5 days after
| ssi.inc in June. So yes, maybe intentional.
| programjames wrote:
| But the company si.inc (Standard Intelligence) was founded
| many months before ssi.inc (Safe Superintelligence), so they
| likely just didn't want their domain name to get taken.
| blixt wrote:
| They say Hertz is first of its kind but Moshi is another duplex
| audio model from earlier this year that seems to perform
| similarly (and it runs on a MacBook):
| https://github.com/kyutai-labs/moshi
| nicholas-cc wrote:
| Moshi is a good model to build chat applications on; this is
| designed to be more of a proper base model, with all the
| quirkiness, naturalness, and researcher-friendliness of base
| models.
| a2128 wrote:
| Moshi never released the base model, only two conversationally
| finetuned models. They also never released training code except
| for the codec. Though I don't see any training code for Hertz
| either, just 3 inference notebooks, and model code full of
| no_grad. No paper either to help me understand how this was
| trained and what the architecture is like. So I'm not too sure
| about researcher-friendliness unless I'm missing something.
| nicholas-cc wrote:
| We're working on a HuggingFace release that will help with
| finetuning. We'd like to do a paper, after a larger release -
| we're a team of 4.
| netdevnet wrote:
| Very impressive for just 4 people. What's the team
| background and how long have you been working on this?
| unit149 wrote:
| For a rag-tag group of transcendental audiophiles
| operating electronic circuitry, it ionizes and atomizes
| well.
| programjames wrote:
| I'm not part of their team, but lived with them for a
| couple months. They've been working on it for ~5 months,
| and as for background, they're 16-20 year olds who are too
| smart for university.
| underlines wrote:
| - LLaMA-Omni https://github.com/ictnlp/LLaMA-Omni a speech-
| language model built on Llama-3.1-8B-Instruct for simultaneous
| generation of text and speech
|
| - moshi https://github.com/kyutai-labs/moshi speech-text
| foundation model using Mimi, a SOTA streaming neural audio
| codec
|
| - Mini-Omni https://github.com/gpt-omni/mini-omni multimodal
| LLM based on Qwen2 offering speech input and output
|
| - Ichigo https://github.com/homebrewltd/ichigo open research
| project extending a text-based LLM to have native listening
| ability, using an early fusion technique
| xarope wrote:
| The one-channel generation seems to be speaking gibberish
| English. I'm not sure what it is supposed to represent?
|
| And is the interactive generation just doing an ELIZA? i.e. "P:
| tell us about how AI will be interesting", "A: Yeah AI will,
| yeah, be interesting".
| briansm wrote:
| The codec parameters remind me of the ~300 bps NRV military
| speech codec from 2010. It also uses 120 ms (~8 Hz) frames, VBR-
| encoded from 16 kHz audio (closed source, though).
|
| https://ieeexplore.ieee.org/document/5680311
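|
| For comparison, using the numbers from the post (8192 latents
| spanning 17 minutes, 15 bits of quantized information per
| frame), hertz's quantized stream works out to roughly:
|
|     frames_per_sec = 8192 / (17 * 60)    # ~8 frames/sec
|     bits_per_sec = frames_per_sec * 15   # ~120 bps of quantized scaffolding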
| codedokode wrote:
| The voice sounds a little bit distorted, and there is often
| noise in the background (especially noticeable when this noise
| disappears as the voice pauses). I wonder: is that a model
| limitation, or a problem with the quality of the training data?
| jcims wrote:
| If the authors or anyone else that works on a voice model are in
| here, do you ever get creeped out or feel the sounds you're
| getting from the system have a physiological effect on you?
| ryukoposting wrote:
| The voice samples are speaking gibberish a lot of the time, but
| sonically the voices are fantastic. They sound _human_, even if
| it's nonsense syllables.
|
| With Stable Diffusion and LLMs, there's a lot you can do to
| debug a model by studying the way it responds to small changes
| in the prompt. But,
| since Hertz-dev is using sound as its input, it would be hard to
| discern which token you should tweak. Of course, if it's meant to
| be used in real time, that kind of fiddling isn't an option at
| all. How would you go about systematically studying Hertz-dev's
| behavior?
| Jayakumark wrote:
| What is the license on the model weights?
| kunley wrote:
| Anything more about the company, founders, affiliations..?
| ttul wrote:
| Some commits are by `nicholascc`
| (https://github.com/nicholascc); via Twitter, he seems to be
| Nicholas Charette. Nicholas is a first year student at
| Stanford. For such a young group, this is a really impressive
| effort!
| zachthewf wrote:
| Cool, looks like this is trained on 16 million hours of audio
| (500B tokens at ~0.11 seconds per token).
|
| Even the large open source TTS models (see F5 TTS, Mask GCT) are
| mostly trained on very small audio datasets (say 100k hours)
| relative to the amount of audio available on the internet, so
| it's cool to see an open source effort to scale up training
| significantly.
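|
| A quick sanity check on that back-of-the-envelope figure:
|
|     tokens = 500e9          # ~500B tokens
|     sec_per_token = 0.11    # ~8-9 codec frames per second
|     hours = tokens * sec_per_token / 3600
|     print(f"{hours / 1e6:.1f}M hours")   # ~15.3M, i.e. roughly 16M hours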
| mazoza wrote:
| Can one of the authors explain what this actually means from the
| post?
|
| hertz-vae: a 1.8 billion parameter transformer decoder which acts
| as a learned prior for the audio VAE. The model uses a context of
| 8192 sampled latent representations (17 minutes) and predicts the
| next encoded audio frame as a mixture of gaussians. 15 bits of
| quantized information from the next token act as semantic
| scaffolding to steer the generation in a streamable manner.
| programjames wrote:
| My guess:
|
| 1. `codec`: First, compress 16 kHz-samplerate audio down to 8
| samples per second with convolutions. Then, vector-quantize to
| 128 bits (probably 8 floats) to get a codec. This is not nearly
| enough bits to actually represent the audio; it's more to
| represent phonemes.
|
| 2. `vae` -> This looks like a VAE-based diffusion model that
| uses the codec output as its prompt.
|
| 3. `dev` -> This is a next-codec prediction model.
|
| Put together, it probably runs like so:
|
| 1. Turn your prompt into tokens with the `codec`.
|
| 2. If you want s more seconds of audio, use `dev` to predict 8
| * s more tokens.
|
| 3. Turn it back into audio with the `vae` diffusion model.
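|
| A minimal end-to-end sketch of that guess, with hypothetical
| wrapper names (the real entry points are the repo's inference
| notebooks):
|
|     import torchaudio
|
|     # Hypothetical wrappers around the three released checkpoints.
|     codec = HertzCodec.load("hertz-codec")   # audio -> ~8 tokens/sec
|     dev = HertzDev.load("hertz-dev")         # next-token prediction
|     vae = HertzVae.load("hertz-vae")         # tokens -> audio
|
|     # 1. Encode the prompt audio (16 kHz mono) into codec tokens.
|     wav, sr = torchaudio.load("prompt.wav")
|     tokens = codec.encode(wav)
|
|     # 2. Want s more seconds? Predict roughly 8 * s more tokens.
|     s = 10
|     tokens = dev.generate(tokens, max_new_tokens=8 * s)
|
|     # 3. Decode the full token sequence back into a waveform.
|     out = vae.decode(tokens)
|     torchaudio.save("continuation.wav", out, sr)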
| m11a wrote:
| > Base models are uniquely valuable as a research product because
| they accurately model the distribution of the data that they were
| trained on, as opposed to models that have had substantial RL
| tuning done to collapse their generation distributions. This
| makes base models the best starting point to fine-tune for a
| large number of different tasks.
|
| Is this idea ('collapse of their generation distributions') a
| researched topic? If so, under what name?
|
| Sounds interesting, and maybe related to the whole continual
| learning / how-to-finetune-properly line of work.
___________________________________________________________________
(page generated 2024-11-04 23:02 UTC)