[HN Gopher] Hertz-dev, the first open-source base model for conv...
       ___________________________________________________________________
        
       Hertz-dev, the first open-source base model for conversational
       audio
        
       Author : mnk47
       Score  : 269 points
       Date   : 2024-11-03 23:30 UTC (23 hours ago)
        
 (HTM) web link (si.inc)
 (TXT) w3m dump (si.inc)
        
       | mnk47 wrote:
       | Repo: https://github.com/Standard-Intelligence/hertz-dev
        
       | wg0 wrote:
        | So it is kind of like an LLM, but an audio LLM where the prompt
        | is audio and the generated output is audio too?
        
         | ryukoposting wrote:
         | Yes, as far as I can tell that's exactly what's happening.
        
       | lordofgibbons wrote:
       | Can it effectively be used as a TTS model?
        
         | Tepix wrote:
         | It doesn't know about text.
        
       | BrandiATMuhkuh wrote:
       | That's really cool. I'm currently exploring VUI (Voice User
       | Interface) and this might come in handy.
       | 
       | I might be a bit biased (did my PhD exploring how VUI can
       | persuade humans), but I think VUI is "the future" of computer
        | interaction. If it's not the future, then at least it adds a new
       | group of people (kids + elderly people) as potential users.
        
         | wwwlouishinofun wrote:
          | Yes, and there are blind people as well.
        
       | wwwlouishinofun wrote:
       | Tesla's approach to pure vision-based autonomous driving--
       | temporarily setting aside lidar and other sensors--seems designed
       | to make this technology more accessible and scalable. By focusing
       | on a vision-only model, they can accelerate adoption and gather
       | large datasets for quicker iterations. Once the vision-based
       | system reaches a mature stage, I imagine Tesla might reintegrate
       | additional sensor data, like lidar or radar, to refine their
       | autonomous driving suite, making it even more robust and closer
       | to perfection.
       | 
       | Additionally, I've been exploring an idea about voice interaction
       | systems. Currently, most voice interactions are processed by
       | converting voice input into text, generating a text-based
       | response, and then turning this text back into audio. But what if
       | we could train the system to respond directly in voice, without
       | involving text at all? If developed to maturity, this model could
       | produce responses that feel more natural and spontaneous,
       | possibly diverging from traditional text-to-speech outputs.
       | Natural speech has unique syntax and rhythm, not to mention
       | dialect and tone variations, which could make a purely voice-
       | trained system fascinating and more human-like.
       | 
       | Could you let me know if your current voice interaction model
       | follows the standard speech-to-text-to-speech process, or if
       | there is exploration in voice-to-voice processing?
        
         | nicholas-cc wrote:
         | I'm one of the devs. Our model is fully voice-to-voice, no text
         | was involved in the making of hertz-dev for exactly this
         | reason.
        
           | oidar wrote:
           | So essentially this is voice input to voice output? Can you
           | change gender/age/accent? Does it track prosodic information?
           | I've been waiting for something like this.
        
             | nicholas-cc wrote:
             | Hertz-dev is a base model, meaning it's just trained to
             | predict the next token of audio. If your prompt is an old
             | male voice with a British accent, the model will most
             | likely continue speaking in an old male voice with a
             | British accent. Being a base model, hertz-dev is easily
             | finetunable for specific tasks - it would be a simple
             | change to add manual configurations for the
             | gender/age/accent.
        
               | hunter2_ wrote:
               | I assume this mirroring is due to symmetry being more
               | typical than not among the training data, and if instead
               | trained with contrived diversity (e.g., males only
               | conversing with females) then the output of the base
               | model would follow suit without pulling any levers?
               | 
               | It's interesting to think about what complete diversity
               | (i.e., no tendencies toward homogeneous conversation
               | partners whatsoever among training data) would yield,
               | given that it's trying to deliver whatever is most
               | probable.
        
               | modeless wrote:
               | I'm interested to hear more detail about approaches to
               | adding manual controls for speaker characteristics or
               | emotion or other things you might want to vary. What
               | techniques do you have in mind?
        
               | vessenes wrote:
                | I'll jump in here - as a former New Englander, the
                | cheerful, helpful tone of all modern voice LLMs
                | infuriates me. And the slow speed. And the over-
                | explanations. ChatGPT Advanced Voice can be induced to
                | talk more quickly, less sycophantically, and, if I like,
                | in a not-bad regional accent; essentially I want it to
                | mirror my tone better. But those inducements don't stick
                | between sessions.
                | 
                | On the technical side, having some sort of continuation
                | or summarization loop seems interesting to me as a
                | product feature. It's not enough to build a company off
                | of, though. But it would be nice.
        
           | wwwlouishinofun wrote:
            | Oh, you have already built the project I was planning.
            | Right now, do you think the difficulty in improving the
            | model lies in voice data, computing power, or algorithm
            | optimization? I personally think that if you want to push
            | this to its limit, you don't need to remove the background
            | sound from the original audio; outputting audio mixed with
            | background sound could even yield generations with
            | background music.
            | 
            | If you use completely unprocessed speech data (including
            | YouTube speech with background music), I think the ceiling
            | is higher, but the demands on computing power are much
            | greater. If you don't have the money to buy more GPUs, just
            | apply voice noise reduction first.
        
         | vanviegen wrote:
         | I think you're describing ChatGPT Advanced Voice Mode (or
         | Realtime API) in your second paragraph.
        
           | throwaway314155 wrote:
           | They were so busy inventing they forgot to do a basic Google
           | search to see if it had already been done.
        
       | Dawny33 wrote:
       | Congrats, team.
       | 
       | Does Hertz support multi-lingual audio right now?
        
         | nicholas-cc wrote:
         | Yes
        
         | wwwlouishinofun wrote:
         | I'm going to try it in Chinese
        
       | awinter-py wrote:
       | what is up with the first sample? and/or am I having a stroke
        
         | spuz wrote:
         | Pay attention to the given prompt length in the examples. The
          | first 2 seconds of the first example are a real human speaking.
         | Everything after is generated by the model. It produces what
         | almost sounds like real human speech mimicking the voice of the
          | input, but it's currently at a level of something like GPT-2 in
         | terms of meaningful words.
        
       | reissbaker wrote:
       | This is really cool. FWIW, existing open-source TTS engines are
       | _really_ bad in comparison to what you have here: I know this is
        | voice-to-voice, but I think there'd be a lot of appetite to get
       | this to also be multimodal and accept text (essentially making it
       | a really good TTS model, in addition to a great voice-to-voice
       | model).
       | 
       | I suppose someone could hack their way around the problem by
       | finetuning it to essentially replay Piper (or whatever) output,
       | only with more natural prosody and intonation. And then have the
       | text LLM pipe to Piper, and Piper pipe to Hertz-dev. But it would
       | be pretty useful to have it accept text natively!
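        | 
        | Roughly the chain I have in mind (the Piper flags below are
        | taken from its README and may differ by version; the hertz-dev
        | step is entirely hypothetical, not the repo's actual API):
        | 
        |     # Sketch: text LLM output -> Piper TTS -> a hertz-dev
        |     # finetune that re-voices the flat TTS audio with natural
        |     # prosody. Only the Piper call is real; "revoice" is a
        |     # made-up placeholder for such a finetune.
        |     import subprocess
        | 
        |     text = "Reply produced by the text LLM."
        | 
        |     # Piper reads text on stdin and writes a wav file.
        |     subprocess.run(
        |         ["piper", "--model", "en_US-lessac-medium",
        |          "--output_file", "flat.wav"],
        |         input=text.encode(), check=True,
        |     )
        | 
        |     # Hypothetical hertz-dev finetune conditioned on flat.wav,
        |     # re-generating the same words with better prosody:
        |     # revoice("flat.wav", "natural.wav")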
        
         | netdevnet wrote:
          | They are a team of 4. At that size, it's better for them to be
          | focused on one thing than stretched thin.
        
           | reissbaker wrote:
           | Eh, that depends. A small model that's voice-and-text is
           | probably more useful to most people than scaling up a voice-
           | only model: the large voice-only model will have to compete
           | on intelligence with e.g. Qwen and Llama, since it can't be
           | used in conjunction with them; whereas a small voice+text
           | model can be used as a cheap frontend hiding a larger,
           | smarter, but more expensive text-only model behind it. This
           | is an 8b model: running it is nearly free, it can fit on a
           | 4090 with room to spare.
           | 
           | On the one hand, a small team focused on voice-to-voice could
           | probably do a lot better at voice-to-voice than a small team
           | focused on voice-to-voice+text. But a small team focused on
           | making the most useful model would probably do better at that
           | goal by focusing on voice+text rather than voice-only.
        
             | netdevnet wrote:
             | Their goal is not working on what's most useful for most
             | people though. That's the domain of the big AI players.
             | They are small and so specialising works best as that's
             | where they can have an edge as a company.
             | 
             | At the end of the day, the released product needs to be
             | good and needs to be done in a reasonable amount of time. I
             | highly doubt they can do a generic model as well as a more
             | specialised one.
             | 
              | But if you think you know better than them, you could try
              | to contact them, even though it looks like they are crazy
              | laser-focused (their public email addresses are either for
              | investors or employee candidates).
        
         | PenisBanana wrote:
         | Yes, yes. This. Piper is already pretty good . . . and then
         | this.
         | 
         | It may not be _them_ doing it, though.
        
       | blixt wrote:
       | Gotta say I was confused for a second but yeah apparently si.inc
       | and ssi.inc are the domains for two different AGI companies and I
       | can only assume it's intentional?
        
         | imjonse wrote:
         | According to whois records si.inc was registered 5 days after
         | ssi.inc in June. So yes, maybe intentional.
        
           | programjames wrote:
           | But the company si.inc (Standard Intelligence) was founded
           | many months before ssi.inc (Safe Superintelligence), so they
           | likely just didn't want their domain name to get taken.
        
       | blixt wrote:
       | They say Hertz is first of its kind but Moshi is another duplex
       | audio model from earlier this year that seems to perform
          | similarly (and it runs on a MacBook):
          | https://github.com/kyutai-labs/moshi
        
         | nicholas-cc wrote:
          | Moshi is a good model to build chat applications on; this is
         | designed to be more of a proper base model with all the
         | quirkiness, naturalness, and researcher-friendliness of base
         | modeling.
        
         | a2128 wrote:
         | Moshi never released the base model, only two conversationally
         | finetuned models. They also never released training code except
         | for the codec. Though I don't see any training code for Hertz
         | either, just 3 inference notebooks, and model code full of
         | no_grad. No paper either to help me understand how this was
         | trained and what the architecture is like. So I'm not too sure
         | about researcher-friendliness unless I'm missing something.
        
           | nicholas-cc wrote:
           | We're working on a HuggingFace release that will help with
           | finetuning. We'd like to do a paper, after a larger release -
           | we're a team of 4.
        
             | netdevnet wrote:
             | Very impressive for just 4 people. What's the team
             | background and how long have you been working on this?
        
               | unit149 wrote:
               | For a rag-tag group of transcendental audiophiles
               | operating electronic circuitry, it ionizes and atomizes
               | well.
        
               | programjames wrote:
               | I'm not part of their team, but lived with them for a
               | couple months. They've been working on it for ~5 months,
               | and their background is 16-20 year olds who are too smart
               | for university.
        
         | underlines wrote:
         | - LLaMA-Omni https://github.com/ictnlp/LLaMA-Omni a speech-
         | language model built on Llama-3.1-8B-Instruct for simultaneous
         | generation of text and speech
         | 
         | - moshi https://github.com/kyutai-labs/moshi speech-text
         | foundation model using Mimi, a SOTA streaming neural audio
         | codec
         | 
         | - Mini-Omni https://github.com/gpt-omni/mini-omni multimodal
         | LLM based on Qwen2 offering speech input and output
         | 
         | - Ichigo https://github.com/homebrewltd/ichigo open research
         | project extending a text-based LLM to have native listening
         | ability, using an early fusion technique
        
       | xarope wrote:
        | The One-channel generation seems to be speaking gibberish
        | English. I'm not sure what it is supposed to represent.
       | 
       | And is the interactive generation just doing an ELIZA? i.e. "P:
       | tell us about how AI will be interesting", "A: Yeah AI will,
       | yeah, be interesting".
        
       | briansm wrote:
        | The codec parameters remind me of the ~300 bps NRV military
        | speech codec from 2010. It also uses 120 ms (~8 Hz) frames, VBR-
        | encoded using 16 kHz audio (closed source, though).
       | 
       | https://ieeexplore.ieee.org/document/5680311
        
       | codedokode wrote:
        | The voice sounds a little bit distorted, and there is often
        | noise in the background (especially noticeable when this noise
        | disappears as the voice pauses). I wonder, is it a model
        | limitation, or is it a problem with the quality of the training
        | data?
        
       | jcims wrote:
       | If the authors or anyone else that works on a voice model are in
       | here, do you ever get creeped out or feel the sounds you're
       | getting from the system have a physiological effect on you?
        
       | ryukoposting wrote:
       | The voice samples are speaking gibberish a lot of the time, but
        | sonically the voices are fantastic. They sound _human_, even if
       | it's nonsense syllables.
       | 
       | With SD and LLMs, there's a lot you can do to debug it by
       | studying the way it responds to small changes in the prompt. But,
       | since Hertz-dev is using sound as its input, it would be hard to
       | discern which token you should tweak. Of course, if it's meant to
       | be used in real time, that kind of fiddling isn't an option at
       | all. How would you go about systematically studying Hertz-dev's
       | behavior?
        
       | Jayakumark wrote:
        | What is the license on the model weights?
        
       | kunley wrote:
       | Anything more about the company, founders, affiliations..?
        
         | ttul wrote:
         | Some commits are by `nicholascc`
         | (https://github.com/nicholascc); via Twitter, he seems to be
          | Nicholas Charette. Nicholas is a first-year student at
         | Stanford. For such a young group, this is a really impressive
         | effort!
        
       | zachthewf wrote:
       | Cool, looks like this is trained on 16 million hours of audio
        | (500B tokens at ~0.11 seconds per token).
       | 
        | Even the large open-source TTS models (see F5-TTS, MaskGCT) are
       | mostly trained on very small audio datasets (say 100k hours)
       | relative to the amount of audio available on the internet, so
       | it's cool to see an open source effort to scale up training
       | significantly.
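        | 
        | Quick sanity check on that estimate (both numbers are taken
        | from the figures above and are approximate):
        | 
        |     tokens = 500e9          # ~500B audio tokens
        |     sec_per_token = 0.11    # ~0.11 s of audio per token
        |     hours = tokens * sec_per_token / 3600
        |     print(f"{hours / 1e6:.1f}M hours")  # ~15.3M, i.e. on the
        |                                         # order of 16M hours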
        
       | mazoza wrote:
       | Can one of the authors explain what this actually means from the
       | post?
       | 
       | hertz-vae: a 1.8 billion parameter transformer decoder which acts
       | as a learned prior for the audio VAE. The model uses a context of
       | 8192 sampled latent representations (17 minutes) and predicts the
       | next encoded audio frame as a mixture of gaussians. 15 bits of
       | quantized information from the next token act as semantic
       | scaffolding to steer the generation in a streamable manner.
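        | 
        | For the "mixture of gaussians" part, my rough mental model is a
        | standard mixture-density output head; a generic PyTorch sketch
        | (not the actual hertz-vae code, with made-up sizes) would be:
        | 
        |     import torch
        |     from torch.distributions import (
        |         Categorical, Independent, MixtureSameFamily, Normal)
        | 
        |     K, D = 8, 32                  # K components, D-dim latent
        |     hidden = torch.randn(1, 512)  # stand-in decoder hidden state
        |     proj = torch.nn.Linear(512, K + 2 * K * D)
        | 
        |     out = proj(hidden)
        |     logits = out[:, :K]                         # mixture weights
        |     mu = out[:, K:K + K * D].view(-1, K, D)     # means
        |     log_std = out[:, K + K * D:].view(-1, K, D) # log-stds
        | 
        |     mog = MixtureSameFamily(
        |         Categorical(logits=logits),
        |         Independent(Normal(mu, log_std.exp()), 1),
        |     )
        |     next_latent = mog.sample()                  # shape (1, D)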
        
         | programjames wrote:
         | My guess:
         | 
          | 1. `codec`: First, compress 16 kHz audio into 8 samples
         | per second with convolutions. Then, vector quantize to 128 bits
         | (probably 8 floats) to get a codec. This is not nearly enough
         | bits to actually represent the audio, it's more to represent
         | phenomes.
         | 
          | 2. `vae` -> This looks like a VAE-based diffusion model that
         | uses the codec as its prompt.
         | 
         | 3. `dev` -> This is a next-codec prediction model.
         | 
         | Put together, it probably runs like so:
         | 
         | 1. Turn your prompt into tokens with the `codec`.
         | 
         | 2. If you want s more seconds of audio, use `dev` to predict 8
         | * s more tokens.
         | 
         | 3. Turn it back into audio with the `vae` diffusion model.
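          | 
          | In code shape, my guess at the end-to-end flow (every
          | function below is a placeholder standing in for one of the
          | released checkpoints, not the repo's real API):
          | 
          |     from typing import List
          | 
          |     def codec_encode(wav_16khz: List[float]) -> List[int]:
          |         ...  # ~8 latent tokens per second of audio
          | 
          |     def dev_predict_next(tokens: List[int]) -> int:
          |         ...  # next-token prediction over the codec latents
          | 
          |     def vae_decode(tokens: List[int]) -> List[float]:
          |         ...  # decode latent tokens back to a 16 kHz waveform
          | 
          |     def generate(prompt_wav, extra_seconds):
          |         tokens = codec_encode(prompt_wav)   # 1. tokenize prompt
          |         for _ in range(8 * extra_seconds):  # 2. 8*s new tokens
          |             tokens.append(dev_predict_next(tokens))
          |         return vae_decode(tokens)           # 3. back to audio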
        
       | m11a wrote:
       | > Base models are uniquely valuable as a research product because
       | they accurately model the distribution of the data that they were
       | trained on, as opposed to models that have had substantial RL
       | tuning done to collapse their generation distributions. This
       | makes base models the best starting point to fine-tune for a
       | large number of different tasks.
       | 
       | Is this idea ('collapse of their generation distributions') a
       | researched topic? If so, under what name?
       | 
        | Sounds interesting, and maybe related to the whole continual
        | learning / how-to-finetune-properly line of work.
        
       ___________________________________________________________________
       (page generated 2024-11-04 23:02 UTC)