[HN Gopher] Neural audio codecs: how to get audio into LLMs
       ___________________________________________________________________
        
       Neural audio codecs: how to get audio into LLMs
        
       Author : karimf
       Score  : 313 points
       Date   : 2025-10-21 12:55 UTC (10 hours ago)
        
 (HTM) web link (kyutai.org)
 (TXT) w3m dump (kyutai.org)
        
       | mondainx wrote:
       | Thanks for sharing this well written post that I will share with
       | my team; we just recently started using audio/voice in our AI
       | suite and the details herein will be helpful and informative.
        
       | amelius wrote:
       | > Many LLMs have voice interfaces, but they usually work by
       | transcribing your speech, generating the answer in text, and
       | using text-to-speech to read the response out loud. That's
       | perfectly fine in many cases (...), but it's a wrapper, not real
       | speech understanding.
       | 
       | But I can say the same about tokenization. LLMs first convert
       | groups of characters to tokens, then use that to generate tokens,
       | and then convert the tokens back to characters. That's not real
       | understanding! If LLMs are so smart, we should be able to skip
       | the tokenization step.
        
         | Workaccount2 wrote:
         | Nothing is real understanding because we have no benchmark for
         | understanding because we don't mechanistically know what
         | understanding is. The best we have is people "vibe knowing" a
         | benchmark that they made up on the spot.
        
       | trollbridge wrote:
       | An ongoing question I have is why effort wasn't put into
       | tokenising speech (instead of transcribed words) and then making
       | an LLM out of that. There are huge amounts of speech available to
       | train on.
        
         | MichealCodes wrote:
         | I don't think we've had the transformer moment for audio
         | training yet, but yes, in theory audio-first models will be
         | much more capable.
        
           | trollbridge wrote:
           | Particularly interesting would be transformations between
           | tokenised audio and tokenised text.
           | 
            | I recall someone telling me once that up to 90% of
            | communication can be non-verbal, so when an LLM sticks to
            | just text, it's only getting 10% of the data.
        
         | benob wrote:
         | Audio tokenization consumes at least 4x tokens versus text. So
         | there is an efficiency problem to start with. Then is there
         | enough audio data to train a LLM from scratch?
        
           | trollbridge wrote:
            | Start an MVNO that offers cheaper phone plans and train on
            | all those phone calls.
           | 
           | There are big libraries of old speeches.
           | 
            | Simply capture all current radio/tv transmissions and train
            | on that (we've already established copyright doesn't apply
            | to LLM training, right?)
        
             | miki123211 wrote:
              | > Start an MVNO that offers cheaper phone plans and train
              | on all those phone calls.
             | 
             | q: What is 2+2?
             | 
             | A: The warranty for your car has expired...
        
           | 542354234235 wrote:
           | Don't we have tens of thousands of hours (hundreds of
           | thousands?) of closed captioned tv shows and movies? How many
           | hours of news broadcasts with transcripts do we have? Maybe I
           | just don't understand what is needed, but it seems like we
           | have a lot of data to work with.
        
             | roboror wrote:
             | Sure but that needs to be licensed
        
             | cruffle_duffle wrote:
             | Correct me if I'm wrong but you need more than just closed
             | captions. You need precise timing too. I'd think you'd need
             | the text to line up exactly with the audio so when the
             | voice makes an "A" sound the text it aligns with is "A" as
             | well.
             | 
             | So while having the closed captions saves some of the work,
             | there is probably much more needed to get everything lined
             | up.
             | 
              | But I'm absolutely not an expert at all. In fact this is
              | the first I've ever even thought about it!
        
               | vvolhejn wrote:
               | Author here. Speech-to-text is more or less solved, it's
               | easy to automatically get captions including precise
               | timestamps. For training Moshi, Kyutai's audio LLM, my
               | colleagues used whisper-timestamped to transcribe 7
               | million hours of audio.
               | 
               | See Section 4.2 in the Moshi paper:
               | https://arxiv.org/pdf/2410.00037
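                | 
                | For anyone curious, basic usage of whisper-timestamped
                | looks roughly like this (a minimal sketch; the model size
                | and file name are placeholders):
                | 
                |     import whisper_timestamped as whisper
                | 
                |     # load audio and a Whisper checkpoint
                |     audio = whisper.load_audio("interview.wav")
                |     model = whisper.load_model("small")
                | 
                |     # transcribe with word-level timestamps
                |     result = whisper.transcribe(model, audio)
                |     for segment in result["segments"]:
                |         for word in segment["words"]:
                |             print(word["start"], word["end"], word["text"])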
        
               | cruffle_duffle wrote:
               | Sweet!
        
         | ca_tech wrote:
          | There is data, but nowhere near the amount of written language,
          | which is fairly normalized and doesn't require accounting for
          | additional features such as language, dialect, intonation,
          | facial expressions, and hand gestures. Speech-to-text is used
          | as the translation layer because it throws many of those other
          | features away and contextualizes the speech into a set of
          | tokens that are much more efficient to map between languages.
        
         | mohsen1 wrote:
          | It costs more to train on audio tokens, but I'm sure we will
          | get there. Training a model on the transcript of a YouTube
          | lecture vs. training on the audio of it will make a difference.
        
         | nmfisher wrote:
         | The article is talking about doing exactly that. The key
         | question is how to convert an inherently continuous signal
         | (speech/audio) into a discrete set of tokens. A single window
         | of audio is usually somewhere between 10ms and 100ms. It's
         | difficult to squish all that information down to a single
         | "token" that represents the semantic and acoustic content for
         | that window.
         | 
         | That's why residual vector quantization is a useful technique -
         | using multiple dictionaries to quantize a single timeslice,
         | each conditioned on the previous residual level. You can also
         | quantize a signal at different frequencies.
         | 
         | There are samples towards the end of the post of their LLM
         | trained on their Mimi audio codec.
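          | 
          | A toy numpy sketch of the RVQ idea (real codebooks are learned
          | jointly with the encoder; the sizes here are made up):
          | 
          |     import numpy as np
          | 
          |     def rvq_encode(x, codebooks):
          |         # each level quantizes the residual left over by
          |         # the previous levels: one code index per level
          |         residual, codes = x.copy(), []
          |         for cb in codebooks:  # cb: (num_entries, dim)
          |             dists = np.linalg.norm(cb - residual, axis=1)
          |             idx = int(np.argmin(dists))
          |             codes.append(idx)
          |             residual = residual - cb[idx]
          |         return codes
          | 
          |     def rvq_decode(codes, codebooks):
          |         return sum(cb[i] for cb, i in zip(codebooks, codes))
          | 
          |     # one 32-dim timeslice, 4 levels of 256 entries each
          |     rng = np.random.default_rng(0)
          |     codebooks = [rng.normal(size=(256, 32)) * 0.5**lvl
          |                  for lvl in range(4)]
          |     x = rng.normal(size=32)
          |     codes = rvq_encode(x, codebooks)
          |     err = np.linalg.norm(x - rvq_decode(codes, codebooks))
          |     print(codes, err)  # error shrinks as levels are added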
        
           | CGMthrowaway wrote:
           | _> The key question is how to convert an inherently
           | continuous signal (speech /audio) into a discrete set of
           | tokens. A single window of audio is usually somewhere between
           | 10ms and 100ms. It's difficult to squish all that information
           | down to a single "token" that represents the semantic and
           | acoustic content for that window._
           | 
           | I read the article and confess some of the modeling parts
           | were above my comprehension. But I would like to add that as
           | an audio engineer, the "key question" you describe is solved,
           | just not applied to transformer models (?).
           | 
           | An experienced engineer can look at a waveform in a DAW and
           | identify specific consonants, vowels, specific words, etc
           | quite fluently. And with tools like Melodyne - which already
           | quantize audio semantically - they can identify (and
           | manipulate) pitch and formants as well, turning an O vowel
           | into an E vowel, or changing the inflection of a phrase (up-
           | speak vs down-speak, for example).
           | 
           | I don't know how to apply this to a neural codec, but it
           | seems like it shouldn't be that hard (that's my naivete
           | coming through)
        
             | PaulDavisThe1st wrote:
             | > An experienced engineer can look at a waveform in a DAW
             | and identify specific consonants, vowels, specific words,
             | etc quite fluently.
             | 
             | As an experienced DAW author, I very, very much doubt this.
             | 
             | What can be done relatively easy is to "see" or rather
             | "follow along" in the waveform when listening to the audio.
             | But I read your claim as being that someone could look at
             | the waveform (which is already decimated from the original)
             | and identify words or phonemes without hearing the
             | associated audio. I am extremely skeptical that there is
             | anyone anywhere in the world who can do this.
        
               | CGMthrowaway wrote:
               | I started in music but have since edited thousands of
               | hours of podcasts. I cannot transcribe a track by looking
               | at the waveform, except the word "um" haha. But without
               | playing the audio I can tell you where words start and
               | end, whether a peak is a B or a T or an A or an I
               | sound... And melodyne can add layers to that and tell me
               | the pitch, formants (vowels), quantize the syllables etc.
               | If I can do all this, a computer ought to be able to do
               | the same and more
        
               | spudlyo wrote:
               | Hundreds of hours here, and I can't even always reliably
               | spot my own ums. I edit as many out as I possibly can for
               | myself, my co-host and guest, as well as eliminating
               | continuation signaling phrases like "you know" and
               | "like". I also remove uninteresting asides and bits of
               | dead air. This is boring and tedious work but it makes
               | the end result considerably better I think.
               | 
               | I feel like there should be a model that can do much of
               | this for me but I haven't really looked into it,
               | ironically due to laziness, but also because I edit
               | across multiple tracks at this stage, and I'm afraid to
               | feed the model an already mixed stereo track. I'm curious
               | why you still do it manually, if you still do and if
               | you've looked into alternatives.
        
               | PaulDavisThe1st wrote:
               | > I edit as many out as I possibly can for myself, my co-
               | host and guest, as well as eliminating continuation
               | signaling phrases like "you know" and "like". I also
               | remove uninteresting asides and bits of dead air.
               | 
               | Hopefully using Ardour's "Ripple - Interview" mode :))
        
               | vvolhejn wrote:
               | I use Descript to edit videos/podcasts and it works great
               | for this kind of thing! It transcribes your audio and
               | then you can edit it as if you were editing text.
        
               | PaulDavisThe1st wrote:
               | Yeah, that stuff is just freaking amazing. I don't know
               | what the transcription quality is like, but if I was
               | doing this as a job, and it was good at transcription,
               | I'd definitely be using that all the time.
        
             | jampekka wrote:
             | > An experienced engineer can look at a waveform in a DAW
             | and identify specific consonants, vowels, specific words,
             | etc quite fluently.
             | 
             | DAWs' rendered waveforms have so little information that
             | such identification is likely impossible even in theory.
             | Telling apart plosives and vowels maybe, but not much more
             | than that.
             | 
             | I work with phoneticians and they can (sometimes) read even
             | words from suitably scaled spectrograms, but that's a lot
             | more information than in waveforms.
        
           | duped wrote:
           | > the key question is how to convert an inherently continuous
           | signal (speech/audio) into a discrete set of tokens
           | 
           | Did Claude Shannon not answer this question in 1948? You need
           | at least 1 bit per 6dB of dynamic range for each symbol and
           | 2B symbols per second where B is the bandwidth of the signal.
           | 
           | Compression techniques are all about getting below that
           | fundamental limit but it's not like this is an unsolved
           | problem. Or is 1kbaud too much for LLMs?
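            | 
            | Rough numbers for speech (assuming ~4 kHz of bandwidth and
            | ~96 dB of dynamic range, i.e. 16-bit audio):
            | 
            |     bandwidth_hz = 4000
            |     bits_per_symbol = 96 / 6            # 16 bits
            |     symbols_per_sec = 2 * bandwidth_hz  # Nyquist rate
            |     print(symbols_per_sec * bits_per_symbol)  # 128000 b/s
            | 
            | which is basically 8 kHz / 16-bit PCM; the neural codecs are
            | about getting far below that limit, not hitting it.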
        
           | generuso wrote:
           | One of the popular speech-to-text models is Whisper, which
           | starts with the conventional spectral analysis of the speech
           | signal, and then feeds the data into a Transformer model. It
           | works quite well.
           | 
           | https://openai.com/index/whisper/
           | 
           | Such approach dates back to 1940s, when people were trained
           | to read the speech from spectrograms. There is a 1947 book
           | "Visible Speech" by Potter, Kopp, and Green describing these
            | experiments. Here is a slightly more recent 1988 review of
           | the subject: "Formalizing Knowledge Used in Spectrogram
           | Reading"
           | 
           | https://apps.dtic.mil/sti/tr/pdf/ADA206826.pdf
        
       | bkitano19 wrote:
       | Awesome post!
        
         | krackers wrote:
         | Indeed, the title undersells it and I'm glad I didn't skip over
         | it, the article is basically an information-dense but
         | approachable summary of audio generation.
        
       | robviren wrote:
       | This has got to be one of the most visually pleasing explanations
       | I have seen of these concepts. Congrats!
       | 
       | I attempted some similar VQ-VAE work instead trying to tokenize
       | rendered text. I was curious if I could make a visual llm working
       | on 10 pt rendered font, but I also tried using PDF sources. The
       | basic idea was to do what more advanced diffusion image models
       | can do where they generate images of text. Make a specific image
       | text diffusion model to do completions. Further I wondered if I
       | could embed things like document type and language so you could
       | have a latent representation of text more abstracted than current
       | dictionary tokenizers. Learned a lot and thought it was all
       | beautifully displayed in this post.
        
       | crazygringo wrote:
       | This is fascinating.
       | 
       | Obviously working directly with audio is vastly more complex than
       | with text.
       | 
       | But it is very exciting to see how part of making LLMs work
       | natively with speech, is finding a codec that is maximally
       | efficient at encoding speech.
       | 
       | I even have to wonder if, at some point, we ultimately create a
       | popular voice codec usable with LLMs based not on the Fourier
       | transform or similar, but rather on some kind of set of physical
       | parameters describing vocal cord shape, tongue position,
       | throat/chest/mouth shape, etc.
       | 
       | I can imagine such a model being arrived at statistically
       | (determining the necessary number of parameters), and then almost
       | becoming "hard-coded" as a standard since human anatomy doesn't
       | change much there, beyond certain ranges.
       | 
       | I think it's called formant speech encoding, and it would be
       | interesting if LLMs wind up massively advancing that field. Since
       | I think historically it's had to do more with speech synthesis
       | than audio compression.
        
         | quinndupont wrote:
         | There's a long history of attempts at artificial speech that
         | take this approach, recreating mouth parts and vibrating air.
         | They are all pretty silly, like this work, which fails to
         | understand how writing isn't just a derivative of speech.
        
           | crazygringo wrote:
           | > _They are all pretty silly,_
           | 
           | Huh? How?
           | 
           | > _like this work which fails to understand how writing isn't
           | just a derivative of speech._
           | 
           | The whole point of the article is that writing isn't just a
           | derivative of speech. It's in the introduction.
        
         | duped wrote:
          | In speech coding/synthesis this is called a "source-filter"
          | model (decompose speech production into a sound generator in
          | the vocal folds and a filter in the vocal tract, and
          | parameterize them), and it's actually older than Tukey and
          | Cooley's rediscovery of the FFT.
        
         | vvolhejn wrote:
         | Author here, thanks for the kind words! I think such a physics-
         | based codec is unlikely to happen: in general, machine learning
         | is always moving from handcrafted domain-specific assumptions
         | to leaving as much as possible to the model. The more
         | assumptions you bake in, the smaller the space of sounds you
         | can model, so the quality is capped. Basically, modern ML is
         | just about putting the right data into transformers.
         | 
         | That being said, having a more constrained model can also lead
         | to some really cool stuff. The DDSP paper learns how to control
         | a synthesizer to mimic instruments:
         | https://arxiv.org/abs/2001.04643
         | 
         | You could probably do something similar for a speech model. The
          | result would not sound as good but you could get away with far
          | fewer parameters, because much of the modelling work is done by
         | the assumptions you put in.
         | 
         | Compare also KokoroTTS, a tiny TTS that's so tiny because it
         | uses a handcrafted system to turn text into phonemes, and then
         | just synthesizes from those phonemes:
         | https://huggingface.co/spaces/hexgrad/Kokoro-TTS
        
       | bob1029 wrote:
       | Why not normal audio codecs? How are JPEG and MP3 (i.e.,
       | DCT/MDCT) not a reasonable way to go about tokenizing spatial and
       | time domain signals for these kinds of models?
       | 
       | Each MP3 frame is entirely self-contained and can completely
       | reconstruct a few tens of milliseconds of original audio. It does
       | not require other frames to do this. I think this is the most
       | important element. At 128kbps CBR, each MP3 frame is ~418 bytes
       | and covers ~26 milliseconds of time. This is a reduction of
       | 10-11x over the raw PCM waveform. MP3 is also designed to
       | eliminate the information that humans don't seem to care about.
       | 
       | I don't know if it's possible to use 400 byte tokens in a
       | transformer model, but I would be very compelled to try.
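        | 
        | For reference, the arithmetic behind those numbers (assuming
        | 44.1 kHz / 16-bit stereo PCM as the baseline):
        | 
        |     bitrate = 128_000              # bits/s, CBR
        |     sample_rate = 44_100
        |     samples_per_frame = 1152       # MPEG-1 Layer III
        |     frame_sec = samples_per_frame / sample_rate  # ~0.026 s
        |     frame_bytes = bitrate * frame_sec / 8        # ~418
        |     pcm_bytes = samples_per_frame * 2 * 2        # 16-bit stereo
        |     print(frame_bytes, pcm_bytes / frame_bytes)  # ~418, ~11x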
        
         | PaulDavisThe1st wrote:
         | The approach in TFA encodes into a 32 dimensional space. I
         | suspect this is significantly more dimensions than any psycho-
         | acoustic compression algorithm uses. Also, throwing away
         | information that our hearing systems can't process very well is
         | not particularly useful if your goal is speech (or more
         | generally, audio) synthesis from scratch.
        
           | bob1029 wrote:
           | > throwing away information that our hearing systems can't
           | process very well is not particularly useful if your goal is
           | speech (or more generally, audio) synthesis from scratch.
           | 
           | I'm not sure I follow. If there is a set of tokens that the
           | average human cannot perceive, why wouldn't we want to
           | eliminate them from the search space? Who is the target
           | audience for this model?
        
             | CaptainOfCoit wrote:
              | Maybe things outside our audible range could
             | impact/influence things inside of our audible range?
        
             | 542354234235 wrote:
             | I imagine it would be like if there were Rosetta Stones of
             | text, written with a language you could read and a language
             | you couldn't. For your purposes, discarding the text you
             | can't read would be fine and you wouldn't lose anything.
             | But if you were ingesting a bunch into an LLM, the
             | additional text would give the LLM more context and help it
             | make connections and relate words more accurately, even if
             | you never were going to have it output anything in the
             | language you don't understand.
             | 
             | The inaudible sounds add context and additional datapoints
             | on how the audible sounds are related.
        
             | PaulDavisThe1st wrote:
             | Humans that read (at least) Indo-European languages can
             | read texts in their native language with all the vowels
             | removed. Does that suggest that it would be a good idea to
             | remove the vowels from text before using it for training
             | text-based LLMs ?
             | 
             | Presumably you want to train on as rich a set of data as
             | possible, even if some of that data is redundant or
             | irrelevant when it comes to human perception.
        
               | Tangurena2 wrote:
               | Generally, the difference between regional dialects is
                | almost all in vowels (see [0]). This is why SOUNDEX [1]
               | eliminated vowels.
               | 
               | 0 - https://www.acelinguist.com/2020/01/the-pin-pen-
               | merger.html
               | 
               | 1 - https://en.wikipedia.org/wiki/Soundex
        
         | ACCount37 wrote:
         | You _can_ try to train an adapter from a raw 400-byte MP3 frame
         | to an embedding for a given LLM (4096+ floating point numbers,
         | exact precision varies).
         | 
         | But you'd need that information to be _digestible_ for a neural
         | network. Otherwise, you 'll have a very hard time getting that
         | adapter to work.
         | 
         | As a rule: neural networks love highly redundant data, and hate
         | highly compressed data at their inputs. Tokenized text good,
         | GZIP compressed bytestream bad. But who knows, really. It's a
         | rule of thumb, not a mathematical law. So you could have some
         | success getting that MP3-based adapter to work. I've seen
         | weirder shit work.
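          | 
          | Very roughly, such an adapter could be as simple as an MLP
          | from frame bytes to an embedding (all sizes here are made up,
          | and this says nothing about whether it would train well):
          | 
          |     import torch
          |     import torch.nn as nn
          | 
          |     class Mp3FrameAdapter(nn.Module):
          |         # hypothetical: raw MP3 frame bytes -> LLM embedding
          |         def __init__(self, frame_bytes=418, d_model=4096):
          |             super().__init__()
          |             self.net = nn.Sequential(
          |                 nn.Linear(frame_bytes, 2048),
          |                 nn.GELU(),
          |                 nn.Linear(2048, d_model),
          |             )
          | 
          |         def forward(self, frames):
          |             # frames: (batch, seq, frame_bytes) in [0, 1]
          |             return self.net(frames)
          | 
          |     adapter = Mp3FrameAdapter()
          |     frames = torch.rand(1, 10, 418)  # ~260 ms of audio
          |     print(adapter(frames).shape)     # [1, 10, 4096]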
        
           | a-dub wrote:
           | if you were able to normalize and quantokenize the distinct
           | dct values in a consistent way, it could be an interesting
           | approach. so yeah, undo the bit packing but keep the front
           | end signal processing and compressed dct representation and
            | voila! something quite weird that might actually work. :)
        
         | WithinReason wrote:
         | JPEG is a good idea:
         | 
         |  _The simple, elegant approach of training convolutional neural
         | networks (CNNs) directly from RGB pixels has enjoyed
         | overwhelming empirical success. But can more performance be
         | squeezed out of networks by using different input
         | representations? In this paper we propose and explore a simple
         | idea: train CNNs directly on the blockwise discrete cosine
         | transform (DCT) coefficients computed and available in the
         | middle of the JPEG codec. Intuitively, when processing JPEG
         | images using CNNs, it seems unnecessary to decompress a
         | blockwise frequency representation to an expanded pixel
         | representation, shuffle it from CPU to GPU, and then process it
         | with a CNN that will learn something similar to a transform
         | back to frequency representation in its first layers. Why not
         | skip both steps and feed the frequency domain into the network
          | directly? In this paper we modify libjpeg to produce DCT
         | coefficients directly, modify a ResNet-50 network to
         | accommodate the differently sized and strided input, and
         | evaluate performance on ImageNet. We find networks that are
         | both faster and more accurate, as well as networks with about
         | the same accuracy but 1.77x faster than ResNet-50._
         | 
         | https://proceedings.neurips.cc/paper_files/paper/2018/file/7...
         | 
         | I suspect mp3 is also a good idea
        
         | cubefox wrote:
          | I believe language models usually use 2-byte (16-bit) tokens,
          | which corresponds to a vocabulary size of 2^16 = 65536. With
          | 400 bytes per token the vocabulary would be 2^(400*8), which
          | is an extremely large number. Way too large to be practical,
          | I assume.
        
         | vvolhejn wrote:
         | Author here. There are a few reasons, but the biggest one is
         | simply the compression ratio.
         | 
         | The OG neural audio codec SoundStream (whose first author is
         | Neil, now at Kyutai) can sound decent at 3kbps, whereas MP3
         | typically has around 128kbps, as you say. Interestingly, it was
         | originally developed for audio compression for Google Meet, not
         | for LLMs. Today's neural codecs have even better compression.
         | 
         | The more modern MP3 alternative is Opus, which can work ok at
         | 12kbps, but it's still less efficient than neural audio codecs.
         | However, these traditional codecs are a lot less CPU-hungry, so
         | they have that going for them.
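          | 
          | The bitrate of an RVQ-style codec is roughly frame rate x
          | number of codebooks x bits per codebook entry. With
          | illustrative numbers in the ballpark of modern neural codecs:
          | 
          |     import math
          | 
          |     frame_rate = 12.5      # frames per second
          |     num_codebooks = 8      # RVQ levels
          |     codebook_size = 2048   # entries per codebook
          |     bps = frame_rate * num_codebooks * math.log2(codebook_size)
          |     print(bps)  # 1100.0 bits/s, vs ~128000 for MP3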
        
           | espadrine wrote:
           | That makes sense.
           | 
           | Why RVQ though, rather than using the raw VAE embedding?
           | 
           | If I compare rvq-without-quantization-v4.png with
           | rvq-2-level-v4.png, the quality seems oddly similar, but the
           | former takes a 32-sized vector, while the latter takes two
           | 32-sized (one-hot) vectors, (2 = number of levels, 32 =
           | number of quantization cluster centers). Isn't that more?
        
             | vvolhejn wrote:
             | I had a part about this but I took it out: for compression,
             | you could keep the embeddings unquantized and it would
             | still compress quite well, depending on the embedding
             | dimension and the number of quantization levels.
             | 
             | But categorical distributions are better for modelling.
             | It's a little difficult to explain here without using
             | diagrams. The intuition is that if you try to have a model
             | predict the next _embedding_ and not the next _token_ , you
             | can't model multimodal distributions - you'll end up
             | predicting the mean of the possible continuations and not
             | the mode, which is not what you want.
             | 
             | Check out Section 5.3 and Figure 6 from PixelRNN, where
             | they discuss this phenomenon:
             | https://arxiv.org/pdf/1601.06759
             | 
             | At the bottom of the blog, I link two articles that do make
             | continuous embeddings work. One of them is the Kyutai paper
             | Continuous Audio Language Models:
             | https://arxiv.org/abs/2509.06926
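              | 
              | A one-dimensional caricature of the problem: if the valid
              | next embeddings are +1 or -1 with equal probability, the
              | best single regression output is their mean, which is
              | neither of them.
              | 
              |     import numpy as np
              | 
              |     targets = np.array([1.0, -1.0] * 500)  # bimodal
              |     # the MSE-optimal constant prediction is the mean:
              |     print(targets.mean())  # 0.0, not a valid continuation
              |     # a categorical model over {-1, +1} can instead put
              |     # 50% on each mode and sample one of them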
        
         | HarHarVeryFunny wrote:
         | Human audio perception is based on detecting the frequency
         | components, which we detect via what amounts to a filter bank
         | in the inner ear (different length hairs with different
         | resonant frequencies).
         | 
         | Speech perception builds upon frequencies and is based on
         | "formants" - the frequency bands that are attentuated via the
         | vocal tract resonances created by articulation when the speech
         | was generated. More specifically, most speech information is
         | contained in formant _changes_ since these correspond to
         | articulatory changes. There are also other articulatory
         | artifacts in speech such as the onsets of speech energy
         | corresponding to plosives ( "puh", "buh"), and the high
         | frequencies generated by fricatives like "sss".
         | 
         | One problem with embedding MP3 frames as audio tokens would be
         | that although MP3 compression is based on frequency
         | representation, you've then got quantization, huffman encoding
         | and the MP3 frame structure all on top of that, so the frame as
         | a whole is going to be more of a black box. Presumably a
         | transformer could still use MP3 frames to predict the text
         | transcription, or any arbitrary encoding of speech audio for
         | that matter (similar to how an LLM can predict text from Base64
         | representation, or vice versa), but it's certainly not making
         | it easier if the input is obfuscating the frequency components
         | and formants etc that correspond to the generating process.
         | 
         | Not having direct access to the frequency/formant information
         | is also going to make generalization more difficult since that
         | is based around formant structure and changes. When
         | articulating the same word, the specific formant frequencies
         | will differ between individuals, primarily based on vocal tract
         | length, but humans have no problem generalizing across these
         | and understanding speech from different individuals. I'm not
         | sure if an LLM only trained to predict MP3 speech from, say,
         | male adults, would necessarily have generalized enough to also
         | be able to recognize child speech or that from a speech
         | synthesizer.
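          | 
          | For a concrete sense of what gets obscured: a plain STFT
          | already exposes those frequency components directly. A crude
          | sketch with a synthetic, vowel-ish signal (all numbers made
          | up):
          | 
          |     import numpy as np
          |     from scipy.signal import spectrogram
          | 
          |     sr = 16_000
          |     t = np.arange(sr) / sr
          |     x = (np.sin(2 * np.pi * 120 * t)            # voice pitch
          |          + 0.5 * np.sin(2 * np.pi * 700 * t)    # "formant" 1
          |          + 0.3 * np.sin(2 * np.pi * 1200 * t))  # "formant" 2
          | 
          |     freqs, times, S = spectrogram(x, fs=sr, nperseg=512)
          |     # the three components show up as high-energy rows
          |     print(freqs[np.argsort(S.mean(axis=1))[-3:]])
          | 
          | whereas the same information inside an MP3 frame sits behind
          | quantization and Huffman coding.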
        
       | miki123211 wrote:
       | > Try asking any of them "Am I speaking in a low voice or a high
       | voice?" in a high-pitched voice, and they won't be able to tell
       | you.
       | 
       | I wonder how much of that is LLMs being bad, and how much is LLMs
       | being (over) aligned not to do it.
       | 
       | AFAIK, Chat GPT Voice mode had to have a lot of safeguards put on
       | it to prevent music generation, accent matching (if you sound
       | Indian, it shouldn't also sound Indian), and assuming ethnicity /
       | biasing based on accents.
       | 
       | It doesn't seem that impossible to me that some of these
       | behaviors have been aligned out of these models out of an
       | abundance of caution.
        
         | sbrother wrote:
         | I don't think it's just safeguards; they really don't seem to
         | understand pitch at all. I tried asking ChatGPT's advanced
         | voice mode to recognize a tune I was humming, and it insisted
         | it was Beethoven's 5th -- multiple times. I think it must have
         | basically tokenized my humming to "dun dun dun duuun".
        
           | bigzyg33k wrote:
           | advanced voice mode operates on audio tokens directly, it
           | doesn't transcribe them into "text tokens" as an intermediate
           | step like the original version of voice mode did.
        
             | cubefox wrote:
             | But they behave just like models which use text tokens
             | internally, which is also pointed out at the end of the
             | above article.
        
               | bigzyg33k wrote:
               | we don't know if that's due to inherent limitations of
               | the tokenisation of audio, or a byproduct of
               | reinforcement learning. In my own usage, I noticed a
               | significant degradation in capabilities over time from
               | when they initially released advanced voice mode. The
               | model used to be able to sing, whisper, imitate sounds
               | and tone just fine, but I imagine this was not intended
               | and has subsequently been stunted via reinforcement
               | learning.
               | 
               | I don't find the articles argument that this is due to
               | tokenisation convincing.
        
               | cubefox wrote:
               | They didn't say it's due to tokenization.
               | 
               | > This is likely because they're trained on a lot of data
               | generated synthetically with text-to-speech and/or
               | because understanding the tone of the voice (apparently)
               | doesn't help the models make more accurate predictions.
        
             | sbrother wrote:
             | right, but either whatever audio tokenization it's doing
             | doesn't seem to encode pitch, or there's ~nothing where
             | pitch is relevant in the training set.
        
             | oezi wrote:
              | Absolutely correct! My simple test is whether it can tell
              | the American and British pronunciations of "tomato" and
              | "potato" apart. So far it can't.
        
               | fragmede wrote:
               | Which "it" are you referring to? There are models that
               | can.
        
         | idonotknowwhy wrote:
          | Qwen3 omni transcriber can do this. It can describe the voice
          | and emotion very well.
        
           | 85392_school wrote:
           | I've also had luck with Gemini. If I made a few noises and
           | asked which one was higher pitched, it could easily tell.
        
         | tsol wrote:
         | Did they respond differently depending on what race they
         | thought you were? I'm surprised they would even do that
         | honestly. I thought they were trained on text conversations
         | which presumably wouldn't have any of that to learn from.
        
           | OisinMoran wrote:
           | You can often tell where someone is from from text alone!
           | There are plenty of idiosyncrasies even in how different
           | English speaking countries use the language.
        
             | anotherhue wrote:
             | Ah stop
        
             | fragmede wrote:
             | Like, what do you mean? Are there, like, particular
             | mannerisms that people from some regions that are hella
             | unique to those regions?
        
               | robotresearcher wrote:
               | I say old chap, what colour are your mummy's wellies?
        
               | ctxc wrote:
               | Clever!
        
               | ElevenLathe wrote:
               | You betcha!
        
           | thwarted wrote:
            | If it did, it responded based on the accent it picked up on,
            | not race, because race and accent are orthogonal; correlation
            | does not imply causation.
        
             | dotancohen wrote:
              | Are you denying that race and accent are highly correlated?
        
           | j45 wrote:
           | There are subtle differences in language where two groups can
           | be speaking English and one is having a completely different
           | conversation without saying much.
        
             | dotancohen wrote:
             | This is quite the reason my wife evolved into my ex-wife.
        
         | vvolhejn wrote:
         | Author here. I think it's more of a capability issue than a
         | safety issue. Since learning audio is still harder than
         | learning text, audio models don't generalize as well. To fix
         | that, audio models rely on combining information from text and
         | audio (having a single model that consumes/produces both text
         | and audio tokens) and the audio tokens basically end up being
         | an integrated speech-to-text/text-to-speech. This reflects my
         | colleagues' experience working on Moshi, and it seems to be the
         | case for other models too, see the Conclusion section.
         | 
         | Part of the reason can also be synthetic data: if you fine-tune
         | on data generated from text via a text-to-speech, the tone of
         | the voice doesn't have any information, so the model learns to
         | ignore it.
        
           | j45 wrote:
           | Accent detection or consciously ignoring it is a filter step.
        
           | JoshTriplett wrote:
           | Audio models for speech not understanding pitch, seems
           | similar to how text LLMs often don't understand spelling:
           | it's not what they were trying to recognize.
        
           | oezi wrote:
           | > generated from text via a text-to-speech
           | 
           | Yes, frustratingly we don't have good speech-to-text
           | (STT/ASR) to transcribe such differences.
           | 
            | I recently finetuned a TTS* to be able to emit laughter, and
            | hunting for transcriptions which include non-verbal sounds
            | was the hardest part of it. Whisper and other popular
            | transcription systems will ignore sighs, sniffs, laughs, etc.
            | and can't detect mispronunciations.
           | 
           | * = https://github.com/coezbek/PlayDiffusion
        
             | jasonjayr wrote:
             | IIRC -- the 15.ai dev was training on fan-made "My Little
             | Pony" transcriptions, specificaly because they included
             | more emotive clues in the transcription, and supported a
             | syntax to control the emotive aspect of the speech.
        
               | dotancohen wrote:
               | Where can I read about this?
        
           | smusamashah wrote:
            | There was an example of ChatGPT copying and responding in
            | the speaker's voice mid-conversation, on the OpenAI blog.
            | This was presented as an example of non-alignment.
        
           | wordglyph wrote:
            | I used aistudio and it understood pitch and even emotion
            | with an uploaded mp3.
        
         | bongodongobob wrote:
         | Hmm, the last time I played with GPT voice mode it was able to
         | do all kinds of different accents.
        
       | quinndupont wrote:
       | Y'all need to learn about the history and development of spoken
        | language and writing. Writing isn't just a copy or derivation of
        | speech. LLMs work because of the conceptual characteristics of
       | writing (consider the distinctions between ideographic,
       | logographic, alphabetical...). What a sloppy mess!
       | 
       | Read some Wittgenstein and Goodman, but especially Derrida who
       | calls this logocentrism.
        
       | daxfohl wrote:
       | Another interesting thing here is that the model presumably has
       | some understanding of the passage of time. That's one thing that
       | can be odd about chat models, in that they will respond the same
       | no matter whether you respond a second later or a month later.
       | 
       | I think even for text models, "streams" could be useful. Perhaps
       | if the LLM sees too long of a pause after explaining something
       | and asking a question, they could interject a "do you need help?"
       | or something. Pure chat GPTs don't have that ability.
        
       | daxfohl wrote:
       | I wonder if a linear-space, constant-time model like RWKV or S4
       | would work better here. For audio, I wouldn't think you'd need
       | long range context, and all-to-all mapping seems like overkill.
       | 
       | Maybe a transformer could be running in parallel, but much lower
       | frequency, where the linear model feeds it "summary" tokens once
       | per second, whose information would mostly be "text", but also
       | some hint of emotion and other cues. Then the output of this
       | could be fed back to the linear model so that it would know what
       | it was saying and with what emotion. Basically the transformer
       | would be the low frequency long range context thinker (and
       | feeler), and the linear model would translate that to and from
       | phonetics.
       | 
       | They'd be trained in parallel, so those transformer tokens would
       | attain meaning at training time, not something that would have to
       | be pre-defined. So it'd still be purely phonetic e2e, no direct
       | translation to text. It could even end up being a good way to
       | compress text for LLMs, since low-value words might have smaller
       | representation in the token.
       | 
       | Probably would never reach the level of text based LLMs for logic
       | and code and such, but that somewhat parallels humans anyway;
       | it's pretty hard to explain an algorithm in detail in plain
       | conversation.
        
         | tehnub wrote:
         | Write this paper please!
        
           | daxfohl wrote:
           | If anyone wants to buy me some GPU time I'd be happy to try
           | it out! Fair warning: my only experience in deep learning
           | thus far was training a CNN to count dots on an image, which
           | worked semi reliably up to 8, when the image was perfectly
           | square black "dots" on a perfectly white background.
        
             | smokel wrote:
             | Off-topic, but it would be great if everyone who voiced
             | their opinion on something would add a small disclaimer
             | with their actual knowledge about the subject. Thanks for
             | sharing :)
        
             | fragmede wrote:
             | Sure. what's your venmo?
        
         | vvolhejn wrote:
         | I don't know about linear models, but this kind of hierarchical
         | modelling is quite a common idea in speech research. For
         | example, OpenAI's Jukebox (2020) [1], which uses a proto-neural
         | audio codec, has three levels of encoding that get coarser and
         | coarser. They use a language model to predict continuations in
         | the coarsest level and then have models to upscale to the finer
         | levels and finally back to audio.
         | 
         | The recent MiMo-audio bunches tokens into "patches" of four
         | timesteps and has the model predict those. [2]
         | 
         | [1] https://arxiv.org/abs/2005.00341
         | 
         | [2] https://github.com/XiaomiMiMo/MiMo-Audio/blob/main/MiMo-
         | Audi...
        
       | lxe wrote:
       | I've been messing around with Higgs Audio that actually uses the
       | delay pattern. It has to apply it and then unapply it after the
       | generation. I noticed it's actually really hard to chunk and
       | stream audio correctly when you need to apply and reapply these
       | patterns essentially to the "entire" output.
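        | 
        | For anyone unfamiliar, the delay pattern just offsets each
        | codebook stream by one extra step, which is what couples
        | timesteps across codebooks and makes chunked streaming awkward.
        | A rough numpy sketch of the idea (not Higgs Audio's actual
        | code):
        | 
        |     import numpy as np
        | 
        |     def apply_delay(codes, pad=0):
        |         # codes: (num_codebooks, T); shift codebook k by k steps
        |         K, T = codes.shape
        |         out = np.full((K, T + K - 1), pad, dtype=codes.dtype)
        |         for k in range(K):
        |             out[k, k:k + T] = codes[k]
        |         return out
        | 
        |     def unapply_delay(delayed, T):
        |         # undo the shift to recover aligned codebook streams
        |         K = delayed.shape[0]
        |         return np.stack([delayed[k, k:k + T] for k in range(K)])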
        
       | mmaunder wrote:
       | Thanks for posting, I wasn't aware of Kyutai and it seems your
       | work is perfect for something I'm working on.
        
       | croemer wrote:
       | Typo: "not even a the length of one word"
        
         | vvolhejn wrote:
         | merci, will fix tomorrow
        
       | Rickasaurus wrote:
       | I wouldn't mind so much if they cheat on the way back but listen
       | in earnest. There are use cases like teaching language where
       | having the AI understand the sounds carefully matters a ton.
        
       ___________________________________________________________________
       (page generated 2025-10-21 23:00 UTC)