[HN Gopher] Generating music in the waveform domain (2020)
       ___________________________________________________________________
        
       Generating music in the waveform domain (2020)
        
       Author : jszymborski
       Score  : 64 points
       Date   : 2024-03-26 14:38 UTC (8 hours ago)
        
 (HTM) web link (sander.ai)
 (TXT) w3m dump (sander.ai)
        
       | tgv wrote:
       | Isn't that usually called the time domain?
        
         | DiogenesKynikos wrote:
         | The author makes a distinction between two different modeling
         | approaches:
         | 
         | 1. Representing music as a series of notes (with additional
         | information about dynamics, etc.), and then building a model
         | that transforms this musical score into sound.
         | 
         | 2. Modeling the waveform itself directly, without reference to
         | a score. This waveform could presumably be in either the
         | frequency or the time domain, but the author chooses to use the
         | time domain.
         | 
         | The author's terminology is a bit confusing, but I think that
         | they mean option 2 when they say "the waveform domain."
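          | 
          | A toy illustration of the difference (my own sketch, not from
          | the article):
          | 
          |     import numpy as np
          | 
          |     # option 1: symbolic score of (MIDI pitch, start, dur)
          |     score = [(60, 0.0, 0.5), (64, 0.5, 0.5), (67, 1.0, 1.0)]
          | 
          |     # option 2: the waveform itself, raw time-domain samples
          |     fs = 16000
          |     t = np.arange(int(0.5 * fs)) / fs
          |     waveform = np.sin(2 * np.pi * 440 * t)  # just numbers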
        
           | earthnail wrote:
            | "Frequency domain" is a somewhat misleading name here. I
            | assume you're referring to the STFT, or spectrogram, which is
            | a series of windowed time segments transformed into the
            | frequency domain.
            | 
            | But both you and OP are right that the waveform domain is
            | usually called the time domain.
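            | 
            | As a rough sketch of what that windowing looks like in code
            | (my own toy example, assuming scipy; not from the article):
            | 
            |     import numpy as np
            |     from scipy.signal import stft
            | 
            |     fs = 16000                        # sample rate (Hz)
            |     t = np.arange(fs) / fs            # 1 second of audio
            |     x = np.sin(2 * np.pi * 440 * t)   # time-domain waveform
            | 
            |     # windowed segments -> frequency domain per window
            |     f, frames, Z = stft(x, fs=fs, nperseg=1024)
            |     spectrogram = np.abs(Z)   # phase is in np.angle(Z)
            |     print(spectrogram.shape)  # (freq bins, time frames)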
        
       | munificent wrote:
        | The two piano examples, where the second one has the phase
        | randomized, are also an excellent example of why allpass filters,
       | which change the phase but not amplitude of all frequencies, are
       | a building block for digital reverbs. The second piano example
       | with the randomized phases sounds more blurred out and almost
       | reverb-y.
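        | 
        | A classic Schroeder allpass section looks roughly like this (my
        | own toy sketch, not code from the article; the delay length and
        | gain are arbitrary):
        | 
        |     import numpy as np
        | 
        |     def schroeder_allpass(x, delay=441, g=0.7):
        |         # flat magnitude response, frequency-dependent phase:
        |         # y[n] = -g*x[n] + x[n-delay] + g*y[n-delay]
        |         y = np.zeros(len(x))
        |         for n in range(len(x)):
        |             xd = x[n - delay] if n >= delay else 0.0
        |             yd = y[n - delay] if n >= delay else 0.0
        |             y[n] = -g * x[n] + xd + g * yd
        |         return y
        | 
        | Chaining a few of these (plus comb filters) is the core of the
        | classic Schroeder reverb.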
        
         | hammock wrote:
         | >change the phase but not amplitude of all frequencies
         | 
          | Kind of what walls do as well, so it makes sense
        
       | anigbrowl wrote:
       | Why do people keep doing this? Musicians who want an
       | accompanist/virtual producer still want control over the
       | orchestration, tonality, and shaping of sounds. Even karaoke
       | machines use a signal pipeline to blend the singer's voice with
       | the backing track. Generating finished waveforms is only good for
       | elevator music.
        
         | kastnerkyle wrote:
         | Research into "pure" unconditional generation can often lead to
          | gains in the conditional setting. See literally any GAN
          | research, VQ-VAE, VAE, diffusion, etc. - pretty much all of
          | them started from the "unconditional / low information"
          | setting. Both directly (in
         | terms of modeling) and indirectly (by forcing you to really
         | reason about what conditioning is telling you about the
         | modeling, and what's in the data), these approaches really
         | force you to think about what it means to just "make music".
         | 
         | Also, I think artistic uses (such as Dadabots, who heavily used
         | SampleRNN) show clearly that "musicians" like interesting
         | tools, even if uncontrolled in some cases. Tools to exactly
         | execute an idea are important (DAW-like), but so are novelty
         | generating machines like (many) unconditional generators end up
         | being. Jukebox is another nice example of this.
         | 
         | On the "good for elevator music" comment - the stuff I've heard
         | from these models is rarely relaxing enough to be in any
         | elevator I would ride. But there are snippets of inspiration in
         | there for sure.
         | 
         | Generally, I do favor controllable models with lots of input
         | knobs and conditioning for direct use, but there's space for
         | many different approaches in pushing the research forward.
         | 
          | Different creators will work all kinds of odd models into their
         | workflows, even things that are objectively less "high
         | quality", and not really controllable. To me, that's a great
         | thing and reason enough to keep pushing unsupervised learning
         | forward.
        
         | alexahn wrote:
         | Seems like a reasonable way to try to design an AGI. Maybe the
         | real Turing test is whether an intelligent system enjoys and
         | seeks out novel music.
        
           | p1esk wrote:
           | A lot of humans would fail such a test.
        
             | datashaman wrote:
             | Most. Pop can't be novel.
        
         | coldtea wrote:
         | This is not even wrong.
         | 
          | From avant-garde and experimental to soundtracks and commercial
          | electronica, artists in all kinds of genres have used methods,
          | libraries, and tools for direct generation of waveforms, whether
          | other processing happens to them afterwards (manipulations,
          | effects, and so on) or they're the final result (there's also a
          | big "generative music" scene, both academic and artistic). And
          | that's been the case for decades now. Of course, recently many
          | have also started using AIs to produce generative music - with
          | the API spitting out a final "waveform".
         | 
         | > _Even karaoke machines use a signal pipeline to blend the
         | singer 's voice with the backing track. Generating finished
         | waveforms is only good for elevator music._
         | 
          | Perhaps you have the kind of music played at the Grand Ole Opry
          | or something in mind.
         | 
         | Here are some trivial ways to use generated finished waveforms,
         | sticking with the AI case alone:
         | 
         | - take the AI final result, sample it, and use it as you would
         | loops from records or something like Splice.
         | 
          | - train the AI yourself, set parameters, tweak it, and the
          | result is generative music you've produced (a genre that has
          | existed since at least the 60s, and is quite the opposite of
          | "elevator music")
         | 
         | - use the generated music as a soundtrack for your film or
         | video or video game
        
           | kastnerkyle wrote:
            | On the loops / sampling front: I always thought RAVE
            | [0][1][2] was a very interesting approach, one that really
            | embraces latent spaces and sample/stretch-style manipulation
            | in the waveform space.
           | 
           | [0] https://github.com/acids-ircam/RAVE?tab=readme-ov-file
           | 
           | [1] https://www.youtube.com/watch?v=dMZs04TzxUI
           | 
           | [2] https://www.youtube.com/watch?v=jAIRf4nGgYI
        
       | blt wrote:
       | People working on this problem: have diffusion models taken off
       | in this field too?
        
         | p1esk wrote:
         | Yes: https://github.com/riffusion/riffusion
        
         | brcmthrowaway wrote:
         | Any dummies explanation of what diffusion is?
        
           | benanne wrote:
           | I've since moved on to work primarily on diffusion models, so
           | I have a series of blog posts about that topic as well!
           | 
            | - https://sander.ai/2022/01/31/diffusion.html is about the
            | link between diffusion models and denoising autoencoders, IMO
            | the easiest to understand out of all interpretations.
            | 
            | - https://sander.ai/2023/07/20/perspectives.html covers a slew
            | of different perspectives on diffusion models (including the
            | "autoencoder" one).
           | 
           | In a nutshell, diffusion models break up the difficult task
           | of generating natural signals (such as images or sound) into
           | many smaller partial denoising tasks. This is done by
           | defining a corruption process that gradually adds noise to an
           | input until all of the signal is drowned out (this is the
           | "diffusion"), and then learning how to invert that process
           | step-by-step.
           | 
           | This is not dissimilar to how modern language models work:
           | they break up the task of generating text into a series of
           | easier next-word-prediction tasks. In both cases, the model
           | only solves a small part of the problem at a time, and you
           | apply it repeatedly to generate a signal.
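            | 
            | A deliberately crude numpy sketch of the idea (my own toy
            | illustration, not the author's code and not a faithful
            | DDPM/DDIM sampler; `denoise` stands in for the trained
            | network):
            | 
            |     import numpy as np
            | 
            |     # toy forward ("diffusion") process: blend the clean
            |     # signal with noise; at t=1 only noise is left
            |     def add_noise(x0, t):
            |         a = 1.0 - t                  # signal level
            |         eps = np.random.randn(*x0.shape)
            |         return a * x0 + np.sqrt(1.0 - a * a) * eps
            | 
            |     # crude sampler: repeatedly predict the clean signal,
            |     # then re-noise it at a slightly lower noise level
            |     def sample(denoise, shape, steps=50):
            |         x = np.random.randn(*shape)  # start from noise
            |         for t in np.linspace(1.0, 0.0, steps):
            |             x0_hat = denoise(x, t)
            |             x = add_noise(x0_hat, max(t - 1.0 / steps, 0.0))
            |         return x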
        
       | albertzeyer wrote:
       | Since 2020, there are a number of new models, for example:
       | 
       | - Stable Audio: https://stability-ai.github.io/stable-audio-demo/
       | https://www.stableaudio.com/
       | 
       | - MusicGen: https://ai.honu.io/papers/musicgen/
       | 
        | - MusicLM:
        | https://google-research.github.io/seanet/musiclm/examples/
        
         | p1esk wrote:
         | But none of these is a significant improvement over Jukebox. I
         | think at this point everyone is waiting for Jukebox2.
        
           | GaggiX wrote:
           | SunoAI is very good in my opinion.
        
             | p1esk wrote:
              | I tried a dozen different prompts in Suno AI to generate
              | some music - it completely ignored them. It just generated
              | some simple pop-sounding tunes every time. I'm not
              | impressed.
        
           | spyder wrote:
            | Stable Audio and MusicGen sound better than Jukebox.
            | 
            | But the best so far is Suno.ai ( https://app.suno.ai ),
            | especially with their V3 model: the results are very
            | impressive, and while the fidelity is not studio quality,
            | they're getting very close.
            | 
            | It's very likely based on the TTS model they released before
            | (Bark), but trained on more data and at higher resolution.
           | 
           | https://github.com/suno-ai/bark
        
       | gorkish wrote:
       | For any other DSP people tearing their hair out at the author's
        | liberal terminology: whenever you see "wave" or "waveform", the
        | author is talking about non-quadrature time-domain samples.
       | 
        | I feel this work would be a lot better if it were built around a
       | more foundational understanding of DSP before unleashing the
       | neural nets on data that is arguably the worst possible
       | representation of signal state. But then again there are people
       | training on the raw bitstream from a CCD camera instead of image
        | sequences, so maybe hoping for an intermediate format that can be
       | understood by humans has no place in the future!
        
         | kajecounterhack wrote:
          | Aside from using "waveform" to mean "time domain", the
         | terminology the author used in the blog post is consistent with
         | what I've seen in audio ML research. In what ways would you
         | suggest improving the representation of the signal?
        
         | jabagonuts wrote:
          | Are you a musician? Have you ever used a DAW like Cubase or Pro
          | Tools? If not, have you ever tried the FOSS (GPLv3) Audacity
          | audio editor [1]? "Waves" and "waveforms" are colloquial
          | terminology, so the terms are familiar to anyone in the
          | industry as well as to your average hobbyist.
         | 
         | Additionally, PCM [2] is at the heart of many of these tools,
         | and is what is converted between digital and analog for real-
         | world use cases.
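          | 
          | For instance (a toy sketch of my own, not from the article),
          | 16-bit PCM is just the waveform sampled and quantized to
          | integers:
          | 
          |     import numpy as np
          |     from scipy.io import wavfile
          | 
          |     fs = 44100                              # CD sample rate
          |     t = np.arange(fs) / fs                  # 1 second
          |     x = 0.5 * np.sin(2 * np.pi * 440 * t)   # A440 tone
          |     pcm = np.round(x * 32767).astype(np.int16)
          |     wavfile.write("a440.wav", fs, pcm)      # playable anywhere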
         | 
         | This is literally how the ear works [3], so before arguing that
         | this is the "worst possible representation of signal state,"
         | try listening to the sounds around you and think about how it
         | is that you can perceive them.
         | 
          | [1] https://manual.audacityteam.org/man/audacity_waveform.html
          | 
          | [2] https://en.wikipedia.org/wiki/Pulse-code_modulation
          | 
          | [3] https://www.nidcd.nih.gov/health/how-do-we-hear
        
           | nick__m wrote:
            | According to your link, the ear mostly works in the frequency
            | domain:
            | 
            |     Once the vibrations cause the fluid inside the cochlea
            |     to ripple, a traveling wave forms along the basilar
            |     membrane. Hair cells--sensory cells sitting on top of
            |     the basilar membrane--ride the wave. Hair cells near the
            |     wide end of the snail-shaped cochlea detect higher-
            |     pitched sounds, such as an infant crying. Those closer
            |     to the center detect lower-pitched sounds, such as a
            |     large dog barking.
           | 
           | It's really far from PCM.
        
         | kastnerkyle wrote:
          | The direct counter-argument to "worst representation" is
          | usually "the representation with the fewest assumptions", and
          | the waveform as shown here gets close to that. Though recording
          | environment, equipment, how the sound actually gets digitized,
          | etc. also come into play, there are relatively few assumptions
          | in the "waveform" setup described here.
         | 
         | I would say in the neural network literature at large, and in
         | audio modeling particularly, this continual back and forth of
         | pushing DSP-based knowledge into neural nets, on the
         | architecture side or data side, versus going "raw-er" to force
         | models to learn their own versions of DSP-style transforms has
         | been and will continue to be a see-saw, as we try to find what
         | works best, driven by performance on benchmarks with certain
         | goals in mind.
         | 
         | These types of push-pull movements also dominate computer
         | vision (where many of the "correct" DSP approaches fell away to
         | less-rigid, learned proxies), and language modeling
          | (tokenization is hardly "raw", and byte-based approaches to
          | date lag behind smart tokenization strategies), and I think
         | every field which approaches learning from data will have
         | various swings over time.
         | 
         | CCD bitstreams are also not "raw", so people will continue to
         | push down in representation while making bigger datasets and
         | models, and the rollercoaster will continue.
        
         | sevagh wrote:
         | What signal representation would you prefer they use? Waveform-
         | based models became popular generally _after_ STFT-based
         | models.
        
       ___________________________________________________________________
       (page generated 2024-03-26 23:00 UTC)