[HN Gopher] Generating music in the waveform domain (2020)
___________________________________________________________________
Generating music in the waveform domain (2020)
Author : jszymborski
Score : 64 points
Date : 2024-03-26 14:38 UTC (8 hours ago)
(HTM) web link (sander.ai)
(TXT) w3m dump (sander.ai)
| tgv wrote:
| Isn't that usually called the time domain?
| DiogenesKynikos wrote:
| The author makes a distinction between two different modeling
| approaches:
|
| 1. Representing music as a series of notes (with additional
| information about dynamics, etc.), and then building a model
| that transforms this musical score into sound.
|
| 2. Modeling the waveform itself directly, without reference to
| a score. This waveform could presumably be in either the
| frequency or the time domain, but the author chooses to use the
| time domain.
|
| The author's terminology is a bit confusing, but I think that
| they mean option 2 when they say "the waveform domain."
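|
| A tiny sketch (invented data) of what those two
| representations look like as inputs, which is what the
| modeling choice comes down to:
|
|     import numpy as np
|
|     # 1. Symbolic/score representation: discrete note events.
|     #    (note name, onset in seconds, duration in seconds)
|     score = [("C4", 0.0, 0.5), ("E4", 0.5, 0.5), ("G4", 1.0, 1.0)]
|
|     # 2. Waveform representation: raw time-domain samples.
|     sr = 16000
|     t = np.arange(2 * sr) / sr
|     waveform = 0.3 * np.sin(2 * np.pi * 261.63 * t)  # 2 s of C4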
| earthnail wrote:
| The frequency domain is a misleading name. I assume you're
| referring to STFT, or spectrogram, which is a series of
| windowed time segments transformed into the frequency domain.
|
| But both you and OP are right in that the waveform domain is
| usually called the time domain.
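|
| A minimal sketch (not from the article) of what an STFT does:
| overlapping windowed time segments, each transformed into the
| frequency domain. The frame length and hop size below are
| arbitrary illustrative choices.
|
|     import numpy as np
|
|     def stft(x, frame_len=1024, hop=256):
|         # Slice the signal into overlapping, windowed time segments.
|         window = np.hanning(frame_len)
|         frames = [x[i:i + frame_len] * window
|                   for i in range(0, len(x) - frame_len + 1, hop)]
|         # Transform each segment into the frequency domain.
|         return np.stack([np.fft.rfft(f) for f in frames])
|
|     # e.g. one second of a 440 Hz tone at a 16 kHz sample rate
|     t = np.arange(16000) / 16000.0
|     spectrogram = np.abs(stft(np.sin(2 * np.pi * 440 * t)))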
| munificent wrote:
| The two piano examples, where the second one has its phase
| randomized, are also an excellent illustration of why allpass
| filters, which change the phase but not amplitude of all
| frequencies, are a building block for digital reverbs. The
| second piano example, with the randomized phases, sounds more
| blurred out and almost reverb-y.
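|
| A minimal sketch of one such filter, assuming the classic
| Schroeder allpass form (not anything from the article): the
| magnitude response is flat, but each frequency is delayed by
| a different amount, which is what smears transients in a
| reverb.
|
|     import numpy as np
|
|     def schroeder_allpass(x, delay, g=0.7):
|         # H(z) = (-g + z^-delay) / (1 - g * z^-delay):
|         # unit gain at every frequency, phase varies with frequency.
|         y = np.zeros_like(x)
|         buf = np.zeros(delay)  # delay line for the internal signal w[n]
|         for n in range(len(x)):
|             w_delayed = buf[n % delay]
|             w = x[n] + g * w_delayed
|             y[n] = -g * w + w_delayed
|             buf[n % delay] = w
|         return y
|
| A few of these in series, with mutually prime delay lengths,
| already sounds noticeably reverb-like.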
| hammock wrote:
| >change the phase but not amplitude of all frequencies
|
| Kind of what walls do as well, so it makes sense.
| anigbrowl wrote:
| Why do people keep doing this? Musicians who want an
| accompanist/virtual producer still want control over the
| orchestration, tonality, and shaping of sounds. Even karaoke
| machines use a signal pipeline to blend the singer's voice with
| the backing track. Generating finished waveforms is only good for
| elevator music.
| kastnerkyle wrote:
| Research into "pure" unconditional generation can often lead to
| gains in the conditional setting. See literally any GAN
| research, VQ-VAE, VAE, diffusion, etc - all started from
| "unconditional/low information" pretty much. Both directly (in
| terms of modeling) and indirectly (by forcing you to really
| reason about what conditioning is telling you about the
| modeling, and what's in the data), these approaches really
| force you to think about what it means to just "make music".
|
| Also, I think artistic uses (such as Dadabots, who heavily used
| SampleRNN) show clearly that "musicians" like interesting
| tools, even if uncontrolled in some cases. Tools that exactly
| execute an idea are important (DAW-like), but so are the
| novelty-generating machines that (many) unconditional
| generators end up being. Jukebox is another nice example of
| this.
|
| On the "good for elevator music" comment - the stuff I've heard
| from these models is rarely relaxing enough to be in any
| elevator I would ride. But there are snippets of inspiration in
| there for sure.
|
| Generally, I do favor controllable models with lots of input
| knobs and conditioning for direct use, but there's space for
| many different approaches in pushing the research forward.
|
| Different creators will work all kinds of odd models into their
| workflows, even things that are objectively less "high
| quality", and not really controllable. To me, that's a great
| thing and reason enough to keep pushing unsupervised learning
| forward.
| alexahn wrote:
| Seems like a reasonable way to try to design an AGI. Maybe the
| real Turing test is whether an intelligent system enjoys and
| seeks out novel music.
| p1esk wrote:
| A lot of humans would fail such a test.
| datashaman wrote:
| Most. Pop can't be novel.
| coldtea wrote:
| This is not even wrong.
|
| From avant-garde and experimental to soundtracks and commercial
| electronica, artists in all kinds of genres have used methods,
| libraries and tools for direct generation of waveforms, whether
| other processing happens to them afterwards (manipulation,
| effects, and so on) or they're the final result (there's also a
| big "generative music" scene, both academic and artistic). And
| that has been going on for decades. Of course, recently many
| have also started using AIs to produce generative music - with
| the API spitting out a final "waveform".
|
| > _Even karaoke machines use a signal pipeline to blend the
| singer 's voice with the backing track. Generating finished
| waveforms is only good for elevator music._
|
| Perhaps you have the kind of music played at the Grand Ole Opry
| or something in mind.
|
| Here are some trivial ways to use generated finished waveforms,
| sticking with the AI case alone:
|
| - take the AI final result, sample it, and use it as you would
| loops from records or something like Splice.
|
| - train the AI yourself, set parameters, tweak it, and the
| result is generative music you've produced (a genre that has
| existed since at least the 60s, and is quite the opposite of
| "elevator music")
|
| - use the generated music as a soundtrack for your film or
| video or video game
| kastnerkyle wrote:
| On the loops / sampling front: I always thought RAVE
| [0][1][2] was a very interesting approach, one that really
| embraces latent spaces and sample/stretch-type manipulations
| in the waveform space.
|
| [0] https://github.com/acids-ircam/RAVE?tab=readme-ov-file
|
| [1] https://www.youtube.com/watch?v=dMZs04TzxUI
|
| [2] https://www.youtube.com/watch?v=jAIRf4nGgYI
| blt wrote:
| People working on this problem: have diffusion models taken off
| in this field too?
| p1esk wrote:
| Yes: https://github.com/riffusion/riffusion
| brcmthrowaway wrote:
| Any dummies explanation of what diffusion is?
| benanne wrote:
| I've since moved on to work primarily on diffusion models, so
| I have a series of blog posts about that topic as well!
|
| - https://sander.ai/2022/01/31/diffusion.html is about the
| link between diffusion models and denoising autoencoders, IMO
| the easiest to understand out of all interpretations;
|
| - https://sander.ai/2023/07/20/perspectives.html covers a slew
| of different perspectives on diffusion models (including the
| "autoencoder" one).
|
| In a nutshell, diffusion models break up the difficult task
| of generating natural signals (such as images or sound) into
| many smaller partial denoising tasks. This is done by
| defining a corruption process that gradually adds noise to an
| input until all of the signal is drowned out (this is the
| "diffusion"), and then learning how to invert that process
| step-by-step.
|
| This is not dissimilar to how modern language models work:
| they break up the task of generating text into a series of
| easier next-word-prediction tasks. In both cases, the model
| only solves a small part of the problem at a time, and you
| apply it repeatedly to generate a signal.
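|
| A minimal sketch (not the author's code, just the common
| "predict the added noise" formulation) of that idea; the
| model, noise schedule, and tensor shapes here are placeholder
| assumptions:
|
|     import torch
|
|     def diffusion_loss(model, x0, num_steps=1000):
|         # x0: clean signals, shape (batch, samples).
|         # Pick a random corruption level per example.
|         t = torch.randint(0, num_steps, (x0.shape[0],))
|         # Noise schedule: how much signal survives at step t.
|         alpha_bar = torch.cos(0.5 * torch.pi * t / num_steps) ** 2
|         noise = torch.randn_like(x0)
|         # The "diffusion": mix the clean signal with noise.
|         x_t = (alpha_bar.sqrt()[:, None] * x0
|                + (1 - alpha_bar).sqrt()[:, None] * noise)
|         # The model only has to undo one small step:
|         # predict the noise that was added.
|         return ((model(x_t, t) - noise) ** 2).mean()
|
| Sampling then starts from pure noise and applies the trained
| model repeatedly, removing a little of the noise at each step.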
| albertzeyer wrote:
| Since 2020, there are a number of new models, for example:
|
| - Stable Audio: https://stability-ai.github.io/stable-audio-demo/
| https://www.stableaudio.com/
|
| - MusicGen: https://ai.honu.io/papers/musicgen/
|
| - MusicLM: https://google-research.github.io/seanet/musiclm/examples/
| p1esk wrote:
| But none of these is a significant improvement over Jukebox. I
| think at this point everyone is waiting for Jukebox2.
| GaggiX wrote:
| SunoAI is very good in my opinion.
| p1esk wrote:
| I tried a dozen different prompts in Suno AI to generate
| some music - it completely ignored them. It just generated
| some simple pop sounding tunes every time. I'm not
| impressed.
| spyder wrote:
| Stable Audio and MusicGen sound better than Jukebox.
|
| But the best so far is Suno.ai ( https://app.suno.ai ),
| especially with their V3 model: they have very impressive
| results. The fidelity is not studio quality, but they're
| getting very close.
|
| It's very likely based on the TTS model they released
| earlier (Bark), but trained on more data and at higher
| resolution.
|
| https://github.com/suno-ai/bark
| gorkish wrote:
| For any other DSP people tearing their hair out at the author's
| liberal terminology, whenever you see "wave" or "waveform" the
| author is talking about non-quadrature time domain samples.
|
| I feel this work would be a lot better if it was built around a
| more foundational understanding of DSP before unleashing the
| neural nets on data that is arguably the worst possible
| representation of signal state. But then again there are people
| training on the raw bitstream from a CCD camera instead of image
| sequences so maybe hoping for an intermediate format that can be
| understood by humans has no place in the future!
| kajecounterhack wrote:
| Aside from using "waveform" to mean "time domain" the
| terminology the author used in the blog post is consistent with
| what I've seen in audio ML research. In what ways would you
| suggest improving the representation of the signal?
| jabagonuts wrote:
| Are you a musician? Have you ever used a DAW like Cubase or
| Pro Tools? If not, have you ever tried the FOSS (GPLv3)
| Audacity audio editor [1]? "Wave" and "waveform" are
| colloquial terminology, so the terms are familiar to anyone
| in the industry as well as to your average hobbyist.
|
| Additionally, PCM [2] is at the heart of many of these tools,
| and is what is converted between digital and analog for real-
| world use cases.
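|
| A minimal sketch (the file name is just for illustration) of
| what that PCM representation looks like in code: a plain array
| of time-domain amplitude samples at a fixed sample rate, which
| is exactly the "waveform" being modeled here.
|
|     from scipy.io import wavfile
|
|     # hypothetical 16-bit PCM WAV file
|     rate, samples = wavfile.read("example.wav")
|     print(rate)           # e.g. 44100 samples per second
|     print(samples.dtype)  # e.g. int16 amplitude values
|     print(samples.shape)  # (num_samples,) or (num_samples, channels)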
|
| This is literally how the ear works [3], so before arguing that
| this is the "worst possible representation of signal state,"
| try listening to the sounds around you and think about how it
| is that you can perceive them.
|
| [1] https://manual.audacityteam.org/man/audacity_waveform.html
|
| [2] https://en.wikipedia.org/wiki/Pulse-code_modulation
|
| [3] https://www.nidcd.nih.gov/health/how-do-we-hear
| nick__m wrote:
| According to your link, the ear mostly works in the frequency
| domain:
|
|     Once the vibrations cause the fluid inside the cochlea
|     to ripple, a traveling wave forms along the basilar
|     membrane. Hair cells--sensory cells sitting on top of
|     the basilar membrane--ride the wave. Hair cells near
|     the wide end of the snail-shaped cochlea detect higher-
|     pitched sounds, such as an infant crying. Those closer
|     to the center detect lower-pitched sounds, such as a
|     large dog barking.
|
| It's really far from PCM.
| kastnerkyle wrote:
| The direct counter-argument to "worst representation" is
| usually "representation with the fewest assumptions", and the
| waveform as shown here gets close. Though recording
| environment, equipment, how the sound actually gets digitized,
| etc. also come into play, there are relatively few assumptions
| in the "waveform" setup described here.
|
| I would say in the neural network literature at large, and in
| audio modeling particularly, this continual back and forth of
| pushing DSP-based knowledge into neural nets, on the
| architecture side or data side, versus going "raw-er" to force
| models to learn their own versions of DSP-style transforms has
| been and will continue to be a see-saw, as we try to find what
| works best, driven by performance on benchmarks with certain
| goals in mind.
|
| These types of push-pull movements also dominate computer
| vision (where many of the "correct" DSP approaches fell away to
| less-rigid, learned proxies), and language modeling
| (tokenization is hardly "raw", and byte-based approaches to
| date lag behind smart tokenization strategies), and I think
| every field that approaches learning from data will have
| various swings over time.
|
| CCD bitstreams are also not "raw", so people will continue to
| push down in representation while making bigger datasets and
| models, and the rollercoaster will continue.
| sevagh wrote:
| What signal representation would you prefer they use? Waveform-
| based models became popular generally _after_ STFT-based
| models.
___________________________________________________________________
(page generated 2024-03-26 23:00 UTC)