[HN Gopher] Music Generation AI Models
___________________________________________________________________
Music Generation AI Models
Author : peab
Score : 32 points
Date : 2025-02-09 20:34 UTC (2 hours ago)
(HTM) web link (www.maximepeabody.com)
(TXT) w3m dump (www.maximepeabody.com)
| echelon wrote:
| > code is now being written with the help of LLMs, and almost all
| graphic design uses photoshop.
|
| AI models are tools, and engineers and artists should use them to
| do more per unit time.
|
| Text prompted final results are lame and boring, but complex
| workflows orchestrated by domain practitioners are incredible.
|
| We're entering an era where small teams will have big reach.
| Small studio movies will rival Pixar, electronic musicians will
| be able to conquer any genre, and indie game studios will take on
| AAA game releases.
|
| The problem will be discovery. There will be a long tail of
| content that caters to diverse audiences, but not everyone will
| make it.
| peab wrote:
| Yes well said. Distribution networks are hard to disrupt
| bayindirh wrote:
| > Small studio movies will rival Pixar...
|
| If you think Pixar is Pixar solely because they have an in-
| house software stack, you're missing the forest for a small
| shrub.
| echelon wrote:
| They're Pixar because these movies require hundreds of
| millions of dollars to make.
|
| Good writing and good directing don't need hundreds of
| millions of dollars.
| bayindirh wrote:
| Nope, they're Pixar because they pay an insane amount of
| attention to detail, from every hair strand to every facial
| expression. One can always notice something so minute yet so
| powerful on every re-watch.
|
| That's what costs millions of dollars.
|
| Yes, they have insane technology behind them, but that's not
| what enables what they do. Humans enable it. Without the
| human touch, that technology is just a glorified tech demo.
|
| We're still keen to underestimate what a human adds to the
| process. We've become obsessed with the pursuit of
| efficiency.
| echelon wrote:
| I wholeheartedly disagree. Pixar does not have a monopoly
| on attention to detail. They're flush with cash and their
| leadership has decent taste.
|
| There are so many creators putting in intense work, and
| doing it on low budgets. You can't claim these folks
| don't have attention to detail. Check out A24, low and
| mid budget films, or independent films, and you'll see a
| wide assortment of highly meticulous storytellers.
|
| Pixar, on the other hand, isn't low or mid budget:
|       Toy Story - $30 Million
|       A Bug's Life - $120 Million
|       Toy Story 2 - $90 Million
|       Monsters, Inc. - $115 Million
|       Finding Nemo - $94 Million
|       The Incredibles - $92 Million
|       Cars - $120 Million
|       Ratatouille - $150 Million
|       WALL-E - $180 Million
|       Up - $175 Million
|       Toy Story 3 - $200 Million
|       Cars 2 - $200 Million
|       Brave - $185 Million
|       Monsters University - $200 Million
|       Inside Out - $175 Million
|       The Good Dinosaur - $200 Million
|       Finding Dory - $200 Million
|       Cars 3 - $175 Million
|       Coco - $175 Million
|       Incredibles 2 - $200 Million
|       Toy Story 4 - $200 Million
|       Onward - $175 Million
|       Soul - $150 Million
|       Luca - Unknown, but probably around $150 Million
|       Turning Red - $175 Million
|       Lightyear - $200 Million
|
| For that amount of money, they had _better_ pay attention
| to detail.
|
| Miyazaki is doing way more with much less.
|
| Voices of a Distant Star was _one person_ -- Shinkai.
| That's the kind of thing we'll see more and more of.
| Small creators reaching audiences and building studios.
| Gooseworx, psychicpebbles, Vivienne Medrano. That's the
| algorithm of tomorrow.
|
| AI, as a tool, makes this more possible. One of the first
| people to do it successfully was Joel Haver, and he's
| just the first of many to come.
| pier25 wrote:
| Are there models that generate MIDI instead of audio?
|
| IMO this would be much more useful.
| anigbrowl wrote:
| This. Generating audio en masse is everything that's wrong with
| LLMs, and people trying to use them this way demonstrate a
| fundamental misunderstanding of music. The whole attraction of
| music is separate generators in temporary harmony, whether
| rhythmic, tonal, or timbral. Generating premixed streams of
| audio ('mixed' implying more than one voice or instrument)
| completely misses the point of how music is constructed in the
| first place. Anyone advocating this approach is not worth
| listening to.
| ganoushoreilly wrote:
| While I mostly agree with you, we know that music is defined
| by the listener. Who are we to discern what is or isn't
| music? Do you have the same opinion of text or code generated
| by or with the assistance of AI?
| mdp2021 wrote:
| The poster presents criticism against an architectural
| model.
|
| > _Who are we to discern what is or isn't music?_
|
| Hopefully, people with good judgement, potentially capable
| of evaluating products.
|
| The poster clearly means "good music".
|
| > _Do you have the same opinion of text or code generated
| by or with the assistance of AI?_
|
| There you go: the same way we note that some NN generated
| text is missing crucial qualities (e.g. intelligence), or
| that some NN generated images are missing crucial qualities
| (e.g. direction), you can surely admit the possibility that
| some NN generated sound may be missing crucial qualities
| relevant to the vetting of a good critic.
| ganoushoreilly wrote:
| What is good music, though? That's the whole point. Plenty
| of people listen to stuff I would consider weird and not
| music, but to them it is.
| mdp2021 wrote:
| Well, if they call it "good music" because "they like it",
| that does not form a theory of music; whereas if they
| call it "good music" because they recognize it as an
| expression of good artistic form, and they are of
| promising judgement, then their theory could be
| translated into a generative architecture.
| ganoushoreilly wrote:
| It's up to the listener to apply whatever semantics they
| need to as justification. There is no purity test for
| music. The theory is just that: theory.
| peab wrote:
| From the artist perspective, this is correct.
|
| But there are lots of applications for music which parallel
| the applications of AI generated images - things that are
| more commercial in nature. The media is functional, for use
| cases such as commercials or social media videos, where
| people just need something for ambiance and don't want to
| deal with copyright or anything like that.
| xvector wrote:
| I don't really care about those fancy music theory terms.
|
| All that really matters is whether _users_ like what the
| generator generates.
| bongodongobob wrote:
| I almost never use MIDI, and beyond chord charts, none of the
| musicians I know write scores. No one is preventing you from
| creating in the way you like, so get off your high horse. Do
| whatever makes you happy.
| mdp2021 wrote:
| I am not sure that the internal process could not work
| through conceiving <<temporary harmony[...] rhythmic, tonal,
| timbral [etc.]>>.
|
| Furthermore, the sound itself is crucial, so perfect
| calibration of a perfect sound is definitely a part of what
| can clearly be sought (when you do not want to leave that
| to a secondary human process in the workflow).
| vunderba wrote:
| MuseNet by OpenAI used to allow you to do this - but OpenAI
| took it down over a year ago.
|
| https://openai.com/index/musenet
|
| Also, Synfire is a somewhat difficult-to-grok DAW designed
| around algorithmically generating MIDI motifs as building
| blocks for longer pieces.
|
| https://www.youtube.com/watch?v=OrtJjEiWBtI
|
| It's not particularly well-known but it's been around for many
| years.
| verst wrote:
| Lots. For example, there are dozens of models that have been
| trained specifically on Bach MIDIs to generate new Bach-style
| compositions. However, the generated MIDIs definitely do not
| sound like Bach :)
|
| I'd link to some specific examples (easy to Google or search on
| GitHub) but I can't recall which models were more successful
| than others.
| vunderba wrote:
| Almost nobody remembers it, but if you go back far enough,
| there was a Sid Meier game on the 3DO, called (appropriately
| enough) C.P.U. Bach, that algorithmically generated music in
| the style of Bach.
|
| https://www.youtube.com/watch?v=nJkPWSKuTHI
| verst wrote:
| That's awesome! First time I've seen this. And
| coincidentally until today I had never even heard of the
| 3DO console. (I myself grew up on Amiga 500)
|
| Having taken a class on Bach style composition in college -
| I think a rules engine with a random seed would certainly
| be much more successful at generating Bach style
| compositions than any neural network-based model ever will
| be.
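|
| To make the idea concrete, here's a toy sketch of a "rules
| engine with a random seed" (not Meier's algorithm - the rules
| and note choices are mine, and it assumes the mido library
| for MIDI output):
|
|     # Toy rules engine: a seeded random walk over the C major
|     # scale that prefers stepwise motion, written out as a
|     # MIDI file with mido (pip install mido).
|     import random
|     import mido
|
|     random.seed(1685)  # fixed seed -> reproducible piece
|
|     SCALE = [60, 62, 64, 65, 67, 69, 71, 72]  # C major
|
|     def next_index(i):
|         # Rule: prefer steps, allow occasional leaps of a
|         # third, and clamp to the scale's range.
|         step = random.choices([-2, -1, 1, 2],
|                               weights=[1, 4, 4, 1])[0]
|         return min(max(i + step, 0), len(SCALE) - 1)
|
|     mid = mido.MidiFile()  # default 480 ticks per beat
|     track = mido.MidiTrack()
|     mid.tracks.append(track)
|
|     i = 0
|     for _ in range(32):  # 32 eighth notes
|         n = SCALE[i]
|         track.append(mido.Message('note_on', note=n, time=0))
|         track.append(mido.Message('note_off', note=n, time=240))
|         i = next_index(i)
|
|     mid.save('toy_bach.mid')
|
| Real Bach rules (voice leading, cadences, harmonic rhythm)
| would go where next_index is; the point is that they are
| expressible as explicit constraints.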
| vunderba wrote:
| I agree, especially given how logically Bach structures
| his contrapuntal stuff. I also took a class on
| counterpoint, and the professor had the great idea of
| using Gradus ad Parnassum as our textbook. Very rewarding
| class, but there are far more approachable books on
| counterpoint these days!
| verst wrote:
| Now I'm going down the rabbit hole of using a 3DO
| emulator (Opera) and running the CPU Bach ROM. :)
|
| And here is an interesting patent that Sid Meier and Jeff
| Briggs filed for their work on C.P.U. Bach: System for
| real-time music composition and synthesis
| https://patents.google.com/patent/US5496962A/en
| verst wrote:
| Update: Got it running with RetroArch 64 using the 3DO
| Company Opera core. Found the necessary BIOS to use here:
| https://github.com/trapexit/3do-bios
|
| I'll leave the ROM search up to whoever is interested :)
| tolciho wrote:
| Uh, "do not sound like Bach"? That's a regression from what
| David Cope was doing a few decades ago now.
| kadushka wrote:
| https://www.aiva.ai generates MIDI and provides editing UI.
| TheAceOfHearts wrote:
| One obvious area of improvement will be allowing you to tweak
| specific sections of an AI generated song. I was recently playing
| around with Suno, and while the results with their latest models
| are really impressive, sometimes you just want a little bit more
| control over specific sections of a track. To give a concrete
| example: I used deepseek-r1 to generate lyrics for a song about
| assabiyyah, and then used Suno to generate the track [0]. The
| result was mostly fine, but it pronounced assabiyyah as ah-sa-BI-
| yah instead of ah-sah-BEE-yah. A relatively minor nitpick.
|
| [0] https://suno.com/song/0caf26e0-073e-4480-91c4-71ae79ec0497
| peab wrote:
| Yes. I anticipate that the open source models will pave the way
| for that, just like we have inpainting with Stable Diffusion.
|
| Fundamentally, a song can be represented as a 2D image without
| any loss.
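|
| Concretely, something like this (a minimal sketch assuming
| librosa and soundfile are installed; the complex STFT is the
| lossless "image" - the magnitude spectrograms most image-style
| models train on drop phase and need e.g. Griffin-Lim to
| invert):
|
|     import librosa
|     import soundfile as sf
|
|     # 'song.wav' is just a placeholder path
|     y, sr = librosa.load('song.wav', sr=None)
|
|     # Complex 2D array: frequency bins x time frames
|     S = librosa.stft(y, n_fft=2048, hop_length=512)
|
|     # Inverse STFT reconstructs the waveform (up to edge
|     # effects), so the 2D representation loses nothing
|     y_back = librosa.istft(S, hop_length=512)
|     sf.write('roundtrip.wav', y_back, sr)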
| rubyn00bie wrote:
| Could you elaborate on this? I'm genuinely curious about how
| one would do that.
| o_____________o wrote:
| Suno has select region editing now
| vunderba wrote:
| From the article:
|
| _> Stem Splitting: This allows one to take an existing song, and
| split the audio into distinct tracks, such as vocals, guitar,
| drums and bass. Demucs by Meta is an AI model for stem
| splitting._
|
| +1 for Demucs (free and open source).
|
| Our band went back and used Demucs-GUI on a bunch of our really
| old pre-DAW stuff - all we had was the final WAVs and it did a
| really good job splitting out drums, piano, bass, vocals, etc.
| with the htdemucs_6s model. There was some slight bleed between
| some of the stems but other than that it was seamless.
|
| https://github.com/CarlGao4/Demucs-Gui
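|
| For anyone who'd rather skip the GUI, the equivalent run is
| roughly this (a sketch assuming `pip install demucs`; the
| file name is a placeholder):
|
|     import subprocess
|
|     # 6-stem model: vocals/drums/bass/guitar/piano/other
|     subprocess.run(
|         ["demucs", "-n", "htdemucs_6s", "old_mix.wav"],
|         check=True,
|     )
|     # By default, stems should land under
|     # ./separated/htdemucs_6s/old_mix/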
| verst wrote:
| I have used htdemucs_6s a bunch, but I prefer the 4-stem
| model. The dedicated guitar and piano stems are usually full of
| really bad artifacts in the 6s model. It's still useful if you
| want to transcribe a part to sheet music, however; just not
| useful to me for music production or as a backing track.
|
| My primary use is for creating backing tracks I can play piano
| / keyboard along with (just for fun in my home). Most of the
| time I'll just use the 4s model and will keep drums, bass and
| vocals.
| vunderba wrote:
| Yeah I could see that. We had better luck with the 6-stem,
| maybe it's because we had both rhythm and lead guitar in the
| mixes, but the 4-stem version didn't work as well for us.
| verst wrote:
| It probably also depends on the channel separation for the
| individual instruments in the final mix and any effects
| applied. A stereo chorus effect on one of the instruments
| can really interfere with the separation from other
| instruments from what I can tell.
|
| Piano (or various keys), organ and some guitars (with
| effects) have a lot of frequency overlap. The model
| struggles there.
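|
| A toy illustration of the overlap problem (pure numpy, all
| numbers made up): two harmonic tones on the same fundamental
| put energy in the same FFT bins, which is exactly what a
| mask-based separator has to pull apart.
|
|     import numpy as np
|
|     SR = 16000
|     t = np.arange(SR) / SR  # one second of samples
|
|     def harmonic_tone(f0, amps):
|         # Sum of harmonics f0, 2*f0, 3*f0, ...
|         return sum(a * np.sin(2 * np.pi * f0 * (k + 1) * t)
|                    for k, a in enumerate(amps))
|
|     piano_ish = harmonic_tone(220, [1.0, 0.5, 0.3, 0.2])
|     organ_ish = harmonic_tone(220, [1.0, 0.8, 0.6, 0.4])
|
|     P = np.abs(np.fft.rfft(piano_ish))
|     O = np.abs(np.fft.rfft(organ_ish))
|
|     # Fraction of spectral energy sitting in shared bins;
|     # close to 1 means the spectra nearly coincide.
|     shared = np.minimum(P, O).sum() / np.maximum(P, O).sum()
|     print(f"shared spectral energy: {shared:.0%}")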
| xvector wrote:
| In the future we may have music gen models that dynamically
| generate a soundtrack to our lives, based on ongoing events,
| emotions, etc., as well as our preferences.
|
| If this happens, main character syndrome may get a bit worse :)
| vunderba wrote:
| Slightly related, iMuse was an early example of an interactive
| music engine that mixed and matched audio to what was happening
| on-screen in a game.
|
| https://en.wikipedia.org/wiki/IMUSE
| ipsum2 wrote:
| I wonder if this article is AI generated.
|
| > Vocal Synthesis: This allows one to generate new audio that
| sounds like someone singing. One can write lyrics, as well as
| melody, and have the AI generate an audio that can match it. You
| could even specify how you want the voice to sound like. Google
| has also presented models capable of vocal synthesis, such as
| googlesingsong.
|
| Google's singsong paper does the exact opposite: given human
| vocals, it produces a musical accompaniment.
| mdp2021 wrote:
| Given that Google is mentioned "out of the blue", that <<also>>
| seems to indicate that the mistaken word is <<vocal>>: the
| sentence was probably meant to read _[You can have vocal
| synthesis given music as an input, and] Google has also
| presented models capable of _music_ synthesis [given vocals
| as an input], such as googlesingsong_
| chaosprint wrote:
| I got into AI music back in 2017, kind of sparked by AlphaGo.
| Started by looking at machine listening stuff, like Nick Collins'
| work. Always been really curious about AI doing music live
| coding.
|
| In 2019, I built this thing called RaveForce
| [github.com/chaosprint/RaveForce]. It was a fun project.
|
| Back then, GANsynth was a big deal, looked amazing. But the sound
| quality... felt a bit lossy, you know? And MIDI generation, well,
| didn't really feel like "music generation" to me.
|
| Now, I'm thinking about these things differently. Maybe the sound
| quality thing is like MP3 at first, then it becomes "good enough"
| - like a "retina moment" for audio? Diffusion models seem to be
| pushing this idea too. And MIDI, if used the right way, could be
| a really powerful tool.
|
| Vocal synthesis and conversion are super cool. Feels like
| plugins, but next level. Really useful.
|
| But what I really want to see is AI understanding music from the
| ground up. Like, a robot learning how synth parameters work. Then
| we could do 8-bit music like the DRL breakthroughs. Not just
| training on tons of copyrighted music, making variations, and
| selling it, which is very cheap.
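|
| The "robot learning synth parameters" idea can be toy-sized
| to show the shape of it (random search rather than DRL; the
| two-parameter synth and all numbers are invented for the
| example):
|
|     import numpy as np
|
|     SR = 16000
|     t = np.arange(SR) / SR  # one second
|
|     def synth(freq, decay):
|         # Two-parameter "synth": a decaying sine
|         return np.sin(2 * np.pi * freq * t) * np.exp(-decay * t)
|
|     def spec_dist(a, b):
|         # Distance between magnitude spectra
|         A, B = np.abs(np.fft.rfft(a)), np.abs(np.fft.rfft(b))
|         return np.mean((A - B) ** 2)
|
|     rng = np.random.default_rng(0)
|     target = synth(440.0, 3.0)  # the "unknown" patch to recover
|
|     best, best_d = None, np.inf
|     for _ in range(2000):
|         p = (rng.uniform(100, 1000), rng.uniform(0.5, 10))
|         d = spec_dist(synth(*p), target)
|         if d < best_d:
|             best, best_d = p, d
|
|     print(f"freq={best[0]:.1f} Hz, decay={best[1]:.2f}")
|
| Swap the random search for a learned policy and the toy synth
| for a real one, and that's roughly the direction RaveForce
| points in.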
___________________________________________________________________
(page generated 2025-02-09 23:00 UTC)