[HN Gopher] Music Generation AI Models
       ___________________________________________________________________
        
       Music Generation AI Models
        
       Author : peab
       Score  : 32 points
       Date   : 2025-02-09 20:34 UTC (2 hours ago)
        
 (HTM) web link (www.maximepeabody.com)
 (TXT) w3m dump (www.maximepeabody.com)
        
       | echelon wrote:
       | > code is now being written with the help of LLMs, and almost all
       | graphic design uses photoshop.
       | 
       | AI models are tools, and engineers and artists should use them to
       | do more per unit time.
       | 
        | Text-prompted final results are lame and boring, but complex
        | workflows orchestrated by domain practitioners are incredible.
       | 
       | We're entering an era where small teams will have big reach.
       | Small studio movies will rival Pixar, electronic musicians will
       | be able to conquer any genre, and indie game studios will take on
       | AAA game releases.
       | 
       | The problem will be discovery. There will be a long tail of
       | content that caters to diverse audiences, but not everyone will
       | make it.
        
         | peab wrote:
          | Yes, well said. Distribution networks are hard to disrupt.
        
         | bayindirh wrote:
         | > Small studio movies will rival Pixar...
         | 
         | If you think Pixar is Pixar solely because they have an in-
         | house software stack, you're missing the forest for a small
         | shrub.
        
           | echelon wrote:
           | They're Pixar because these movies require hundreds of
           | millions of dollars to make.
           | 
           | Good writing and good directing don't need hundreds of
           | millions of dollars.
        
             | bayindirh wrote:
              | Nope, they're Pixar because they pay an insane amount of
              | attention to detail. From every hair strand to every
              | facial expression. One can always notice something so
              | minute but so powerful on every re-watch.
              | 
              | That's what costs millions of dollars.
              | 
              | Yes, they have insane technology behind them, but that's
              | not what enables what they do. Humans enable it. Without
              | the human touch, that technology is just a glorified tech
              | demo.
              | 
              | We're still keen to underestimate what a human adds to the
              | process. We've gone insane in the pursuit of efficiency.
        
               | echelon wrote:
               | I wholeheartedly disagree. Pixar does not have a monopoly
               | on attention to detail. They're flush with cash and their
               | leadership has decent taste.
               | 
               | There are so many creators putting in intense work, and
               | doing it on low budgets. You can't claim these folks
                | don't have attention to detail. Check out A24, low- and
                | mid-budget films, or independent films, and you'll see a
                | wide assortment of highly meticulous storytellers.
               | 
               | Pixar, on the other hand, isn't low or mid budget:
                | Toy Story - $30 Million
                | A Bug's Life - $120 Million
                | Toy Story 2 - $90 Million
                | Monsters, Inc. - $115 Million
                | Finding Nemo - $94 Million
                | The Incredibles - $92 Million
                | Cars - $120 Million
                | Ratatouille - $150 Million
                | WALL-E - $180 Million
                | Up - $175 Million
                | Toy Story 3 - $200 Million
                | Cars 2 - $200 Million
                | Brave - $185 Million
                | Monsters University - $200 Million
                | Inside Out - $175 Million
                | The Good Dinosaur - $200 Million
                | Finding Dory - $200 Million
                | Cars 3 - $175 Million
                | Coco - $175 Million
                | Incredibles 2 - $200 Million
                | Toy Story 4 - $200 Million
                | Onward - $175 Million
                | Soul - $150 Million
                | Luca - Unknown, but probably around $150 Million
                | Turning Red - $175 Million
                | Lightyear - $200 Million
               | 
               | For that amount of money, they had _better_ pay attention
               | to detail.
               | 
               | Miyazaki is doing way more with much less.
               | 
               | Voices of a Distant Star was _one person_ -- Shinkai.
                | That's the kind of thing we'll see more and more of.
               | Small creators reaching audiences and building studios.
               | Gooseworx, psychicpebbles, Vivienne Medrano. That's the
               | algorithm of tomorrow.
               | 
               | AI, as a tool, makes this more possible. One of the first
                | people to do it successfully was Joel Haver, and he's
                | just one of many to come.
        
       | pier25 wrote:
        | Are there models that generate MIDI instead of audio?
       | 
       | IMO this would be much more useful.
        
         | anigbrowl wrote:
         | This. Generating audio en masse is everything that's wrong with
          | LLMs, and people trying to use them this way demonstrate a
          | fundamental misunderstanding of music. The whole attraction of
          | music is separate generators in temporary harmony, whether
          | rhythmic, tonal, or timbral. Generating premixed streams of
          | audio ('mixed' implying more than one voice or instrument)
          | completely misses the point of how music is constructed in the
          | first place.
         | Anyone advocating this approach is not worth listening to.
        
           | ganoushoreilly wrote:
           | While I mostly agree with you, we know that music is defined
           | by the listener. Who are we to discern what is or isn't
           | music? Do you have the same opinion of text or code generated
           | by or with the assistance of AI?
        
             | mdp2021 wrote:
              | The poster presents criticism of an architectural
              | model.
             | 
             | > _Who are we to discern what is or isn 't music?_
             | 
             | Hopefully, people with good judgement, potentially capable
             | of evaluating products.
             | 
              | The poster clearly means "good music".
             | 
              | > _Do you have the same opinion of text or code generated
              | by or with the assistance of AI?_
             | 
             | There you go: the same way we note that some NN generated
             | text is missing crucial qualities (e.g. intelligence), or
             | that some NN generated images are missing crucial qualities
              | (e.g. direction), you can surely admit the possibility
              | that some NN generated sound may be missing crucial
              | qualities relevant to the vetting of a good critic.
        
               | ganoushoreilly wrote:
                | What is good music, though? That's the whole point.
                | Plenty of people listen to stuff I would consider weird
                | and not music at all, but to them it is.
        
               | mdp2021 wrote:
               | Well if they call it "good music" because "they like it",
               | that does not form a theory of music; whereas if they
               | call it "good music" because they recognize it as an
               | expression of good artistic form, and they are of
                | promising judgement, then their theory could be
               | translated into a generative architecture.
        
               | ganoushoreilly wrote:
               | It's up to the listener to apply whatever semantics they
               | need to as justification. There is no purity test for
                | music. The theory is just that: theory.
        
           | peab wrote:
           | From the artist perspective, this is correct.
           | 
            | But there are lots of applications for music that parallel
            | the applications of AI-generated images - things that are
            | more commercial in nature. The media is functional, for use
            | cases such as commercials or social-media videos, where
            | people just need something for the ambiance and don't want
            | to deal with copyright or anything like that.
        
           | xvector wrote:
           | I don't really care about those fancy music theory terms.
           | 
           | All that really matters is whether _users_ like what the
            | generator generates.
        
           | bongodongobob wrote:
            | I almost never use MIDI, and beyond chord charts, none of
            | the musicians I know write scores. No one is preventing you
            | from creating in the way you like; get off your high horse.
            | Do whatever makes you happy.
        
           | mdp2021 wrote:
           | I am not sure that the internal process could not work
           | through conceiving <<temporary harmony[...] rhythmic, tonal,
           | timbral [etc.]>>.
           | 
            | Furthermore, the sound itself is crucial, so perfect
            | calibration of a perfect sound is definitely part of what
            | can clearly be sought (when you do not want to leave that to
            | a secondary human process in the workflow).
        
         | vunderba wrote:
         | MuseNet by OpenAI used to allow you to do this - but OpenAI
         | took it down over a year ago.
         | 
         | https://openai.com/index/musenet
         | 
          | Also, Synfire is a somewhat difficult-to-grok DAW designed
          | around algorithmically generating MIDI motifs as building
          | blocks for longer pieces.
         | 
         | https://www.youtube.com/watch?v=OrtJjEiWBtI
         | 
         | It's not particularly well-known but it's been around for many
         | years.
        
         | verst wrote:
          | Lots. For example, there are dozens of models that have been
          | trained specifically on Bach MIDIs to generate new Bach-style
          | compositions. However, the generated MIDIs definitely do not
         | sound like Bach :)
         | 
         | I'd link to some specific examples (easy to Google or search on
         | GitHub) but I can't recall which models were more successful
         | than others.
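          | 
          | To give a flavor of how such models see the data, here is a
          | minimal sketch of flattening a MIDI file into a token sequence
          | that a sequence model could be trained on (assumes pip install
          | mido; the file path and the (pitch, delta-time) scheme are
          | illustrative, not from any particular project):
          | 
          |     # Flatten a MIDI file into (pitch, seconds-since-previous-
          |     # message) tokens - the kind of sequence a language model
          |     # can be trained on. bach.mid is a placeholder path.
          |     import mido
          | 
          |     tokens = []
          |     for msg in mido.MidiFile("bach.mid"):  # time in seconds
          |         if msg.type == "note_on" and msg.velocity > 0:
          |             tokens.append((msg.note, round(msg.time, 3)))
          | 
          |     print(tokens[:10])  # e.g. [(60, 0.0), (64, 0.25), ...]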
        
           | vunderba wrote:
           | Almost nobody remembers it, but if you go back far enough,
           | there was a Sid Meier game on the 3DO that algorithmically
            | generated music in the style of Bach, called (appropriately
            | enough) C.P.U. Bach.
           | 
           | https://www.youtube.com/watch?v=nJkPWSKuTHI
        
             | verst wrote:
             | That's awesome! First time I've seen this. And
              | coincidentally, until today I had never even heard of the
              | 3DO console. (I myself grew up on the Amiga 500.)
             | 
              | Having taken a class on Bach-style composition in college,
              | I think a rules engine with a random seed would certainly
              | be much more successful at generating Bach-style
              | compositions than any neural network-based model ever will
              | be.
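              | 
              | A toy version of what I mean by a rules engine (the rules
              | here are drastically simplified placeholders, just to show
              | the shape of the approach, not real counterpoint):
              | 
              |     # Random walk over the C major scale with a few hard
              |     # rules: start and end on the tonic, favor stepwise
              |     # motion. A real engine would encode voice leading,
              |     # harmony, cadences, etc.
              |     import random
              | 
              |     SCALE = [60, 62, 64, 65, 67, 69, 71, 72]  # C major
              | 
              |     def compose(length=16, seed=42):
              |         random.seed(seed)
              |         line = [60]                    # rule: start on C
              |         while len(line) < length - 1:
              |             i = SCALE.index(line[-1])
              |             step = random.choice([-2, -1, -1, 1, 1, 2])
              |             j = min(max(i + step, 0), len(SCALE) - 1)
              |             line.append(SCALE[j])
              |         line.append(60)                # rule: end on C
              |         return line
              | 
              |     print(compose())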
        
               | vunderba wrote:
                | I agree, especially given how logically Bach structures
                | his contrapuntal stuff. I also took a class on
                | counterpoint, and the professor had the great idea of
                | using Gradus ad Parnassum as our textbook. Very
                | rewarding class, but there are far more approachable
                | books on counterpoint these days!
        
               | verst wrote:
               | Now I'm going down the rabbit hole of using a 3DO
               | emulator (Opera) and running the CPU Bach ROM. :)
               | 
               | And here is an interesting patent that Sid Meier and Jeff
               | Briggs filed for their work on C.P.U. Bach: System for
               | real-time music composition and synthesis
               | https://patents.google.com/patent/US5496962A/en
        
               | verst wrote:
               | Update: Got it running with RetroArch 64 using the 3DO
               | Company Opera core. Found the necessary BIOS to use here:
               | https://github.com/trapexit/3do-bios
               | 
               | I'll leave the ROM search up to whoever is interested :)
        
           | tolciho wrote:
           | Uh, "do not sound like Bach"? That's a regression from what
           | David Cope was doing a few decades ago now.
        
         | kadushka wrote:
         | https://www.aiva.ai generates MIDI and provides editing UI.
        
       | TheAceOfHearts wrote:
       | One obvious area of improvement will be allowing you to tweak
       | specific sections of an AI generated song. I was recently playing
       | around with Suno, and while the results with their latest models
       | are really impressive, sometimes you just want a little bit more
       | control over specific sections of a track. To give a concrete
       | example: I used deepseek-r1 to generate lyrics for a song about
        | assabiyyah, and then used Suno to generate the track [0]. The
       | result was mostly fine, but it pronounced assabiyyah as ah-sa-BI-
       | yah instead of ah-sah-BEE-yah. A relatively minor nitpick.
       | 
       | [0] https://suno.com/song/0caf26e0-073e-4480-91c4-71ae79ec0497
        
         | peab wrote:
          | Yes. I anticipate that open source models will pave the way
          | for that, just like we have in image generation with Stable
          | Diffusion.
          | 
          | Fundamentally, a song can be represented as a 2D image
          | without any loss.
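          | 
          | A minimal sketch of that round trip (assumes pip install
          | librosa soundfile; note it is the complex STFT that makes it
          | lossless - the magnitude-only spectrogram "image" most models
          | use drops phase and needs something like Griffin-Lim to
          | invert):
          | 
          |     # Waveform <-> "2D image" round trip via the STFT.
          |     # song.wav is a placeholder path.
          |     import librosa
          |     import soundfile as sf
          | 
          |     y, sr = librosa.load("song.wav", sr=None)
          |     S = librosa.stft(y)      # complex 2D array: freq x time
          |     y_back = librosa.istft(S, length=len(y))
          |     sf.write("roundtrip.wav", y_back, sr)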
        
           | rubyn00bie wrote:
           | Could you elaborate on this? I'm genuinely curious about how
           | one would do that.
        
         | o_____________o wrote:
          | Suno has select-region editing now.
        
       | vunderba wrote:
       | From the article:
       | 
       |  _> Stem Splitting: This allows one to take an existing song, and
       | split the audio into distinct tracks, such as vocals, guitar,
       | drums and bass. Demucs by Meta is an AI model for stem
       | splitting._
       | 
       | +1 for Demucs (free and open source).
       | 
       | Our band went back and used Demucs-GUI on a bunch of our really
        | old pre-DAW stuff - all we had were the final WAVs - and it did
        | a really good job splitting out drums, piano, bass, vocals, etc.
       | with the htdemucs_6s model. There was some slight bleed between
       | some of the stems but other than that it was seamless.
       | 
       | https://github.com/CarlGao4/Demucs-Gui
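        | 
        | For anyone who'd rather skip the GUI, a minimal sketch of
        | driving the same separation from Python (assumes pip install
        | demucs; the output layout is from memory, so double-check it):
        | 
        |     # Run Demucs 6-stem separation on a WAV via its CLI.
        |     # old_song.wav is a placeholder path.
        |     import subprocess
        | 
        |     subprocess.run(
        |         ["demucs", "-n", "htdemucs_6s", "old_song.wav"],
        |         check=True,
        |     )
        |     # Stems should land under separated/htdemucs_6s/old_song/
        |     # as drums.wav, bass.wav, vocals.wav, guitar.wav,
        |     # piano.wav and other.wav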
        
         | verst wrote:
          | I have used htdemucs_6s a bunch, but I prefer the 4-stem
          | model. The dedicated guitar and piano stems are usually full
          | of really bad artifacts in the 6s model. It's still useful if
          | you want to transcribe the part to sheet music, however -
          | just not useful to me in music production or as a backing
          | track.
         | 
         | My primary use is for creating backing tracks I can play piano
         | / keyboard along with (just for fun in my home). Most of the
         | time I'll just use the 4s model and will keep drums, bass and
         | vocals.
        
           | vunderba wrote:
            | Yeah, I could see that. We had better luck with the 6-stem
            | model - maybe because we had both rhythm and lead guitar in
            | the mixes - but the 4-stem version didn't work as well for
            | us.
        
             | verst wrote:
             | It probably also depends on the channel separation for the
             | individual instruments in the final mix and any effects
             | applied. A stereo chorus effect on one of the instruments
             | can really interfere with the separation from other
             | instruments from what I can tell.
             | 
             | Piano (or various keys), organ and some guitars (with
             | effects) have a lot of frequency overlap. The model
             | struggles there.
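              | 
              | A tiny illustration of why that overlap hurts (pure
              | numpy; the "piano" and "organ" here are just sine
              | stand-ins):
              | 
              |     # Two sources sharing a frequency can't be split by
              |     # frequency masking alone - the shared bin mixes both.
              |     import numpy as np
              | 
              |     SR = 8000
              |     t = np.arange(SR) / SR
              |     piano = np.sin(2 * np.pi * 440 * t)
              |     organ = (0.5 * np.sin(2 * np.pi * 440 * t)
              |              + np.sin(2 * np.pi * 660 * t))
              |     mix = piano + organ
              | 
              |     spec = np.abs(np.fft.rfft(mix))
              |     freqs = np.fft.rfftfreq(len(mix), 1 / SR)
              |     print(freqs[np.argmax(spec)])  # 440.0 - energy from
              |                                    # both sources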
        
       | xvector wrote:
        | In the future we may have music gen models that dynamically
        | generate a soundtrack to our life, based on ongoing events,
        | emotions, etc., as well as our preferences.
       | 
       | If this happens, main character syndrome may get a bit worse :)
        
         | vunderba wrote:
          | Slightly related, iMUSE was an early example of an interactive
         | music engine that mixed and matched audio to what was happening
         | on-screen in a game.
         | 
         | https://en.wikipedia.org/wiki/IMUSE
        
       | ipsum2 wrote:
       | I wonder if this article is AI generated.
       | 
       | > Vocal Synthesis: This allows one to generate new audio that
       | sounds like someone singing. One can write lyrics, as well as
       | melody, and have the AI generate an audio that can match it. You
       | could even specify how you want the voice to sound like. Google
       | has also presented models capable of vocal synthesis, such as
       | googlesingsong.
       | 
       | Google's singsong paper does the exact opposite. Given human
        | vocals, it produces a musical accompaniment.
        
         | mdp2021 wrote:
          | Given that Google is mentioned "out of the blue", that <<also>>
          | seems to indicate that the mistaken word is <<vocal>>: _[You
          | can have vocal synthesis given music as an input, and] Google
          | has also presented models capable of *music* synthesis [given
          | vocals as an input], such as googlesingsong_
        
       | chaosprint wrote:
       | I got into AI music back in 2017, kind of sparked by AlphaGo.
       | Started by looking at machine listening stuff, like Nick Collins'
        | work. I've always been really curious about AI doing live
        | coding of music.
       | 
       | In 2019, I built this thing called RaveForce
       | [github.com/chaosprint/RaveForce]. It was a fun project.
       | 
       | Back then, GANsynth was a big deal, looked amazing. But the sound
       | quality... felt a bit lossy, you know? And MIDI generation, well,
       | didn't really feel like "music generation" to me.
       | 
       | Now, I'm thinking about these things differently. Maybe the sound
       | quality thing is like MP3 at first, then it becomes "good enough"
       | - like a "retina moment" for audio? Diffusion models seem to be
       | pushing this idea too. And MIDI, if used the right way, could be
       | a really powerful tool.
       | 
        | Vocal synthesis and conversion are super cool. Feels like
       | plugins, but next level. Really useful.
       | 
       | But what I really want to see is AI understanding music from the
       | ground up. Like, a robot learning how synth parameters work. Then
        | we can do 8-bit music like the deep RL breakthroughs. Not just
        | training
       | on tons of copyrighted music, making variations, and selling it,
       | which is very cheap.
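        | 
        | As a toy version of the "learning synth parameters" idea:
        | search for the oscillator frequency that best matches a target
        | tone's spectrum (pure numpy; a stand-in for a real RL setup,
        | and every name and number here is illustrative):
        | 
        |     # Grid-search a synth parameter (oscillator frequency) to
        |     # match a target tone by spectral distance.
        |     import numpy as np
        | 
        |     SR = 16000
        |     t = np.arange(SR) / SR
        | 
        |     def synth(freq):
        |         return np.sin(2 * np.pi * freq * t)
        | 
        |     def loss(a, b):
        |         fa = np.abs(np.fft.rfft(a))
        |         fb = np.abs(np.fft.rfft(b))
        |         return np.mean((fa - fb) ** 2)
        | 
        |     target = synth(440.0)     # pretend: a recorded tone
        |     grid = np.linspace(100, 1000, 181)
        |     best = min(grid, key=lambda f: loss(synth(f), target))
        |     print(best)               # ~440.0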
        
       ___________________________________________________________________
       (page generated 2025-02-09 23:00 UTC)