[HN Gopher] Show HN: Cloning a musical instrument from 16 second...
       ___________________________________________________________________
        
       Show HN: Cloning a musical instrument from 16 seconds of audio
        
       In 2020, Magenta released DDSP [1], a machine learning algorithm /
       Python library which made it possible to generate good-sounding
       instrument synthesizers from about 6-10 minutes of data. While
       working with DDSP for a project, we realised that it is actually
       quite hard to find 6-10 minutes of clean recordings of monophonic
       instruments.  In this project, we have combined the DDSP
       architecture with a domain adaptation technique from speech
       synthesis [2]. This domain adaptation technique works by first
       pre-training our model on many different recordings from the Solos
       dataset [3] and then fine-tuning parts of the model on the new
       recording. This allows us to produce decent-sounding instrument
       synthesisers from as little as 16 seconds of target audio instead
       of 6-10 minutes.
        
       [1] https://arxiv.org/abs/2001.04643
       [2] https://arxiv.org/abs/1802.06006
       [3] https://arxiv.org/abs/2006.07931
        
       We hope to publish a paper on the topic soon.
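        
       To give a feel for the fine-tuning step, here is a toy PyTorch
       sketch (illustrative only; not our actual code or architecture,
       and all names and sizes are made up):
        
           import torch
           import torch.nn as nn
        
           SR, N_HARM = 16000, 32
        
           class TinySynth(nn.Module):
               """Toy DDSP-style decoder: a shared network maps
               (f0, loudness) to harmonic amplitudes, and a small
               per-instrument vector biases them."""
               def __init__(self):
                   super().__init__()
                   self.shared = nn.Sequential(
                       nn.Linear(2, 64), nn.ReLU(),
                       nn.Linear(64, N_HARM))
                   self.instrument = nn.Parameter(torch.zeros(N_HARM))
        
               def forward(self, f0, loud, n_samples):
                   ctrl = torch.stack([f0 / 1000.0, loud])
                   amps = torch.softmax(
                       self.shared(ctrl) + self.instrument, -1)
                   t = torch.arange(n_samples) / SR
                   k = torch.arange(1, N_HARM + 1).float()
                   freqs = f0 * k
                   # zero out partials above the Nyquist frequency
                   mask = (freqs < SR / 2).float()
                   waves = torch.sin(
                       2 * torch.pi * freqs[:, None] * t[None, :])
                   return (amps * mask) @ waves
        
           model = TinySynth()        # pretend: pre-trained on Solos
           target = torch.randn(SR)   # stand-in for the target clip
        
           # Freeze the shared decoder; fine-tune only the small
           # instrument-specific weights on the short target clip.
           for p in model.shared.parameters():
               p.requires_grad = False
           opt = torch.optim.Adam([model.instrument], lr=1e-3)
        
           for step in range(200):
               pred = model(torch.tensor(440.0), torch.tensor(0.5), SR)
               # multi-scale spectral loss, as in the DDSP paper
               loss = sum(
                   (torch.stft(pred, n, return_complex=True).abs() -
                    torch.stft(target, n, return_complex=True).abs())
                   .abs().mean()
                   for n in (256, 1024))
               opt.zero_grad()
               loss.backward()
               opt.step()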
        
       Author : abdljasser2
       Score  : 82 points
       Date   : 2022-02-25 13:59 UTC (9 hours ago)
        
 (HTM) web link (erlj.notion.site)
 (TXT) w3m dump (erlj.notion.site)
        
       | michae2 wrote:
       | Wow, this seems like it could be bigger than Auto-Tune. I wonder
       | if we'll reach a point where artists license a DDSP model of
       | their instrument or voice, rather than perform directly.
        
         | birdyrooster wrote:
         | Well, I suppose a form of this technically has already
         | happened. Sakiko Fujita is the voice actor who was employed to
         | create the samples for the Vocaloid Hatsune Miku. The copyright
         | holder currently is SEGA.
        
       | ajross wrote:
       | While it's doing a great job emulating the timbre of the notes,
       | what's interesting to me is what it's _not_ doing. Play those top
       | two passages to anyone who's spent time around sax players, and
       | they'll instantly tell you which one was the "fake".
       | 
       | Real instruments have mechanical behavior that this isn't going to
       | capture absent some kind of modeling. Different notes on a sax (to
       | take this example) are actuated by different fingers and different
       | valves and have different (often multiple) embouchures, and both
       | interact with the transitions between pairs of notes (and with the
       | dynamics with which they're played). All that complexity is
       | absolutely audible in the transitions between notes, and the ML
       | layers aren't going to have the ability to pick it up absent a
       | _much_ larger training set.
       | 
       | Basically: 16 seconds of audio is enough to get you the frequency
       | spectrum of the notes, which you can do with 3-4 lines of
       | synthesis code in an imperative regime. It's very much the "easy
       | part" of instrument synthesis.
        
         | davio wrote:
         | I played sax through college and the second passage sounds
         | legit to me.
         | 
         | Couple of interesting things:
         | 
         | - You can hear the keys and pads hitting and clicking
         | - Vibrato on the last note is very realistic
         | - Can hear the air and "spit"
         | - The timing seems human, especially as it slows down a little
         |   on the lower notes, which take more air and usually involve
         |   more awkward fingering with the pinky fingers.
         | 
         | Definitely sounds like a non-professional, probably how an
         | average high school player would play the passage.
         | 
         | It sounds like what a player would hear versus a studio
         | recording. Reminds me of being in a really reflective & tiny
         | practice room.
        
       | smaddox wrote:
       | Very cool. I wonder if you could turn something like this into
       | the ultimate keyboard synth. That would be one hell of a product.
        
       | mkr-hn wrote:
       | While interesting, I'm skeptical an algorithmic approach will ever
       | come close to a decent wavetable-based synth (using samples/noises
       | from the real instrument) or a scripted sampler instrument (like
       | Kontakt). It might help when you can't otherwise find an existing
       | synth or sample-based version of an instrument but _can_ find
       | recordings of it in use, but those cases are few and far between.
        
       | nyanpasu64 wrote:
       | Does this work on synthesized sounds (like analog or physical
       | modeling rather than samplers)? How does it handle the complexity
       | of piano timbres (notoriously difficult to synthesize)?
        
       | [deleted]
        
       | samirsd wrote:
       | could this model be applied to mimic the characteristics of a
       | guitar amp or pedal for example?
        
       | cellover wrote:
       | Why in the name of God would a site prevent me from using keyboard
       | shortcuts to go back in navigation history?!
       | 
       | I can't use Cmd+Left on a Mac. Is this a Notion thing?
        
         | capableweb wrote:
         | I didn't understand what you meant at first: clicking the back
         | button on my physical mouse, or the go-back icon in the browser,
         | made Firefox go back as expected. Then I tried Alt+Left arrow on
         | the keyboard, and it didn't work! But it wasn't just not going
         | back, like when a "too many history.pushState()" bug fucks it up
         | for you. Instead it went to the bottom of the page. Then I
         | noticed it was selecting some block in the bottom right, and that
         | it also happens when you just press the left arrow on the
         | keyboard without Alt.
         | 
         | So TLDR: bug with some selection/focus thing on the Notion
         | page. Poor execution, I rate their implementation of "static
         | notebook HTML pages" 7/10.
        
           | cellover wrote:
           | Oh right, I see. I didn't have time to debug when I noticed
           | this issue; thanks for the heads up!
           | 
           | Also, sorry for hijacking the comments here, very interesting
           | research! It would have been interesting to record the
           | original samples in the same setting; we can hear that the
           | flute has much less reverb than the first saxophone, and a
           | common room size / reverb time would help comparison.
        
       | ushakov wrote:
       | nice!
       | 
       | check out "Steerable discovery of neural audio effects" paper
       | 
       | https://csteinmetz1.github.io/steerable-nafx/
       | 
       | i'm compiling a list of research papers for our GuitarML project,
       | feel free to open a pull request/issue when your papers are
       | published!
       | 
       | https://github.com/GuitarML/mldsp-papers
        
       | skykooler wrote:
       | Have to say I'm disappointed with the chosen excerpts in the
       | "More examples" section - they do not show very much of the
       | generated instrument's ability (the last one is just a single
       | note!)
        
         | abdljasser2 wrote:
         | Thank you for the feedback! We will synthesize longer excerpts
         | in the future.
         | 
         | For the time being, there is a Colab where you can play with a
         | pretrained model.
        
           | squarefoot wrote:
           | Any chance this could turn into a MIDI-playable instrument
           | with some pre-trained models, plus the possibility to submit
           | user-generated ones?
        
       | skybrian wrote:
       | Very interesting! Is there a preprint or demo that you forgot to
       | link to? Will you be releasing source or data?
       | 
       | I'm wondering what the hardware requirements would be for real-
       | time synthesis. I work on a musical instrument project as a hobby
       | and would like a good accordion sound.
        
         | abdljasser2 wrote:
         | Hello! Thank you for the interest. There are links to a Colab
         | etc. at the bottom of the blog post!
        
           | pieterhg wrote:
           | Where is the blog post?!
        
             | mkl wrote:
             | The main HN link: https://erlj.notion.site/Neural-
             | Instrument-Cloning-from-very...
        
       | radarsat1 wrote:
       | nice, i worked on something similar [1] some years ago, but this
       | seems more sophisticated. i had a lot of difficulty with atonal
       | sounds, especially because i was trying to take a kind of "neural
       | wavetable" approach, but overall i found the results intriguing
       | nonetheless. curious to check out more recent efforts. i recall
       | the DDSP paper and assumed those ideas would apply quite nicely
       | to this problem.
       | 
       | [1] https://arxiv.org/abs/1806.09617
        
       | scrozier wrote:
       | But if one is trying to create a great sounding instrument
       | synthesizer, it is quite easy to _obtain_ 6-10 minutes of clean
       | recording. Why do you have to _find_ it? Not sure what the use
       | case is here...?
        
         | abdljasser2 wrote:
         | Hello!
         | 
         | An immediate use case would be sampling. Say you like a certain
         | sound in a song and would like to use it as a starting point
         | for your own sound patch.
         | 
         | I also believe that transfer learning has benefits for making
         | great-sounding instruments even in cases where you have access
         | to lots of data. That's my intuition at least.
         | 
         | At the very least, it saves you a lot of memory/bandwidth.
         | Instead of having one large model per instrument, you only need
         | one large model with a few extra instrument-specific weights.
        
           | al2o3cr wrote:
           |   Say you like a certain sound in a song and would like to
           |   use it as a starting point for your own sound patch.
           | 
           | As a bonus, you might be the Lucky Winner of a copyright suit
           | that eventually establishes a whole new area of case law.
           | 
           | Yeah, it would be ridiculous and unreasonable - but so was "I
           | copyrighted these three notes in a row so pay me naow" :(
        
           | syntheweave wrote:
           | As a sample-user, it would be great to have this available in
           | the toolbox.
           | 
           | Just reusing the original recording of a sample is equivalent
           | to drawing a photorealistic tracing of an image: it
           | represents a ground truth, but it's not illustrated in any
           | particular artistic direction. And this makes the multisample
           | libraries available today akin to "dry references" - they can
           | be convincing as reproductions, some of the time, but you're
           | stitching them together like a collage of photos.
           | 
           | If you throw the sample into a synthesis engine you can push
           | around the parameters, crossfade it into a loop, add some
           | envelopes, modulation and layers, and make it a uniquely
           | stylized instrument, and this is one way to take the source
           | material to a new place by forgoing some realism.
           | 
           | Doing the synthesis through style transfer helps move it in a
           | different direction: it gets outside the bounds of directly
           | sequencing performance parameters and makes the performance a
           | little more like an effect, helping to glue the sound. And I
           | think that could be really cool if applied to arbitrary
           | source material.
        
           | dimal wrote:
           | Just want to chime in and say that I would love to have this
           | ability. Will you be adding more documentation on how a
           | knowledgeable user could use your library to accomplish this?
           | The docs are kinda sparse and I'm not sure how I could
           | actually use it.
        
           | willis936 wrote:
           | This reminds me of a task in my list that has been sitting
           | there for nearly a decade:
           | 
           |     instrument FIR from song (justice - let there be light)
           | 
           | Here is the spectrogram of the sound I'm talking about:
           | 
           | https://imgur.com/kmtoMkd
           | 
           | It's pretty easy to filter out the drums since most of the
           | energy is in other bands. Looking at the spectrum again I
           | don't think a simple spectral replication will nail the sound
           | right. It looks like there is some sort of beat phenomenon
           | that isn't present at all center frequencies.
        
             | not1ofU wrote:
             | I don't think I understand what you mean, but if I do, then
             | you could look into using spleeter. It separates musical
             | stems.
             | 
             | https://news.ycombinator.com/item?id=21431071
             | 
             | https://github.com/deezer/spleeter/wiki/2.-Getting-
             | started#u...
        
               | willis936 wrote:
               | The task I gave myself was to subtract out the drum beat
               | (the song graciously gives the isolated loop before the
               | instrument comes in), then mix/baseband the instrument to
               | whatever frequency I wanted. If all went well I would
               | make a complex FIR filter that I would pass tones into.
               | 
               | This model assumes the timbre is independent of the tone,
               | but I can see now that this assumption is quite wrong and
               | something more complicated (like this ML modeling) would
               | be needed.
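               | 
               | (For reference, a crude numpy sketch of that
               | old timbre-as-filter idea, with placeholder
               | audio standing in for the actual track:)
               | 
               |     import numpy as np
               | 
               |     sr = 44100
               |     # stand-in for the isolated instrument,
               |     # drum loop already subtracted
               |     inst = np.random.randn(sr)
               | 
               |     # average magnitude spectrum -> FIR impulse
               |     # response (assumes timbre is independent
               |     # of the tone, which turned out to be wrong)
               |     spec = np.fft.rfft(inst * np.hanning(sr))
               |     fir = np.fft.irfft(np.abs(spec))[:2048]
               | 
               |     # "play" the instrument: pass a plain tone
               |     # through the filter
               |     t = np.arange(sr) / sr
               |     tone = np.sin(2 * np.pi * 220.0 * t)
               |     out = np.convolve(tone, fir, mode="same")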
        
           | capableweb wrote:
           | Where does the training start to "fade off" in terms of "time
           | spent" versus "results achieved"? It seems 1 second vs 16
           | seconds makes a dramatic difference, but what about 50 seconds
           | vs 3600 seconds (1 hour)?
        
       ___________________________________________________________________
       (page generated 2022-02-25 23:00 UTC)