[HN Gopher] Listening with LLM
       ___________________________________________________________________
        
       Listening with LLM
        
       Author : ppymou
       Score  : 73 points
       Date   : 2024-01-13 16:09 UTC (1 day ago)
        
 (HTM) web link (paul.mou.dev)
 (TXT) w3m dump (paul.mou.dev)
        
       | asymmetric wrote:
       | Very OT, but I love the style of your resume. Is the source
       | available somewhere?
        
         | ppymou wrote:
          | Haha, thanks; the HTML is on GitHub
          | https://github.com/moomou/moomou.github.io/blob/master/resum...
          | and from there you can see the imported CSS, etc. Be warned,
          | though: the resume & CSS have accumulated over the years, so
          | they are not particularly clean.
        
       | refulgentis wrote:
        | If the author is around: amazing work!!! Multimodal from scratch :)
       | 
        | I'm curious whether you have the test clip you used; I got to the
        | end and was like "wait... is that a good result? The words are
        | completely different!"
       | 
       | Then I re-read a couple times scanning carefully for references
       | to what the audio is.
       | 
       | This quote[^1] makes me think the sample is music, as that would
       | explain why the end result is good -- it's trying to describe a
       | sound file of just music, not a sound file that is a spoken word
       | version of the "ground truth":
       | 
       | [^1] "For dataset, I chose MusicCaps. I did not see any
       | convenient links to download processed/segmented audio files, so
       | I wrote a small script to download the Youtube videos."
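
        As a rough illustration of the kind of "small script" the quoted
        passage describes, here is a minimal sketch assuming yt-dlp and
        ffmpeg are installed; the CSV column names (ytid, start_s, end_s)
        follow the public MusicCaps metadata file and are assumptions, not
        the author's actual code.

            import csv
            import subprocess

            def fetch_clip(ytid, start_s, end_s, out_path):
                """Download a video's audio and cut out the captioned segment."""
                tmp = f"{ytid}.m4a"
                subprocess.run(
                    ["yt-dlp", "-f", "bestaudio[ext=m4a]", "-o", tmp,
                     f"https://www.youtube.com/watch?v={ytid}"],
                    check=True,
                )
                # Trim to the labeled window; resample to 16 kHz mono.
                subprocess.run(
                    ["ffmpeg", "-y", "-i", tmp, "-ss", str(start_s),
                     "-to", str(end_s), "-ar", "16000", "-ac", "1", out_path],
                    check=True,
                )

            with open("musiccaps-public.csv") as f:
                for row in csv.DictReader(f):
                    fetch_clip(row["ytid"], row["start_s"], row["end_s"],
                               f"clips/{row['ytid']}.wav")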
        
         | ppymou wrote:
          | Thanks for reading, and yes, you are right: the input audio
          | files are clips of music.
          | 
          | MusicCaps [1] is a dataset containing pairs of music audio and
          | a natural-language description of each clip; the reason the
          | result is good, imo, is that the trained model was able to
          | generate a description sharing features with the ground truth.
         | 
         | [1] https://huggingface.co/datasets/google/MusicCaps
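
        For reference, a minimal sketch of what those (audio, caption)
        pairs look like, assuming the Hugging Face datasets library can
        read the dataset's metadata directly; the column names (ytid,
        start_s, end_s, caption) are taken from the dataset card, and the
        audio itself still has to be fetched from YouTube separately, as
        in the script sketch above.

            from datasets import load_dataset

            # Caption/metadata pairs only; no audio is bundled.
            ds = load_dataset("google/MusicCaps", split="train")

            row = ds[0]
            print(row["ytid"], row["start_s"], row["end_s"])  # which segment
            print(row["caption"])                             # description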
        
       | modeless wrote:
        | I love this research direction! Multimodal is the future, and the
        | possibilities of gluing together pretrained models are
        | underexplored. As tinkerers, it's something we can do at home that
        | doesn't require a datacenter full of H100s or a terabyte dataset.
       | 
       | Crazy that you were able to trace your issues to bad RAM! I
       | probably would have torn all my hair out long before suspecting
       | bad RAM.
       | 
        | I imagine that Whisper-based embeddings wouldn't be great for
        | analyzing music, but they should be excellent for allowing LLMs
        | to understand speech. Although hooking Whisper up to LLMs via
        | text may already seem sufficient, I think feeding the LLM
        | embeddings instead (or in addition) would let it understand much
        | more about speech: cadence, tone, accent, etc. I think something
        | like this will be necessary for speech agents in the medium term.
        | It should allow an LLM to respond much more naturally to speech
        | input, vs. just giving it the text output of a speech-to-text
        | system. Maybe it could be done on the output side too, hooking it
        | up to the internals of a text-to-speech system for an end-to-end
        | audio-to-audio chatbot!
       | 
       | Do you have a Twitter account or some other way to follow your
       | progress?
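
        A rough sketch of the "embeddings instead of text" idea described
        above, assuming Hugging Face transformers: the Whisper encoder
        output is passed through a trainable linear projection and
        prepended to the LLM's input as soft prompt tokens. The model
        names (whisper-small, gpt2), the single linear projector, and the
        prompt are illustrative assumptions, not the blog post's exact
        architecture.

            import torch
            import torch.nn as nn
            from transformers import (AutoModelForCausalLM, AutoTokenizer,
                                      WhisperFeatureExtractor, WhisperModel)

            whisper = WhisperModel.from_pretrained("openai/whisper-small")
            feat_ext = WhisperFeatureExtractor.from_pretrained("openai/whisper-small")
            llm = AutoModelForCausalLM.from_pretrained("gpt2")
            tok = AutoTokenizer.from_pretrained("gpt2")

            # Trainable bridge from Whisper encoder width to LLM embedding width.
            proj = nn.Linear(whisper.config.d_model, llm.config.hidden_size)

            def audio_prefix(waveform, sr=16000):
                """Encode raw audio with Whisper, project into LLM embedding space."""
                feats = feat_ext(waveform, sampling_rate=sr, return_tensors="pt")
                with torch.no_grad():
                    enc = whisper.encoder(feats.input_features).last_hidden_state
                return proj(enc)  # (1, frames, hidden_size): soft prompt tokens

            def respond(waveform, prompt="Describe this audio:"):
                prefix = audio_prefix(waveform)
                text_ids = tok(prompt, return_tensors="pt").input_ids
                text_emb = llm.get_input_embeddings()(text_ids)
                inputs = torch.cat([prefix, text_emb], dim=1)
                mask = torch.ones(inputs.shape[:2], dtype=torch.long)
                out = llm.generate(inputs_embeds=inputs, attention_mask=mask,
                                   max_new_tokens=64)
                return tok.decode(out[0], skip_special_tokens=True)

        In a setup like this, only the projection (and perhaps adapters in
        the LLM) would typically be trained, with the audio/caption pairs
        serving as supervision.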
        
       ___________________________________________________________________
       (page generated 2024-01-14 23:00 UTC)