[HN Gopher] Listening with LLM
___________________________________________________________________
Listening with LLM
Author : ppymou
Score : 73 points
Date : 2024-01-13 16:09 UTC (1 day ago)
(HTM) web link (paul.mou.dev)
(TXT) w3m dump (paul.mou.dev)
| asymmetric wrote:
| Very OT, but I love the style of your resume. Is the source
| available somewhere?
| ppymou wrote:
| Haha thanks; the HTML is on GitHub
| https://github.com/moomou/moomou.github.io/blob/master/resum...
| and from there you can see the imported CSS etc.; be warned
| though, the resume & CSS have accumulated over the years, so
| they are not particularly clean
| refulgentis wrote:
| If author is around: amazing work!!! Multimodal from scratch :)
|
| I'm curious if you have the test clip you used; I got to the
| end and was like "wait... is that a good result? The words are
| completely different!"
|
| Then I re-read it a couple of times, scanning carefully for
| references to what the audio is.
|
| This quote[^1] makes me think the sample is music, as that would
| explain why the end result is good -- it's trying to describe a
| sound file of just music, not a sound file that is a spoken word
| version of the "ground truth":
|
| [^1] "For dataset, I chose MusicCaps. I did not see any
| convenient links to download processed/segmented audio files, so
| I wrote a small script to download the Youtube videos."
| ppymou wrote:
| Thanks for reading, and yes, you are right: the input audio
| files are clips of music.
|
| MusicCaps [1] is a dataset containing pairs of music audio and
| natural-language descriptions of the clips; the reason the
| result is good, imo, is that the trained model was able to
| generate a description that shares features with the ground
| truth.
|
| [1] https://huggingface.co/datasets/google/MusicCaps
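|
| For anyone wanting to reproduce the download step, the idea is
| roughly the following (an untested sketch, not my actual script;
| the ytid / start_s / end_s / caption field names are what I
| remember from the dataset card, so double-check them, and you
| need yt-dlp and ffmpeg installed):
|
|       import subprocess
|       from datasets import load_dataset
|
|       ds = load_dataset("google/MusicCaps", split="train")
|
|       def fetch_clip(row, out_dir="clips"):
|           ytid, start, end = row["ytid"], row["start_s"], row["end_s"]
|           tmp = f"{out_dir}/{ytid}_full.m4a"
|           out = f"{out_dir}/{ytid}.wav"
|           # download the audio-only stream from YouTube
|           subprocess.run(
|               ["yt-dlp", "-f", "bestaudio", "-o", tmp,
|                f"https://www.youtube.com/watch?v={ytid}"],
|               check=True,
|           )
|           # cut the labeled 10s segment and resample to 16 kHz mono wav
|           subprocess.run(
|               ["ffmpeg", "-y", "-i", tmp, "-ss", str(start),
|                "-to", str(end), "-ar", "16000", "-ac", "1", out],
|               check=True,
|           )
|           return out, row["caption"]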
| modeless wrote:
| I love this research direction! Multimodal is the future, and
| the possibilities of gluing together pretrained models are
| underexplored. As tinkerers, it's something we can do at home
| that doesn't require a datacenter full of H100s or a terabyte
| dataset.
|
| Crazy that you were able to trace your issues to bad RAM! I
| probably would have torn all my hair out long before suspecting
| bad RAM.
|
| I imagine that Whisper-based embeddings wouldn't be great for
| analyzing music, but they should be excellent for allowing LLMs
| to understand speech. Although it might seem trivial to hook up
| Whisper to LLMs already via text, I think using embeddings
| instead (or in addition) would allow the LLM to understand much
| more about speech: cadence, tone, accent, etc. I think something
| like this will be necessary for speech agents in the medium
| term. It should allow an LLM to respond much more naturally to
| speech input, vs. just giving it the text output of a
| speech-to-text system. Maybe it could be done on the output side
| too, hooking it up to the internals of a text-to-speech system
| for an end-to-end audio-to-audio chatbot!
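|
| Concretely I'm imagining wiring roughly like the sketch below
| (just my guess, untested, with GPT-2 standing in for the LLM and
| the projection layer left untrained; in practice you'd train the
| adapter on paired audio/text):
|
|       import torch
|       import torch.nn as nn
|       from transformers import (WhisperModel, WhisperProcessor,
|                                 AutoModelForCausalLM, AutoTokenizer)
|
|       whisper = WhisperModel.from_pretrained("openai/whisper-small")
|       processor = WhisperProcessor.from_pretrained("openai/whisper-small")
|       llm = AutoModelForCausalLM.from_pretrained("gpt2")  # stand-in LLM
|       tok = AutoTokenizer.from_pretrained("gpt2")
|
|       # trainable adapter from Whisper's hidden size to the LLM's
|       proj = nn.Linear(whisper.config.d_model, llm.config.hidden_size)
|
|       def audio_prefixed_logits(waveform, prompt):
|           feats = processor(waveform, sampling_rate=16000,
|                             return_tensors="pt").input_features
|           with torch.no_grad():
|               audio_h = whisper.encoder(feats).last_hidden_state
|           audio_emb = proj(audio_h)            # (1, frames, hidden)
|           ids = tok(prompt, return_tensors="pt").input_ids
|           text_emb = llm.get_input_embeddings()(ids)
|           # prepend the audio "tokens" to the text embeddings
|           inputs = torch.cat([audio_emb, text_emb], dim=1)
|           return llm(inputs_embeds=inputs).logits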
|
| Do you have a Twitter account or some other way to follow your
| progress?
___________________________________________________________________
(page generated 2024-01-14 23:00 UTC)