[HN Gopher] Application of Locality Sensitive Hashing to audio f...
___________________________________________________________________
Application of Locality Sensitive Hashing to audio fingerprinting
Author : pncnmnp
Score : 30 points
Date : 2024-04-01 11:46 UTC (1 day ago)
(HTM) web link (santhoshhari.github.io)
(TXT) w3m dump (santhoshhari.github.io)
| abetusk wrote:
| Does anyone have experience with how well this works?
| johanvts wrote:
| How well it works for identifying music depends on how well you
| can construct the feature vector, which is not covered here. If
| you map every song into the same small subset of the space, LSH
| will struggle. Map them to well-separated places and it will
| shine.
| bravura wrote:
| This is wrong, or misleading at best.
|
| It's a known difficult open problem to create a good audio
| representation (feature vector). Good, in the case of LSH,
| would be that representation distance correlates well with
| human perception.
|
| For something like Shazam, you want this representation to be
| invariant to minor transformations of the audio. That's the
| interesting problem.
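The separation point debated above can be sketched with random-hyperplane LSH. The vectors, dimensions, and hyperplane count below are made up purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

def lsh_signature(v, planes):
    # One bit per hyperplane: which side of the plane does v fall on?
    return tuple((planes @ v > 0).astype(int))

planes = rng.standard_normal((6, 8))   # 6 random hyperplanes over 8-dim vectors
song_a = rng.standard_normal(8)        # a made-up "feature vector"
song_b = -song_a                       # maximally separated from song_a
near_a = song_a + 0.01 * rng.standard_normal(8)  # tiny perturbation of song_a

sig_a = lsh_signature(song_a, planes)
sig_b = lsh_signature(song_b, planes)
sig_n = lsh_signature(near_a, planes)

# A well-separated vector flips every bit, landing in a distant bucket ...
print(sum(x != y for x, y in zip(sig_a, sig_b)))  # -> 6
# ... while a near-duplicate usually shares most (often all) bits.
print(sum(x != y for x, y in zip(sig_a, sig_n)))
```

If the representation crams every song into one region, all signatures collide and the buckets stop discriminating, which is the failure mode described above.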
| pncnmnp wrote:
| Check out Steve Tjoa and Iran Roman's website
| https://musicinformationretrieval.com/
|
| They have a bunch of Colab notebooks where they explore various
| music IR ideas, including LSH.
|
| GitHub:
| https://github.com/iranroman/musicinformationretrieval.com/
| bequanna wrote:
| Does anyone know how this approach differs from the approach
| taken by the popular(ish) dejavu project?
|
| https://github.com/worldveil/dejavu
| willseth wrote:
| Last I checked, Dejavu doesn't use LSH or any approximations. It
| simply queries for exact matches, "aligns" them, and determines
| whether there is enough signal amid the noise to consider it a
| match.
| bequanna wrote:
| Ahh, I see.
|
| The approach here is just getting vector nearest neighbors.
|
| I guess I don't see how this could work like "Shazam" when
| you're comparing the vector of some sample to the vector for
| the whole song.
| willseth wrote:
| Oh, no, it's not for the whole song. For each song, you
| take N samples, each sample gets converted into a
| fingerprint, and all of the fingerprints are stored in a
| database. Then when you query, you perform the same sample-
| to-fingerprint computation, and query the database for
| matching fingerprints. The database will return many false
| positives, but only 1 will "align", meaning the relative
| timing of the fingerprints returned is in sync with an
| actual song in the database. Once the number of aligned
| fingerprints reaches a set threshold, you can stop and
| return it as a result. That's how it performs subset
| matching.
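The "alignment" step described above can be sketched as offset voting. The hashes, offsets, and song names here are invented for illustration; dejavu's real fingerprints come from spectrogram peak pairs:

```python
from collections import Counter

# Toy fingerprint database: hash -> list of (song_id, offset) pairs.
db = {
    "h1": [("song_a", 10), ("song_b", 3)],
    "h2": [("song_a", 12), ("song_b", 40)],
    "h3": [("song_a", 15), ("song_c", 7)],
}

def best_alignment(query):
    """query: list of (hash, offset_in_sample) pairs. The true song shows up
    repeatedly at one consistent delta = db_offset - query_offset; the false
    positives scatter across random deltas."""
    votes = Counter()
    for h, q_off in query:
        for song_id, s_off in db.get(h, []):
            votes[(song_id, s_off - q_off)] += 1
    return votes.most_common(1)[0]

# This sample starts 10 time-units into song_a, so all its deltas equal 10.
sample = [("h1", 0), ("h2", 2), ("h3", 5)]
print(best_alignment(sample))  # -> (('song_a', 10), 3)
```

Each hash also matches other songs, but only song_a accumulates votes at a single delta, which is the "in sync with an actual song" test described above.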
| bequanna wrote:
| Gotcha.
|
| So you take n overlapping samples of the song, fingerprint
| and vectorize each, then do the same thing for the captured
| sample and get its nearest neighbors.
|
| I've used dejavu for a personal project and it is quite
| fast. Is the LSH approach somehow more
| efficient/attractive now because we've made strides in
| vector nearest neighbors search?
| willseth wrote:
| Almost, but dejavu doesn't search nearest-neighbors. It
| searches for exact matches, so its robustness is
| essentially based on oversampling. Nearest neighbor
| search might be a way to achieve similar robustness with
| fewer fingerprints - assuming it results in fewer false
| negatives. LSH is one way to approach nearest neighbor
| search. So in theory you could modify dejavu with an LSH-
| based NN search and potentially increase robustness with
| less data, modulo whatever tradeoffs are necessary for
| computing NN search.
___________________________________________________________________
(page generated 2024-04-02 23:01 UTC)