[HN Gopher] Show HN: Semantic Grep - A Word2Vec-powered search tool
___________________________________________________________________
Show HN: Semantic Grep - A Word2Vec-powered search tool
Much improved new version. Search for words similar to the query.
For example, "death" will find "death", "dying", "dead",
"killing"... Incredibly useful for exploring large text datasets
where exact matches are too restrictive.
Author : arunsupe
Score : 102 points
Date : 2024-07-27 18:02 UTC (4 hours ago)
(HTM) web link (github.com)
(TXT) w3m dump (github.com)
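The core idea of the tool can be sketched in a few lines: compare the query word's embedding against each candidate word's embedding by cosine similarity, and report matches above a threshold. The vectors below are toy values for illustration; the real tool loads a trained word2vec model.

```python
import math

# Toy embeddings standing in for a trained word2vec model (hypothetical values).
vectors = {
    "death":   [0.9, 0.1, 0.0],
    "dying":   [0.8, 0.2, 0.1],
    "killing": [0.7, 0.3, 0.2],
    "banana":  [0.0, 0.1, 0.9],
}

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def semantic_match(query, word, threshold=0.7):
    """True if word's vector is close enough to the query's vector."""
    if query not in vectors or word not in vectors:
        return query == word  # fall back to exact matching
    return cosine(vectors[query], vectors[word]) >= threshold

print(semantic_match("death", "dying"))   # True  (close in vector space)
print(semantic_match("death", "banana"))  # False (far apart)
```

With a real model, "death" would pull in "dying", "dead", "killing", and so on, exactly as the post describes.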
| sitkack wrote:
| Your post might have been flagged because of your example?
| arunsupe wrote:
| Oh. Please explain. The example was entirely arbitrary. Should I
| change it?
| threatofrain wrote:
| IMO no. There are enough people who will vouch.
| gunalx wrote:
| Really cool. Often just want to fuzzy search for a word, and this
| would be useful. Can it do filenames as well? Or do I need to
| pipe something like ls first.
| fbdab103 wrote:
| I might have a work use case for which this would be perfect.
|
| Having no experience with word2vec, some reference performance
| numbers would be great. If I have one million PDF pages, how long
| is that going to take to encode? How long will it take to search?
| Is it CPU only or will I get a huge performance benefit if I have
| a GPU?
| 9dev wrote:
| As someone working extensively with word2vec: I would recommend
| to set up Elasticsearch. It has support for vector embeddings,
| so you can process your PDF documents once, write the word2vec
| embeddings and PDF metadata into an index, and search that in
| milliseconds later on. Doing live vectorisation is neat for
| exploring data, but using Elasticsearch will be much more
| convenient in actual products!
| michaelmior wrote:
| I would personally vote for Postgres and one of the many
| vector indexing extensions over Elasticsearch. I think
| Elasticsearch can be more challenging to maintain. Certainly
| a matter of opinion though. Elasticsearch is a very
| reasonable choice.
| SkyPuncher wrote:
| Vector stuff doesn't take much. It's essentially the same
| computational time as regular old text search.
| kristopolous wrote:
| No. Look at the implementation. A GPU isn't going to give you
| huge gains in this one.
| onli wrote:
| That's totally clever and sounds really useful. And it's one of
| those ideas where you go "Why didn't I think of that" when
| stumbling over the materials, word2vec in this case.
| pgroves wrote:
| How fast is it?
| drdeca wrote:
| Very cool!
|
| Do I understand correctly that this works by splitting each line
| into words, and using the embedding for each word?
|
| I wonder whether it might be feasible to search by semantics of
| longer sequences of text, using some language model (like, one of
| the smaller ones, like GPT2-small or something?). Like, so that
| if you were searching for "die", then "kick the bucket" and "buy
| the farm", could also match somehow? Though, I'm not sure what
| vector you would use to do the dot product with, when there is a
| sequence of tokens, each with associated key vectors for each
| head at each layer, rather than a single vector associated with a
| word.. Maybe one of the encoder-decoder models rather than the
| decoder only models?
|
| Though, for things like grep, one probably wants things to be
| very fast and as lightweight as feasible, which I imagine is much
| more the case with word vectors (as you have here) than it would
| be using a whole transformer model to produce the vectors.
|
| Maybe if one wanted to catch words that aren't separated
| correctly, one could detect if the line isn't composed of well-
| separated words, and if so, find all words that appear as a
| substring of that line? Though maybe that would be too slow?
| throwawaydummy wrote:
| I wanna meet the person who greps die, kick the bucket and buy
| the farm lol
|
| Are models like mistral there yet in terms of token per second
| generation to run a grep over millions of files?
| randcraw wrote:
| This would be really useful if it could take a descriptive phrase
| or a compound phrase (like SQL 'select X and Y and Z') and match
| against the semantic cluster(s) that the query forms. IMO that's
| the greatest failing of today's search engines -- they're all
| one-hit wonders.
| samatman wrote:
| This is a good idea. I'm going to offer some unsolicited feedback
| here:
|
| The configuration thing is unclear to me. I _think_ that
| "current directory" means "same directory as the binary", but it
| _could_ mean pwd.
|
| Neither of those is good: configuration doesn't belong where the
| binaries go, and it's obviously wrong to look for configs in the
| working directory.
|
| I suggest checking $XDG_CONFIG_HOME, and defaulting to
| `~/.config/sgrep/config.toml`.
|
| That extension is not a typo, btw. JSON is unpleasant to edit for
| configuration purposes, TOML is not.
|
| Or you could use an ENV variable directly, if the only thing that
| needs configuring is the model's location, that would be fine as
| well.
|
| If that were the on ramp, I'd be giving feedback on the program
| instead. I do think it's a clever idea and I'd like to try it
| out.
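The config lookup samatman suggests is a few lines of code. A minimal sketch, assuming the tool were named sgrep with a config.toml as proposed in the comment (both names are the commenter's suggestion, not the project's current layout):

```python
import os
from pathlib import Path

def config_path(app="sgrep", filename="config.toml"):
    """Resolve the config file per the XDG Base Directory convention:
    use $XDG_CONFIG_HOME if set, otherwise fall back to ~/.config."""
    base = os.environ.get("XDG_CONFIG_HOME")
    root = Path(base) if base else Path.home() / ".config"
    return root / app / filename

print(config_path())  # e.g. ~/.config/sgrep/config.toml on most setups
```

An environment variable pointing straight at the model file, as the comment also suggests, would be an even simpler on-ramp.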
| rasengan0 wrote:
| very cool, led me to find
| https://www.cs.helsinki.fi/u/jjaakkol/sgrep.html and semgrep is
| taken so another symlink it is, w2vgrep?
| low_tech_punk wrote:
| not to be confused with https://github.com/semgrep/semgrep
___________________________________________________________________
(page generated 2024-07-27 23:00 UTC)