[HN Gopher] Show HN: SeaGOAT - local, "AI-based" grep for semant...
___________________________________________________________________
Show HN: SeaGOAT - local, "AI-based" grep for semantic code search
Author : kantord
Score : 194 points
Date : 2023-09-20 12:13 UTC (10 hours ago)
(HTM) web link (github.com)
(TXT) w3m dump (github.com)
| ithkuil wrote:
| Interesting.
|
| What would it take to support other programming languages?
| jarulraj wrote:
| Neat AI app!
|
| 1. What feature extractor is used to derive code embeddings?
|
| 2. Would support for more complex queries be useful inside the
| app?
|
|        -- Retrieve a subset of code snippets
|        SELECT name FROM snippets
|        WHERE file_name LIKE "%py" AND author_name LIKE "John%"
|        ORDER BY Similarity(CodeFeatureExtractor(Open(query)),
|                            CodeFeatureExtractor(data))
|        LIMIT 5;
| kantord wrote:
| Embeddings are done using ChromaDB.
|
| Support for more complex queries could be useful, but probably
| not through a query language, since that would make it harder
| to use free-form text input.
|
| You can already use it via an API:
| https://kantord.github.io/SeaGOAT/0.27.x/server/#understandi...
| so probably the best way to support more complex queries would
| be additional query parameters, with the same flags/options
| also exposed through the CLI.
| kantord wrote:
| By the way, I am also working on a web version that will let
| you search multiple repositories at the same time; you will be
| able to self-host it at work or run it locally on your machine.
| https://github.com/kantord/SeaGOAT-web
|
| That could provide a nicer interactive experience for more
| complex queries.
| dylanjcastillo wrote:
| For those curious about it, ChromaDB uses all-MiniLM-L6-v2[0]
| from Sentence Transformers[1] by default.
|
| [0] https://docs.trychroma.com/embeddings#default-all-
| minilm-l6-...
|
| [1] https://www.sbert.net/docs/pretrained_models.html
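|
| If you want to see roughly what that does under the hood, here
| is a minimal sketch (assuming the sentence-transformers package
| is installed; the snippets and query are made up):
|
|        from sentence_transformers import SentenceTransformer, util
|
|        # The same model ChromaDB uses as its default embedding function
|        model = SentenceTransformer("all-MiniLM-L6-v2")
|
|        snippets = [
|            "def read_config(path): return json.load(open(path))",
|            "class UserRepository: ...",
|        ]
|        query = "where do we load configuration files?"
|
|        snippet_vecs = model.encode(snippets, normalize_embeddings=True)
|        query_vec = model.encode(query, normalize_embeddings=True)
|
|        # Cosine similarity between the query and each snippet
|        print(util.cos_sim(query_vec, snippet_vecs))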
| jasonjmcghee wrote:
| I've been test driving a similar one https://github.com/sturdy-
| dev/semantic-code-search
|
| But yours has a more permissive license!
|
| I also had to modify it a bit to allow for the line endings I
| needed. Frustratingly, it doesn't allow specifying a path and
| often returns tests instead of code.
| [deleted]
| MisterTea wrote:
| Is the naming a coincidence or some sort of strange homage?
| Because I can't help thinking GOATsea.
| hollowpython wrote:
| Does anyone know a tool like this but for arbitrary PDFs?
| freedmand wrote:
| Semantra! Shared it yesterday on HN
| https://github.com/freedmand/semantra
| freckletonj wrote:
| If you're ok working in a text editor, UniteAI works on pdfs,
| youtube transcripts, code repos, web pages, local documents,
| etc. The nice thing about the editor is once it's done
| retrieval, you can hit another keycombo to send retrieved
| passages to an LLM (local, or chatgpt), and ask questions or
| favors about it (such as summarization, or formatting changes).
|
| https://github.com/freckletonj/uniteai
| kantord wrote:
| By the way, PDF support could probably be added to SeaGOAT
| itself by adding a layer that translates the PDF files to text
| files, plus some additional changes to make sure that the page
| number is also included in the results.
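|
| A minimal sketch of what that layer could look like (this is
| not in SeaGOAT today; pypdf is an assumed dependency and the
| function name is made up):
|
|        from pypdf import PdfReader
|
|        def pdf_to_text_lines(path):
|            """Flatten a PDF into (page_number, line) pairs so the
|            page number can be carried through to search results."""
|            reader = PdfReader(path)
|            for page_number, page in enumerate(reader.pages, start=1):
|                text = page.extract_text() or ""
|                for line in text.splitlines():
|                    if line.strip():
|                        yield page_number, line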
| smoe wrote:
| Looks very neat! Currently processing the repo I'm working on.
|
| Can the generated database be easily shared within the team so
| that not everyone has to run the initial processing of the repo,
| which looks like it will take a couple of hours on my laptop?
| reddit_clone wrote:
| It appears (from a brief glance) you can run it on a shared
| server. Only client runs on laptop.
| m3kw9 wrote:
| Why not embed names of functions and variables to form a vector
| so you are language agnostic? Are you limited by the language
| parser that embeds the names?
| hackncheese wrote:
| My work has 10ish repos we use, and it looks like this needs to
| be run in a specific git repo. Is there a way for this tool to
| run in a parent directory that contains all the repos we use,
| with the same functionality?
| reddit_clone wrote:
| I had the same question. With modern (!?) microservices-type
| development, functionality is spread all over the place across
| several repos. It would be great if SeaGOAT supported multiple
| repos.
| kantord wrote:
| That could be added as a new feature; feel free to open an
| issue for it.
| GranPC wrote:
| Cool project! Just trying it out now - does it support CUDA
| acceleration? I'm running it on a rather large project and it
| claims it's got over 140k "tasks left in the queue", and I see no
| indicator of activity on nvidia-smi.
| eddywebs wrote:
| Cool beans! Does it work with Python-based codebases only, or
| could others use it too? Like Java or C#?
|
| Thank you for sharing.
| freckletonj wrote:
| Hey OP, this looks awesome!
|
| I've done the same but was very disappointed with the stock
| sentence embedding results. You can get an embedding for any
| arbitrary string, but the cosine similarity used for nearest
| neighbor lookup gives a lot of false positives/negatives.
|
| *There are 2 reasons:*
|
| 1. All embeddings from these models occupy a narrow cone of the
| total embedding space. Check out the cosine similarity of any 2
| arbitrary strings (quick check below this list). It'll be
| incredibly high, even for gibberish and sensible sentences.
|
| 2. The datasets these SentenceTransformers are trained on don't
| include much code, and certainly not intentionally. At least I
| haven't found a code-focused one yet.
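|
| A quick check of the narrow-cone effect (assuming the
| sentence-transformers package; the strings are made up):
|
|        from sentence_transformers import SentenceTransformer, util
|
|        model = SentenceTransformer("all-MiniLM-L6-v2")
|        a = model.encode("qwxzv blorp frumble nixplat")
|        b = model.encode("The cache is invalidated on every write.")
|
|        # Gibberish vs. a real sentence: the score is still far above
|        # zero, showing how little of the space the model actually uses.
|        print(util.cos_sim(a, b))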
|
| *There are solutions I've tried with mixed results:*
|
| 1. Embedding "whitening" forces all the embeddings to be nearly
| orthogonal, i.e. decorrelated. If you truncate the whitened
| embeddings and keep just the top n eigenvalues, you get a sort
| of semantic compression that improves results (rough sketch
| after this list).
|
| 2. Train a super light neural net on your codebase's embeddings
| (takes seconds to train with a few layers) to improve nearest
| neighbor results. I suspect this helps because it re-biases the
| space toward distinguishing just among your codebase's
| embeddings.
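|
| A rough numpy sketch of the whitening step (my own shorthand,
| not taken from any particular library):
|
|        import numpy as np
|
|        def whiten(embeddings, k=128):
|            """Decorrelate embedding dimensions and keep only the
|            top-k components (a form of semantic compression)."""
|            mu = embeddings.mean(axis=0, keepdims=True)
|            centered = embeddings - mu
|            cov = np.cov(centered, rowvar=False)
|            u, s, _ = np.linalg.svd(cov)
|            w = u[:, :k] / np.sqrt(s[:k])   # truncated whitening matrix
|            return centered @ w, mu, w
|
|        # A new query vector gets the same transform before lookup:
|        # query_white = (query_vec - mu) @ w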
|
| *There are solutions from the literature I am working on next
| that I find conceptually more promising:*
|
| 1. Chunk the codebase, and ask an LLM on each chunk to "generate
| a question to which this code is the answer". Then do natural
| language lookup on the question, and return the code for it.
|
| 2. You have your code lookup query. Ask an LLM to "generate a
| fabricated answer to this question". Then embed its answer, and
| use that to do your lookup.
|
| 3. Use the AST of the code to further inform embeddings (sketch
| below).
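|
| For (3), a stdlib-only Python sketch of AST-informed chunking
| (nothing here is from UniteAI; the function name is made up):
|
|        import ast
|
|        def ast_chunks(source):
|            """Split a Python module into function/class chunks, each
|            prefixed with its name and docstring so the embedding sees
|            structure, not just raw body text."""
|            tree = ast.parse(source)
|            for node in ast.walk(tree):
|                if isinstance(node, (ast.FunctionDef,
|                                     ast.AsyncFunctionDef,
|                                     ast.ClassDef)):
|                    doc = ast.get_docstring(node) or ""
|                    body = ast.get_source_segment(source, node) or ""
|                    yield f"{node.name}\n{doc}\n{body}"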
|
| I have this in my project UniteAI [1] and would love it if you
| cared to collab on improving it (either directly, or via your
| repo, with UniteAI then depending on it). I'm actually trying
| to collab more, so this offer goes to anyone! I think for the
| future of AI to be owned by _us_, we get there through these
| local-first projects and by building strong communities.
|
| [1] https://github.com/freckletonj/uniteai
| [deleted]
| billconan wrote:
| If the code doesn't contain comments, can it still work?
|
| Will it generate code comments for indexing using a language
| model? Will that be expensive (assuming GPT-3)?
| FloatArtifact wrote:
| I would love to plumb this up with a speech recognition engine
| via commands as well as free dictation. I can see this being
| useful for navigating code semantically.
| [deleted]
| freckletonj wrote:
| UniteAI brings together speech recognition and document / code
| search. The major difference is your UI is your preferred text
| editor.
|
| https://github.com/freckletonj/uniteai
| signa11 wrote:
| Thankfully Perl is no longer in vogue.
| reddit_clone wrote:
| :-(
|
| There is still a lot of Perl code around. Something like this
| would be super useful.
| kantord wrote:
| Actually, I'm also working on a small web GUI for it; it could
| be fairly easy to add speech recognition to the web version!
|
| https://github.com/kantord/SeaGOAT-web
| nxobject wrote:
| I'm looking forward to trying a little experiment with this: I'm
| going to run it on the Linux kernel tree, sight unseen, knowing
| nothing about the structure of the Linux kernel - will it help
| me navigate it for the first time?
|
| Edit: processing chunks; see you tomorrow...
| [deleted]
| artisanspam wrote:
| What are the limitations on what languages this supports?
| kantord wrote:
| Currently it is hard-limited to these file extensions:
| https://github.com/kantord/SeaGOAT/blob/ebfde263b970ddecdddf...
|
| This is to avoid wasting time processing files that cannot lead
| to good results. If you want to try it with a different
| programming language, please fork the repo, add your file
| formats, and test whether it gives meaningful results; if it
| does, please submit a pull request.
|
| Other than that, one limitation is that the model used under
| the hood is trained on a dataset filtered to a specific list of
| programming languages. So without changing the model as well,
| support for other languages could be subpar. At the moment the
| model is all-MiniLM-L6-v2; here's a detailed summary of the
| dataset:
| https://huggingface.co/sentence-transformers/all-MiniLM-L6-v...
| tinix wrote:
| Are the extensions configurable or truly hard-coded?
| rockostrich wrote:
| Based on the code, they're hardcoded. It seems like it'd be
| pretty straightforward to add an override flag though.
| kantord wrote:
| It is hardcoded at the moment, but I am willing to merge
| code that adds an option to override it.
|
| A flag would probably solve it for some users, but the best
| way would be to add a configuration option. At the moment
| there is no config file/.rc file support in SeaGOAT, but
| there is an issue to add it and I'm happy to merge pull
| requests:
| https://github.com/kantord/SeaGOAT/issues/180
| kantord wrote:
| Update: I changed the hardcoded set of languages to
| support the following:
|
| Text files (*.txt), Markdown (*.md), Python (*.py),
| C (*.c, *.h), C++ (*.cpp, *.hpp), TypeScript (*.ts,
| *.tsx), JavaScript (*.js, *.jsx), HTML (*.html),
| Go (*.go), Java (*.java), PHP (*.php), Ruby (*.rb)
|
| https://github.com/kantord/SeaGOAT#what-programming-
| langauge...
| kantord wrote:
| Also, I plan to add features that incorporate a "dumb"
| analysis of the codebase in order to avoid spamming the
| results with mostly irrelevant matches such as import
| statements or decorators. Those features would be
| language-dependent, so support would need to be added for
| each language.
| la64710 wrote:
| Just curious, did you use any LLM to generate code for this? BTW,
| really awesome work!
| retrofuturism wrote:
| This would make a useful (nvim) Telescope plugin. Looks super
| interesting.
___________________________________________________________________
(page generated 2023-09-20 23:00 UTC)