[HN Gopher] Show HN: SeaGOAT - local, "AI-based" grep for semant...
       ___________________________________________________________________
        
       Show HN: SeaGOAT - local, "AI-based" grep for semantic code search
        
       Author : kantord
       Score  : 194 points
       Date   : 2023-09-20 12:13 UTC (10 hours ago)
        
 (HTM) web link (github.com)
 (TXT) w3m dump (github.com)
        
       | ithkuil wrote:
       | Interesting.
       | 
       | What would it take to support other programming languages?
        
       | jarulraj wrote:
       | Neat AI app!
       | 
       | 1. What feature extractor is used to derive code embeddings?
       | 
       | 2. Would support for more complex queries be useful inside the
       | app?
       | 
       |     --- Retrieve a subset of code snippets
       |     SELECT name
       |     FROM snippets
       |     WHERE file_name LIKE "%py" AND author_name LIKE "John%"
       |     ORDER BY Similarity(
       |         CodeFeatureExtractor(Open(query)),
       |         CodeFeatureExtractor(data)
       |     )
       |     LIMIT 5;
        
         | kantord wrote:
         | embeddings are done using ChromaDB
         | 
         | support for more complex queries could be useful, but probably
         | not using a query language since that would make it more
         | difficult to use free-form text input.
         | 
         | You can already use it via an API:
         | https://kantord.github.io/SeaGOAT/0.27.x/server/#understandi...
         | so probably the best way to add support for more complex
         | queries would be to have additional query parameters, and to
         | expose those flags/options/features through the CLI as well.
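         | 
         | For example, roughly along these lines (the endpoint path,
         | port and parameter names here are just a sketch for
         | illustration, not the documented API):
         | 
         |     # sketch only: endpoint, port and parameter names are
         |     # placeholders, not the real SeaGOAT server API
         |     import requests
         | 
         |     response = requests.get(
         |         "http://localhost:8000/query/database%20connection",
         |         params={"author": "John", "path_glob": "*.py",
         |                 "limit": 5},
         |     )
         |     print(response.json())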
        
           | kantord wrote:
           | btw I am also working on a web version that will let you
           | search multiple repositories at the same time, and you will
           | be able to self-host it at work or run it locally on your
           | machine: https://github.com/kantord/SeaGOAT-web
           | 
           | so that could provide a nicer interactive experience for more
           | complex queries
        
           | dylanjcastillo wrote:
           | For those curious about it, ChromaDB uses all-MiniLM-L6-v2[0]
           | from Sentence Transformers[1] by default.
           | 
           | [0] https://docs.trychroma.com/embeddings#default-all-
           | minilm-l6-...
           | 
           | [1] https://www.sbert.net/docs/pretrained_models.html
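           | 
           | For reference, that default can be reproduced directly with
           | sentence-transformers (minimal sketch):
           | 
           |     # minimal sketch: the model ChromaDB uses for its
           |     # default embedding function
           |     from sentence_transformers import SentenceTransformer, util
           | 
           |     model = SentenceTransformer("all-MiniLM-L6-v2")
           |     embeddings = model.encode([
           |         "def add(a, b): return a + b",
           |         "function that sums two numbers",
           |     ])
           |     # cosine similarity between the snippet and the query
           |     print(util.cos_sim(embeddings[0], embeddings[1]))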
        
       | jasonjmcghee wrote:
       | I've been test driving a similar one https://github.com/sturdy-
       | dev/semantic-code-search
       | 
       | But yours has a more permissive license!
       | 
       | I also had to modify it a bit to handle the line endings I
       | needed. It frustratingly doesn't allow specifying a path, and it
       | often returns tests instead of code.
        
         | [deleted]
        
       | MisterTea wrote:
       | Is the naming a coincidence or some sort of strange homage?
       | Because I can't help thinking GOATsea.
        
       | hollowpython wrote:
       | Does anyone know a tool like this but for arbitrary PDFs?
        
         | freedmand wrote:
         | Semantra! Shared it yesterday on HN
         | https://github.com/freedmand/semantra
        
         | freckletonj wrote:
         | If you're ok working in a text editor, UniteAI works on pdfs,
         | youtube transcripts, code repos, web pages, local documents,
         | etc. The nice thing about the editor is once it's done
         | retrieval, you can hit another keycombo to send retrieved
         | passages to an LLM (local, or chatgpt), and ask questions or
         | favors about it (such as summarization, or formatting changes).
         | 
         | https://github.com/freckletonj/uniteai
        
         | kantord wrote:
         | btw pdf support could probably be added to SeaGOAT itself by
         | adding a layer that translates the PDF files to text files,
         | plus some additional changes to make sure that the page number
         | is also included in the results
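         | 
         | Something roughly like this for the translation layer
         | (untested sketch using pypdf; one text file per page so the
         | results can point back to a page number):
         | 
         |     # untested sketch: dump each PDF page to its own text file
         |     from pathlib import Path
         |     from pypdf import PdfReader
         | 
         |     def pdf_to_text_files(pdf_path: str, out_dir: str) -> None:
         |         out = Path(out_dir)
         |         out.mkdir(parents=True, exist_ok=True)
         |         reader = PdfReader(pdf_path)
         |         for number, page in enumerate(reader.pages, start=1):
         |             text = page.extract_text() or ""
         |             name = f"{Path(pdf_path).stem}_page_{number}.txt"
         |             (out / name).write_text(text)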
        
       | smoe wrote:
       | Looks very neat! Currently processing the repo I'm working on.
       | 
       | Can the generated database be easily shared within the team so
       | that not everyone has to run the initial processing of the repo,
       | which looks like it will take a couple of hours on my laptop?
        
         | reddit_clone wrote:
         | It appears (from a brief glance) that you can run it on a
         | shared server. Only the client runs on the laptop.
        
       | m3kw9 wrote:
       | Why not embed names of functions and variables to form a vector
       | so you are language agnostic? Are you limited by the language
       | parser that embeds the names?
        
       | hackncheese wrote:
       | My work has 10ish repos we use, and it looks like this needs to
       | be run in a specific git repo. Is there a way for this tool to
       | run in a parent directory that contains all of those repos, with
       | the same functionality?
        
         | reddit_clone wrote:
         | I had the same question. With modern (!?) microservices type
         | development, functionality is spread all over the place in
         | several repos. It would be great if SeaGOAT supports multiple
         | repos.
        
         | kantord wrote:
         | That could be added as a new feature; feel free to open an
         | issue for it
        
       | GranPC wrote:
       | Cool project! Just trying it out now - does it support CUDA
       | acceleration? I'm running it on a rather large project and it
       | claims it's got over 140k "tasks left in the queue", and I see no
       | indicator of activity on nvidia-smi.
        
       | eddywebs wrote:
       | Cool beans! Does it work with Python-based codebases only, or
       | could others use it too? Like Java or C#?
       | 
       | Thank you for sharing.
        
       | freckletonj wrote:
       | Hey OP, this looks awesome!
       | 
       | I've done the same but was very disappointed with the stock
       | sentence embedding results. You can get any arbitrary embedding,
       | but then the cosine similarity used for nearest neighbor lookup
       | gives a lot of false pos/negs.
       | 
       | *There are 2 reasons:*
       | 
       | 1. All embeddings from these models occupy a narrow cone of the
       | total embedding space. Check out the cos sim of any 2 arbitrary
       | strings: it'll be incredibly high, even for gibberish and
       | sensical sentences (quick check after this list).
       | 
       | 2. The datasets these SentenceTransformers are trained on don't
       | include much code, and certainly not intentionally. At least I
       | haven't found a code-focused one yet.
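       | 
       | To check point 1 yourself, a quick sketch (using all-MiniLM-
       | L6-v2 only as an example model):
       | 
       |     # sketch: print the cos sim of two unrelated strings
       |     from sentence_transformers import SentenceTransformer, util
       | 
       |     model = SentenceTransformer("all-MiniLM-L6-v2")
       |     a, b = model.encode(["qwzx flurb grommet blarp",
       |                          "the invoice was paid late"])
       |     print(util.cos_sim(a, b))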
       | 
       | *There are solutions I've tried with mixed results:*
       | 
       | 1. embedding "whitening" forces all the embeddings to be nearly
       | orthogonal, meaning decorrelated. If you truncate the whitened
       | embeddings, keeping just the dimensions with the top n
       | eigenvalues, you get a sort of semantic compression that
       | improves results (rough sketch after this list).
       | 
       | 2. train a super light neural net on your codebase's embeddings
       | (takes seconds to train with a few layers) to improve nearest
       | neighbor results. I suspect this helps because it rebiases
       | learning to distinguish just among your codebase's embeddings.
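       | 
       | The whitening step from (1) boils down to something like this
       | (sketch with numpy; n_dims is whatever you truncate to):
       | 
       |     # sketch: whiten sentence embeddings, keep top n_dims
       |     import numpy as np
       | 
       |     def whiten(embeddings: np.ndarray, n_dims: int) -> np.ndarray:
       |         # embeddings has shape (num_sentences, dim)
       |         mu = embeddings.mean(axis=0, keepdims=True)
       |         centered = embeddings - mu
       |         cov = centered.T @ centered / len(embeddings)
       |         u, s, _ = np.linalg.svd(cov)
       |         # directions with the largest eigenvalues, scaled so
       |         # the result is decorrelated with unit variance
       |         w = u[:, :n_dims] / np.sqrt(s[:n_dims])
       |         return centered @ w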
       | 
       | *There are solutions from the literature I am working on next
       | that I find conceptually more promising:*
       | 
       | 1. Chunk the codebase, and ask an LLM on each chunk to "generate
       | a question to which this code is the answer". Then do natural
       | language lookup on the question, and return the code for it.
       | 
       | 2. You have your code lookup query. Ask an LLM to "generate a
       | fabricated answer to this question". Then embed its answer, and
       | use that to do your lookup (sketched after this list).
       | 
       | 3. We use the AST of the code to further inform embeddings.
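       | 
       | The shape of (2) is roughly this (sketch; ask_llm stands in for
       | whatever completion function you have available):
       | 
       |     # sketch: embed a fabricated answer instead of the raw query
       |     from sentence_transformers import SentenceTransformer, util
       | 
       |     model = SentenceTransformer("all-MiniLM-L6-v2")
       | 
       |     def search(query, snippets, ask_llm):
       |         fake_answer = ask_llm(
       |             "Generate a fabricated answer to this question: "
       |             + query
       |         )
       |         scores = util.cos_sim(model.encode(fake_answer),
       |                               model.encode(snippets))[0]
       |         return snippets[int(scores.argmax())]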
       | 
       | I have this in my project UniteAI [1] and would love it if you
       | cared to collab on improving it (either directly, or via your
       | repo, with UniteAI then depending on it). I'm actually trying to
       | collab more, so this offer goes to anyone! I think for the
       | future of AI to be owned by _us_, we get there through these
       | local-first projects and by building strong communities.
       | 
       | [1] https://github.com/freckletonj/uniteai
        
         | [deleted]
        
       | billconan wrote:
       | If the code doesn't contain comments, can it still work?
       | 
       | Will it generate code comments for indexing using a language
       | model? Will that be expensive (assuming GPT-3 is used)?
        
       | FloatArtifact wrote:
       | I would love to plumb this up with a speech recognition engine
       | via commands as well as free dictation. I can see this being
       | useful for navigating code semantically.
        
         | [deleted]
        
         | freckletonj wrote:
         | UniteAI brings together speech recognition and document / code
         | search. The major difference is your UI is your preferred text
         | editor.
         | 
         | https://github.com/freckletonj/uniteai
        
         | signa11 wrote:
         | thankfully perl is no longer in vogue.
        
           | reddit_clone wrote:
           | :-(
           | 
           | There is still a lot of Perl code around. Something like this
           | would be super useful.
        
         | kantord wrote:
         | actually I'm also working on a small web GUI for it; it could
         | be fairly easy to add speech recognition to the web version!
         | 
         | https://github.com/kantord/SeaGOAT-web
        
       | nxobject wrote:
       | I'm looking forward to trying a little experiment with this: I'm
       | going to run it on the Linux kernel tree, sight unseen, knowing
       | nothing about the structure of the Linux kernel. Will it help me
       | navigate it for the first time?
       | 
       | Edit: processing chunks; see you tomorrow...
        
         | [deleted]
        
       | artisanspam wrote:
       | What are the limitations on what languages this supports?
        
         | kantord wrote:
         | Currently it is hard-limited to these file extensions:
         | https://github.com/kantord/SeaGOAT/blob/ebfde263b970ddecdddf...
         | 
         | This is to avoid wasting time processing files that cannot
         | lead to good results. If you want to try it for a different
         | programming language, please fork the repo, add your file
         | formats, and test whether it gives meaningful results; if it
         | does, please submit a pull request.
         | 
         | Other than that, one limitation is that it uses a model under
         | the hood that is trained on a dataset filtered for a specific
         | list of programming languages. So without changing the model
         | as well, support for other languages could be subpar. At the
         | moment the model is all-MiniLM-L6-v2; here's a detailed
         | summary of the dataset:
         | https://huggingface.co/sentence-transformers/all-MiniLM-L6-v...
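         | 
         | The extension check itself is just an allowlist, conceptually
         | something like this (illustrative only, not the actual SeaGOAT
         | code; the set here is abbreviated):
         | 
         |     # illustrative only: skip files outside the allowlist
         |     from pathlib import Path
         | 
         |     SUPPORTED_EXTENSIONS = {".py", ".js", ".md", ".txt"}
         | 
         |     def should_index(path: str) -> bool:
         |         return Path(path).suffix in SUPPORTED_EXTENSIONS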
        
           | tinix wrote:
           | Are the extensions configurable, or truly hardcoded?
        
             | rockostrich wrote:
             | Based on the code, they're hardcoded. It seems like it'd be
             | pretty straightforward to add an override flag though.
        
             | kantord wrote:
             | it is hardcoded at the moment, but I am willing to merge
             | code that adds the option to override.
             | 
             | Also, while a flag would probably solve it for some
             | users, the best way would be to add a configuration
             | option. At the moment there is no config file/.rc file
             | support in SeaGOAT, but there is an issue to add it and
             | I'm happy to merge pull requests:
             | https://github.com/kantord/SeaGOAT/issues/180
        
               | kantord wrote:
               | update: I changed the hardcoded set of languages to
               | support the following:
               | 
               | Text Files (*.txt), Markdown (*.md), Python (*.py),
               | C (*.c, *.h), C++ (*.cpp, *.hpp), TypeScript (*.ts,
               | *.tsx), JavaScript (*.js, *.jsx), HTML (*.html),
               | Go (*.go), Java (*.java), PHP (*.php), Ruby (*.rb)
               | 
               | https://github.com/kantord/SeaGOAT#what-programming-
               | langauge...
        
           | kantord wrote:
           | also I plan to add features that incorporate a "dumb"
           | analysis of the codebase in order to avoid flooding the
           | results with mostly irrelevant matches such as import
           | statements or decorators (rough sketch below). Those
           | features would be language-dependent, so support would need
           | to be added for each language
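           | 
           | For Python, that "dumb" analysis could be as simple as
           | down-ranking lines that match a few patterns (hypothetical
           | sketch, not implemented):
           | 
           |     # hypothetical sketch: demote boilerplate-looking lines
           |     import re
           | 
           |     LOW_VALUE_PATTERNS = [
           |         re.compile(r"^\s*import\s"),
           |         re.compile(r"^\s*from\s+\S+\s+import\s"),
           |         re.compile(r"^\s*@\w+"),  # decorators
           |     ]
           | 
           |     def penalty(line: str) -> float:
           |         hit = any(p.search(line) for p in LOW_VALUE_PATTERNS)
           |         return 0.5 if hit else 0.0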
        
       | la64710 wrote:
       | Just curious, did you use any LLM to generate code for this?
       | BTW, really awesome work!
        
       | retrofuturism wrote:
       | This would make a useful (nvim) Telescope plugin. Looks super
       | interesting.
        
       ___________________________________________________________________
       (page generated 2023-09-20 23:00 UTC)