[HN Gopher] Code search is hard
       ___________________________________________________________________
        
       Code search is hard
        
       Author : stevekrouse
       Score  : 132 points
       Date   : 2024-04-10 18:15 UTC (4 hours ago)
        
 (HTM) web link (blog.val.town)
 (TXT) w3m dump (blog.val.town)
        
       | semiquaver wrote:
        | Be careful with trigram indexes. At least in the Postgres 10
        | era, they caused severe index bloat for frequently updated
        | tables.
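        | 
        | For anyone unfamiliar, the setup in question looks roughly like
        | this (a minimal sketch with psycopg2; the table, column, and DSN
        | are hypothetical, while pg_trgm and gin_trgm_ops are the real
        | Postgres extension and operator class):
        | 
        |     # Sketch: trigram-indexed code search in Postgres.
        |     import psycopg2
        | 
        |     conn = psycopg2.connect("dbname=codesearch")
        |     with conn, conn.cursor() as cur:
        |         cur.execute("CREATE EXTENSION IF NOT EXISTS pg_trgm")
        |         cur.execute(
        |             "CREATE TABLE IF NOT EXISTS vals "
        |             "(id serial PRIMARY KEY, code text)")
        |         # The GIN trigram index is the part that can bloat
        |         # under frequent updates.
        |         cur.execute(
        |             "CREATE INDEX IF NOT EXISTS vals_code_trgm "
        |             "ON vals USING gin (code gin_trgm_ops)")
        |         # ILIKE (and regex) predicates can then use the index.
        |         cur.execute(
        |             "SELECT id FROM vals WHERE code ILIKE %s",
        |             ("%toString%",))
        |         print(cur.fetchall())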
        
         | peter_l_downs wrote:
         | Interesting, do you know anywhere I can easily read more about
         | this? (I will do my own research, too.)
        
           | boyter wrote:
            | It's a result of trigrams themselves. For example,
            | "searchcode" (please ignore the plug, it's just the example
            | I had to hand) goes from 1 thing you would need to index to
            | 8:
            | 
            |     "searchcode" -> [sea, ear, arc, rch, chc, hco, cod, ode]
            | 
            | As a result the index rapidly becomes larger than you would
            | expect.
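            | 
            | A quick sketch of that decomposition (plain Python, nothing
            | Postgres-specific):
            | 
            |     # A term of length n yields n - 2 trigrams, so the
            |     # index grows much faster than you might expect.
            |     def trigrams(term: str) -> list[str]:
            |         return [term[i:i + 3] for i in range(len(term) - 2)]
            | 
            |     print(trigrams("searchcode"))
            |     # ['sea', 'ear', 'arc', 'rch', 'chc', 'hco', 'cod', 'ode']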
        
       | healeycodes wrote:
        | When a val is deployed on Val Town, my understanding is that it's
       | parsed/compiled. At that point, can you save the parts of the
       | program that people might search for? Names of imports,
       | functions, variables, comments, etc.
        
         | MH15 wrote:
          | A val is just TypeScript, no? So unless they are also storing
          | the AST, it would be JavaScript and that's it.
        
       | nox101 wrote:
       | any nuggets here?
       | 
       | https://github.blog/2023-02-06-the-technology-behind-githubs...
        
         | SomaticPirate wrote:
         | Would also recommend the related video
         | https://www.youtube.com/watch?v=CqZA_KmygKw
        
         | boyter wrote:
         | Yes, although the lack of detail about the sparse grams is
         | frustrating.
        
       | skybrian wrote:
       | It seems like some of their gists have documentation attached and
       | maybe that's enough? I'm not sure I'm all that interested in
       | seeing undocumented gists in search results.
        
       | peter_l_downs wrote:
       | Surprised not to see Livegrep [0] on the list of options. Very
       | well-engineered technology; the codebase is clean (if a little
       | underdocumented on the architecture side) and you should be able
       | to index your code without much difficulty. Built with Bazel
       | (~meh, but useful if you don't have an existing cpp toolchain all
       | set up) and there are prebuilt containers you can run. Try that
       | first.
       | 
       | By the way, there's a demo running here for the linux kernel, you
       | can try it out and see what you think:
       | https://livegrep.com/search/linux
       | 
       | EDIT: by the way, "code search" is deeply underspecified. Before
       | trying to compare all these different options, you really would
       | benefit from writing down all the different types of queries you
        | think your users will want to ask, including _why they want to
        | run that query_ and _what results they'd expect_.
       | Building/tuning search is almost as difficult a product problem
       | as it is an engineering problem.
       | 
       | [0] https://github.com/livegrep/livegrep
        
         | isker wrote:
         | When I investigated using livegrep for code search at work, it
         | really struggled to scale to a large number of repositories. At
         | least at the time (a few years ago) indexing in livegrep was a
         | monolithic operation: you index all repos at once, which
         | produces one giant index. This does not work well once you're
         | past a certain threshold.
         | 
         | I also recall that the indexes it produces are pretty
         | heavyweight in terms of memory requirements, but I don't have
         | any numbers on hand to justify that claim.
         | 
         | Zoekt (also mentioned in TFA) has the correct properties in
         | this regard. Except in niche configurations that are probably
         | only employed at sourcegraph, each repo is (re)indexed
         | independently and produces a separate set of index files.
         | 
         | But its builtin web UI left much to be desired (especially
         | compared to livegrep), so I built one:
         | https://github.com/isker/neogrok.
        
           | peter_l_downs wrote:
           | I like this better than livegrep. I haven't actually operated
           | either zoekt OR livegrep before, but I'll probably start with
           | zoekt+neogrok next time I want to stand up a codesearch page.
           | Thanks for building and sharing this!
        
       | ayberk wrote:
       | It indeed is hard, and a good code search platform makes life so
       | much easier. If I ever leave Google, the internal code search is
       | for sure going to be the thing I miss the most. It's so well
       | integrated into how everything else works (blaze target finding,
       | guice bindings etc), I can't imagine my life without it.
       | 
        | I'm reminded to appreciate it even more every time I use
        | GitHub's search. Not that it's bad; it's just inherently so much
        | harder to build a generalized code search platform.
        
         | peter_l_downs wrote:
         | If you ever leave you can use Livegrep, which was based on
         | code-search work done at Google. I personally don't use it
         | right now but it's great and will probably meet all your needs.
         | 
         | [0] https://github.com/livegrep/livegrep
        
           | init wrote:
           | I've used both Code Search and Livegrep. No, Livegrep does
           | not even come close to what Code Search can do.
           | 
           | Sourcegraph is the closest thing I know of.
        
             | isker wrote:
              | Agreed. There are some public building blocks available
              | (e.g. Kythe or Meta's Glean), but having something generic
             | that produces the kind of experience you can get on
             | cs.chromium.org seems impossible. You need such bespoke
             | build integration across an entire organization to get
             | there.
             | 
             | Basic text search, as opposed to navigation, is all you'll
             | get from anything out of the box.
        
               | init wrote:
               | In a past job I built a code search clone on top of
               | Kythe, Zoekt and LSP (for languages that didn't have
               | bazel integration). I got help from another colleague to
                | make the UI based on Monaco. We created a demo that many
                | people loved, but we didn't productionize it for a few
                | reasons (it was an unfunded hackathon project, and the
                | company was considering another solution when it already
                | had Livegrep).
               | 
               | Producing the Kythe graph from the bazel artifacts was
               | the most expensive part.
               | 
               | Working with Kythe is also not easy as there is no
               | documentation on how to run it at scale.
        
               | isker wrote:
               | Very cool. I tried to do things with Kythe at $JOB in the
               | past, but gave up because the build (really, the many
               | many independent builds) precluded any really useful
               | integration.
               | 
               | I did end up making a nice UI for vanilla Zoekt, as I
               | mentioned elsewhere: https://github.com/isker/neogrok.
        
             | tayo42 wrote:
             | Is there like a summary of what's missing from public
             | attempts and what makes it so much better?
        
               | sdesol wrote:
                | The short answer is context. The reason Google's
                | internal code search is so good is that it's tied into
                | their build system. This means that when you search, you
                | know exactly which files to consider. Without that
                | context, you are making an educated guess about which
                | files to consider.
        
               | riku_iki wrote:
               | How exactly integration with build system helps Google?
               | Maybe you could give specific example?..
        
               | sdesol wrote:
                | If you want to build a product with a build system, you
                | need to tell it what source to include. With this
                | information, you know what files to consider, and if you
                | are dealing with a statically typed language like C or
                | C++, you have build artifacts that can tell you where the
                | implementation was defined. All of this takes the
                | guesswork out of answering questions like "What foo()
                | implementation was used?".
               | 
               | If all you know are repo branches, the best you can do is
               | return matches from different repo branches with the
               | hopes that one of them is right.
               | 
               | Edit: I should also add that with a build system, you
               | know what version of a file to use.
        
               | isker wrote:
               | Try clicking around
               | https://source.chromium.org/chromium/chromium/src, which
               | is built with Kythe (I believe, or perhaps it's using
               | something internal to Google that Kythe is the open
               | source version of).
               | 
               | By hooking into C++ compilation, Kythe is giving you
               | things like _macro-aware_ navigation. Instead of trying
               | to process raw source text off to the side, it's using
               | the same data the compiler used to compile the code in
               | the first place. So things like cross-references are
               | "perfect", with no false positives in the results: Kythe
               | knows the difference between two symbols in two different
               | source files with the same name, whereas a search engine
               | naively indexing source text, or even something with
               | limited semantic knowledge like tree sitter, cannot
               | perfectly make the distinction.
        
               | j2kun wrote:
                | Google builds all the code in its monorepo continuously,
                | and the built artifacts are available to the search.
                | Open source tools are never going to incur the cost of
                | actually building all the code they index.
        
             | birktj wrote:
              | I see most replies here are mentioning that the build
              | integration is what is mainly missing in the public tools.
              | I wonder if Nix and nixpkgs could be used here? Nix is a
              | language-agnostic build system, and with nixpkgs it has
              | build instructions for a massive number of packages.
              | Artifacts for all packages are also available via Hydra.
             | 
             | Nix should also have enough context so that for any project
             | it can get the source code of all dependencies and
             | (optionally) all build-time dependencies.
        
               | jeffbee wrote:
               | Build integration is not the main thing that is missing
               | between Livegrep and Code Search. The main thing that is
               | missing is the semantic index. Kythe knows the difference
               | between this::fn(int) and this::fn(double) and
               | that::fn(double) and so on. So you can find all the
               | callers of the nullary constructor of some class, without
               | false positives of the callers of the copy constructor or
               | the move constructor. Livegrep simply doesn't have that
               | ability at all. Livegrep is what it says it is on the
               | box: grep.
        
               | humanrebar wrote:
               | The build system coherence provided by a monorepo with a
               | single build system is what makes you understand
               | this::fn(double) as a single thing. Otherwise, you will
               | get N different mostly compatible but subtly different
               | flavors of entities depending on the build flavor,
               | combinations of versioned dependencies, and other things.
        
               | jeffbee wrote:
               | Sure. Also, if you eat a bunch of glass, you will get a
               | stomach ache. I have no idea why anyone uses a polyrepo.
        
               | humanrebar wrote:
               | The problem with monorepos is that they're so great that
               | everyone has a few.
        
           | keybored wrote:
           | > If you ever leave you can use Livegrep, which was based on
           | code-search work done at Google.
           | 
            | If I've learned anything from the fainting spells that
            | I-work-at-X commenters have over their internal tools on HN:
            | no, the public/OSS variant is always a mere _shadow_ of _the
            | real thing_.
        
             | jeffbee wrote:
             | It's not just that. Livegrep isn't just a pale imitation of
             | something inside Google. It's totally unrelated in
             | implementation, capabilities, and use case.
        
             | scubbo wrote:
             | I suspect you're being sarcastic - but can confirm that
             | being nearly two years out of Amazon, I still miss its in-
             | house CD system nearly every day. I've actively looked
             | around for OSS replacements and very few come anywhere
             | close.
             | 
             | (I would be _delighted_ for someone to "Umm actually" me by
             | providing a great product!)
        
               | mdaniel wrote:
                | My experience has been that any of these in-house things
                | do not adapt well to the _high chaos_ of external
                | environments: if there are 3 companies, one will find 9
                | systems and processes in use, thus making "one size fits
                | all" a fantasy.
               | 
               | But, I'll bite: what made the CD system so dreamy, and
               | what have you evaluated thus far that fall short?
        
           | jeffbee wrote:
           | Just want to note that Livegrep, its antecedent "codesearch",
           | and other things that are basically grep bear no resemblance
           | to that which a person working at Google calls "Code Search".
        
       | chasil wrote:
       | Oracle has USER/ALL/DBA_SOURCE views, and all of the PL/SQL
       | (SQL/PSM) code that has been loaded into the database is
       | presented there. These are all cleartext visible unless they have
       | been purposefully obfuscated.
       | 
        | These views have columns for the owner, object name,
        | LINE[NUMBER], and TEXT[VARCHAR2(4000)], and you can use LIKE or
        | regexp_like() on any of the retained source code.
       | 
       | I wonder if EnterpriseDB implements these inside of Postgres,
       | and/or if they are otherwise available as an extension.
       | 
        | Since most of SQL/PSM came from Oracle anyway, these would be an
        | obviously desirable feature.
       | 
       | https://en.wikipedia.org/wiki/SQL/PSM
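        | 
        | Querying those views really is just SQL over source text; a
        | sketch with the python-oracledb driver (the credentials and
        | search pattern here are hypothetical; ALL_SOURCE and its columns
        | are standard Oracle):
        | 
        |     # Sketch: grep stored PL/SQL via the ALL_SOURCE view.
        |     import oracledb
        | 
        |     conn = oracledb.connect(user="scott", password="tiger",
        |                             dsn="localhost/XEPDB1")
        |     cur = conn.cursor()
        |     cur.execute(
        |         "SELECT owner, name, type, line, text FROM all_source "
        |         "WHERE REGEXP_LIKE(text, :pat)",
        |         pat=r"to_char\(")
        |     for owner, name, typ, line, text in cur:
        |         print(f"{owner}.{name} ({typ}) {line}: {text.rstrip()}")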
        
       | worldsayshi wrote:
        | I suppose using something like tree-sitter to get a consistent
        | abstract syntax tree to work with would be a good starting
        | point. And then try building a custom analyzer (to use
        | Elasticsearch lingo) with that?
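        | 
        | A sketch of what that first step might look like (the
        | py-tree-sitter API has shifted across versions; this assumes a
        | recent release plus the tree-sitter-python grammar package, and
        | indexing identifiers is just one choice of what to extract):
        | 
        |     # Sketch: pull identifiers out of source as index terms.
        |     import tree_sitter_python
        |     from tree_sitter import Language, Parser
        | 
        |     parser = Parser(Language(tree_sitter_python.language()))
        |     src = b"def handler(request):\n    return request.body\n"
        |     tree = parser.parse(src)
        | 
        |     def identifiers(node):
        |         if node.type == "identifier":
        |             yield node.text.decode()
        |         for child in node.children:
        |             yield from identifiers(child)
        | 
        |     print(sorted(set(identifiers(tree.root_node))))
        |     # ['body', 'handler', 'request']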
        
         | azornathogron wrote:
         | Another option is to start with Kythe, which is Google's own
         | open source framework for getting a uniform cross-language
         | semantic layer: https://kythe.io/
         | 
         | Worth looking at as a source of inspiration and design ideas
         | even if you don't want to use it itself.
        
         | samatman wrote:
          | Might be overkill unless you're looking to do semantic search.
          | I've thought about what a search DSL for code would look like;
          | it's challenging to embody a query like "method which takes an
          | Int64 and has a variable idx in it" in something compact and
          | memorable.
          | 
          | But a tokenizer seems like a good place to start; I think
          | that's the right granularity for this kind of application.
          | You'd want to do some chunking so that foo.bar doesn't find
          | every foo and every bar, that sort of thing. Code search is, as
          | the title says, a hard problem. But a language-aware token
          | stream, the one you'd get from the lexer, is probably where one
          | should start in building the database.
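          | 
          | A sketch of that chunking at the lexer level (plain Python;
          | the tokenizer regex is an illustrative stand-in for a real
          | lexer):
          | 
          |     # Keep dotted chunks like foo.bar intact as phrases, and
          |     # also emit their parts, so a query for "foo.bar" doesn't
          |     # match every file containing "foo" and "bar" separately.
          |     import re
          | 
          |     TOKEN = re.compile(
          |         r"[A-Za-z_][A-Za-z0-9_]*(?:\.[A-Za-z_][A-Za-z0-9_]*)*")
          | 
          |     def index_terms(line: str) -> set[str]:
          |         terms = set()
          |         for chunk in TOKEN.findall(line):
          |             terms.add(chunk)                # "foo.bar"
          |             terms.update(chunk.split("."))  # "foo", "bar"
          |         return terms
          | 
          |     print(sorted(index_terms("idx = foo.bar(payload)")))
          |     # ['bar', 'foo', 'foo.bar', 'idx', 'payload']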
        
           | worldsayshi wrote:
            | Sure, you should definitely not try to do the overkill use
            | case first, but I would assume that tree-sitter can emit
            | "just" tokens as well? Getting the flexibility and control
            | of a tool like tree-sitter should allow you to quickly throw
            | away stuff like comments and keywords if you want, since you
            | can do syntax-aware filtering.
            | 
            | Then again, I haven't used tree-sitter; I can just imagine
            | that this is a strength of it.
        
       | IshKebab wrote:
       | I would use Hound.
        
       | jackbravo wrote:
       | Would LLM vector embeddings work in this context? I'm guessing
       | they should since they are very good at understanding code.
        
         | CityOfThrowaway wrote:
          | Yes, but generating that index would be expensive.
        
           | anonymousDan wrote:
           | Why exactly? You mean to construct the embeddings or to embed
           | the queries?
        
       | sqs wrote:
       | I'm at Sourcegraph (mentioned in the blog post). We obviously
       | have to deal with massive scale, but for anyone starting out
       | adding code search to their product, I'd recommend not starting
       | with an index and just doing on-the-fly searching until that does
       | not scale. It actually will scale well for longer than you think
       | if you just need to find the first N matches (because that result
       | buffer can be filled without needing to search everything
       | exhaustively). Happy to chat with anyone who's building this kind
       | of thing, including with folks at Val Town, which is awesome.
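        | 
        | A minimal sketch of that on-the-fly approach (the paths, file
        | extension, and limit are hypothetical):
        | 
        |     # Brute-force scan that stops once the result buffer is
        |     # full, so most queries never touch the whole corpus.
        |     from pathlib import Path
        | 
        |     def search(root: str, needle: str, limit: int = 50):
        |         hits = []
        |         for path in Path(root).rglob("*.ts"):
        |             text = path.read_text(errors="ignore")
        |             for lineno, line in enumerate(text.splitlines(), 1):
        |                 if needle in line:
        |                     hits.append((path, lineno, line.strip()))
        |                     if len(hits) >= limit:
        |                         return hits  # early exit
        |         return hits
        | 
        |     for path, lineno, line in search("vals", "toString"):
        |         print(f"{path}:{lineno}: {line}")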
        
         | isker wrote:
         | And when you're ready to do indexed search, Zoekt (over which
         | Sourcegraph graciously took maintainership a while ago) is the
         | best way to do it that I've found. After discounting both
         | Livegrep and Hound (they both struggled to perform in various
          | dimensions with the amount of stuff we wanted indexed, Hound
          | more so than Livegrep), we migrated to Zoekt from a
         | (necessarily) very old and creaky deployment of OpenGrok and
         | it's night and day, both in terms of indexing performance and
         | search performance/ergonomics.
         | 
         | Sourcegraph of course adds many more sophisticated features on
         | top of just the code search that Zoekt provides.
        
         | morgante wrote:
         | I've been surprised at how far you can get without indexing.
         | 
         | Ex. I always assume we'll need to add an index to speed up
         | GritQL (https://github.com/getgrit/gritql), but we've gotten
         | pretty far with doing search entirely on the fly.
        
           | worldsayshi wrote:
           | What does 'on the fly' entail here?
        
             | simonw wrote:
             | I'm going to guess brute force - scan everything for the
             | search term, rather than trying to use an index.
             | 
              | I'm always amazed at how fast ripgrep (rg) can brute force
              | its way through hundreds of MBs of source code.
        
               | morgante wrote:
               | Yes, exactly. When doing a search, we parse and search
               | every file without any indexing.
               | 
               | Of course, it could still be sped up considerably with an
               | index but brute force is surprisingly effective (we use
               | some of the same techniques/crates as ripgrep).
        
         | klysm wrote:
         | I apply this thinking to lots of problems. Do the dumb thing
         | that involves the least state and prove we need to lean more
          | towards memory for speed. It's much simpler to keep things
          | correct when nothing is cached.
        
         | hinkley wrote:
          | There was someone doing temporal databases who was compressing
          | blocks on disk and doing streaming decompress-and-search on
          | them. Things in L2 cache go very, very fast.
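          | 
          | The idea, very roughly (zlib standing in for whatever codec
          | they used; the block size is hypothetical):
          | 
          |     # Store fixed-size blocks compressed; decompress and scan
          |     # one block at a time so the working set stays
          |     # cache-sized. A real version would also handle matches
          |     # spanning block boundaries.
          |     import zlib
          | 
          |     def compress_blocks(data: bytes, size: int = 64 * 1024):
          |         return [zlib.compress(data[i:i + size])
          |                 for i in range(0, len(data), size)]
          | 
          |     def search_blocks(blocks, needle: bytes):
          |         for n, block in enumerate(blocks):
          |             if needle in zlib.decompress(block):
          |                 yield n
          | 
          |     corpus = b"function ring() { return 1; }\n" * 10_000
          |     blocks = compress_blocks(corpus)
          |     print(list(search_blocks(blocks, b"ring"))[:5])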
        
       | bawolff wrote:
       | Surprised that hound https://github.com/hound-search/hound isn't
       | mentioned. I thought it was the leader of open source solutions
       | in this space.
       | 
       | I've been using Wikimedia's instance (
       | https://codesearch.wmcloud.org/search/ ) and have generally been
       | pretty happy with what it provides.
        
         | isker wrote:
         | Hound has made an interesting choice to not bound searches.
         | https://codesearch.wmcloud.org/search/?q=test&files=&exclude...
         | produces an ajax request that (for me) took 13s to produce a
         | 55MB JSON response, and then takes many more seconds to render
         | into the DOM.
         | 
         | Properly bounding search response sizes was one of the things I
         | had to ensure Zoekt could do in its JSON APIs that I use in
         | neogrok: https://github.com/sourcegraph/zoekt/pull/615
        
           | bawolff wrote:
            | Yeah, I agree, that is weird. Especially if you search for
            | something super common like "function", you basically DoS it.
        
       | sdesol wrote:
       | > It's hard to find any accounts of code-search using FTS
       | 
       | I'm actually going to be doing this soon. I've thought about code
       | search for close to a decade, but I walked away from it, because
       | there really isn't a business for it. However, now with AI, I'm
       | more interested in using it to help find relevant context and I
       | have no reason to believe FTS won't work. In the past I used
       | Lucene, but I'm planning on going all in with Postgres.
       | 
        | The magic to fast code search (search in general) is keeping
        | things small. As long as your search solution is context-aware,
        | you can easily leverage Postgres sharding to reduce index sizes.
        | I'm a strong believer in "disk space is cheap, time isn't", which
        | means I'm not afraid to create as many indexes as required to
        | shave hundreds of milliseconds off searches.
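        | 
        | For the FTS side, a sketch of the Postgres setup I'd start from
        | (table and column names hypothetical; the 'simple' text search
        | config is the built-in one that skips stemming and stop words,
        | which is closer to what code wants):
        | 
        |     # Sketch: Postgres FTS over code with psycopg2.
        |     import psycopg2
        | 
        |     conn = psycopg2.connect("dbname=codesearch")
        |     with conn, conn.cursor() as cur:
        |         cur.execute("""
        |             CREATE TABLE IF NOT EXISTS files (
        |                 id serial PRIMARY KEY,
        |                 path text,
        |                 code text,
        |                 code_tsv tsvector GENERATED ALWAYS AS
        |                     (to_tsvector('simple', code)) STORED)
        |         """)
        |         cur.execute(
        |             "CREATE INDEX IF NOT EXISTS files_tsv_idx "
        |             "ON files USING gin (code_tsv)")
        |         cur.execute(
        |             "SELECT path FROM files WHERE code_tsv "
        |             "@@ plainto_tsquery('simple', %s)",
        |             ("toString",))
        |         print(cur.fetchall())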
        
         | bevekspldnw wrote:
          | Mmm, it's not that straightforward: indexes can vastly slow
          | down large-scale ingest, so it's really about _when_ to index
          | as well.
          | 
          | I work with a lot of multi-billion-row datasets, and a lot of
          | my recent focus has been on developing strategies to avoid the
          | slowdown on ingest while still enjoying the speedup that
          | indexes give search.
          | 
          | I've also gotten some mind-boggling speed increases by
          | summarizing key searchable data in smaller tables, some with
          | JSONB columns that are abstractions of other data, indexing
          | those, and using pg_prewarm to serve those tables purely from
          | memory. I literally went from queries taking actual days to < 1
          | sec.
        
       | jessemhan wrote:
       | Good scalable codebase search is tough. We built a scalable,
       | fast, and super simple solution for codebase semantic search:
       | https://phorm.ai
        
       | Macha wrote:
       | > This is a pretty bad index: it has words that should be stop
       | words, like function, and won't split a.toString() into two
       | tokens because . is not a default word boundary.
       | 
        | So GitHub used to (maybe still does) "fix" this one, and it's
        | annoying. Although GitHub is ramping up its IDE-like find-
        | usages, it's still not perfect, so sometimes you just want a
        | text-search equivalent of "foo.bar()" for all the uses it
        | misses, and this stemming behaviour then finds every file where
        | foo and bar are mentioned, which bloats results.
        
       | ricardobeat wrote:
        | I don't understand their hand-waving away of Zoekt. It was built
        | exactly for this purpose, and is no more of a "new infrastructure
        | commitment" than the other options. The server is a single
        | binary, the indexer is also a single binary; it can't get any
        | simpler than that.
       | 
       | To me it doesn't make sense to be more scared of it than
       | Elasticsearch...
        
       | philippemnoel wrote:
       | ParadeDB founder here. We'd love to be supported on Render, if
       | the Render folks are open to it...
        
       | 727564797069706 wrote:
       | If you're serious about scaling up, definitely consider Vespa
       | (https://vespa.ai).
       | 
       | At serious scale, Vespa will likely knock all the other options
       | out of the park.
        
       | herrington_d wrote:
        | Is it possible to combine n-grams and the AST to produce a
        | better index?
        | 
        | Take `sourceCode.toString()` as an example: the AST can split it
        | into `sourceCode` and `toString`. A further indexer can break
        | `sourceCode` into `source` and `code`.
        | 
        | For AST extraction, a project like
        | https://github.com/ast-grep/ast-grep can help.
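        | 
        | A sketch of that second split (a regex stands in here for the
        | AST pass that ast-grep or a real parser would do):
        | 
        |     # "sourceCode.toString()" -> identifiers -> sub-tokens.
        |     import re
        | 
        |     IDENT = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")
        |     CAMEL = re.compile(r"[A-Z]?[a-z0-9]+|[A-Z]+(?![a-z])")
        | 
        |     def index_terms(expr: str) -> set[str]:
        |         terms = set()
        |         for ident in IDENT.findall(expr):
        |             terms.add(ident.lower())  # whole identifier
        |             terms.update(m.group(0).lower()
        |                          for m in CAMEL.finditer(ident))
        |         return terms
        | 
        |     print(sorted(index_terms("sourceCode.toString()")))
        |     # ['code', 'source', 'sourcecode', 'string', 'to',
        |     #  'tostring']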
        
         | boyter wrote:
          | You could, but I don't know what you gain out of it. The
          | underlying index would be almost the same size, and n-grams
          | would also allow you to search for e.t, for example, which you
          | lose in this process.
        
       | campbel wrote:
       | > Sourcegraph's maintained fork of Zoekt is pretty cool, but is
       | pretty fearfully niche and would be a big, new infrastructure
       | commitment.
       | 
        | I don't think Zoekt is as scary as this article makes it out to
        | be. I set this up at my current company after getting experience
        | with it at Shopify and it's really great.
        
       | boyter wrote:
        | Code search is indeed hard. Stop words, stemming and such do rule
        | out most off-the-shelf indexing solutions, but you can usually
        | turn them off. You can even get around the splitting issues of
        | things like                   a.toString()
        | 
        | with some pre-processing of the content. However, where you
        | really get into a world of pain is allowing someone to search
        | for ring in the example. You can use partial term search
        | (prefix, infix, or suffix), but this massively bloats the index
        | and is slow to run.
        | 
        | The next thing you try is trigrams, and suddenly you have to deal
        | with false positive matches. So you add a positional portion to
        | your index, and all of a sudden the underlying index is larger
        | than the content you are indexing.
       | 
        | It's good fun though. For those curious about it, I would also
        | suggest reading posts by Michael Stapelberg
        | https://michael.stapelberg.ch/posts/ who writes about Debian Code
        | Search (which I believe he started) in addition to the other
        | posts mentioned here. Shameless plug: I also write about this at
        | https://boyter.org/posts/how-i-built-my-own-index-for-search...
        | where I go into some of the issues when building a custom index
        | for searchcode.com.
        | 
        | Oddly enough, I think you can go a long way brute-forcing the
        | search if you don't do anything obviously wrong. For situations
        | where you are only allowed to search a small portion of the
        | content, say just your own (which looks applicable in this
        | situation), that's what I would do. Adding an index is really
        | only useful when you start searching at scale or you are getting
        | semantic search out of it. For keyword search, which is what the
        | article appears to be talking about, brute force is what I would
        | be inclined to do.
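        | 
        | For the trigram route, a sketch of the index-then-verify shape
        | (a toy in-memory version; real ones store positions and get
        | big):
        | 
        |     # Trigrams only narrow the search to candidate documents;
        |     # a literal re-scan of each candidate removes the false
        |     # positives.
        |     from collections import defaultdict
        | 
        |     def trigrams(s):
        |         return {s[i:i + 3] for i in range(len(s) - 2)}
        | 
        |     class TrigramIndex:
        |         def __init__(self):
        |             self.postings = defaultdict(set)
        |             self.docs = {}
        | 
        |         def add(self, doc_id, text):
        |             self.docs[doc_id] = text
        |             for t in trigrams(text):
        |                 self.postings[t].add(doc_id)
        | 
        |         def search(self, query):
        |             qts = trigrams(query)
        |             if not qts:  # sub-3-char query: full scan
        |                 return [d for d, t in self.docs.items()
        |                         if query in t]
        |             candidates = set.intersection(
        |                 *(self.postings.get(t, set()) for t in qts))
        |             # verification pass drops accidental matches
        |             return [d for d in candidates
        |                     if query in self.docs[d]]
        | 
        |     idx = TrigramIndex()
        |     idx.add("a.ts", "const ring = 1")
        |     idx.add("b.ts", "printing()")  # has rin+ing, not "ring"
        |     print(idx.search("ring"))  # ['a.ts']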
        
       | nbenitezl wrote:
       | Also https://github.com/Debian/dcs
        
       ___________________________________________________________________
       (page generated 2024-04-10 23:00 UTC)