[HN Gopher] Code search is hard
___________________________________________________________________
Code search is hard
Author : stevekrouse
Score : 132 points
Date : 2024-04-10 18:15 UTC (4 hours ago)
(HTM) web link (blog.val.town)
(TXT) w3m dump (blog.val.town)
| semiquaver wrote:
| Be careful with trigram indexes. At least in the postgres 10 era
| they caused severe index bloat for frequently updated tables.
| peter_l_downs wrote:
| Interesting, do you know anywhere I can easily read more about
| this? (I will do my own research, too.)
| boyter wrote:
| It's a result of trigrams themselves. For example, "searchcode"
| (please ignore the plug; it's just the example I had to hand)
| goes from 1 thing you would need to index to 8:
| "searchcode" -> [sea, ear, arc, rch, chc, hco, cod, ode]
|
| As a result the index rapidly becomes larger than you would
| expect.
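|
| A minimal sketch of that expansion (a hypothetical helper, not
| searchcode's actual indexer):
|
|     // Every term of length n yields n - 2 trigrams, so the index
|     // ends up with far more entries than the source has terms.
|     function trigrams(term: string): string[] {
|       const grams: string[] = [];
|       for (let i = 0; i + 3 <= term.length; i++) {
|         grams.push(term.slice(i, i + 3));
|       }
|       return grams;
|     }
|
|     trigrams("searchcode");
|     // => ["sea", "ear", "arc", "rch", "chc", "hco", "cod", "ode"]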
| healeycodes wrote:
| When a val is deployed on Val Town, my understanding is that it's
| parsed/compiled. At that point, can you save the parts of the
| program that people might search for? Names of imports,
| functions, variables, comments, etc.
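|
| A hedged sketch of that idea using the TypeScript compiler API
| (assuming vals are plain TypeScript; `searchableNames` is a
| hypothetical helper, and comments would need separate handling
| since they are trivia, not AST nodes):
|
|     import ts from "typescript";
|
|     // At deploy time, walk the AST once and save the identifiers
|     // (imports, functions, variables) people might search for.
|     function searchableNames(source: string): Set<string> {
|       const sf = ts.createSourceFile(
|         "val.ts", source, ts.ScriptTarget.Latest, true
|       );
|       const names = new Set<string>();
|       const visit = (node: ts.Node) => {
|         if (ts.isIdentifier(node)) names.add(node.text);
|         ts.forEachChild(node, visit);
|       };
|       visit(sf);
|       return names;
|     }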
| MH15 wrote:
| A val is just TypeScript, no? So unless they are also storing
| the AST, it would be JavaScript and that's it.
| nox101 wrote:
| any nuggets here?
|
| https://github.blog/2023-02-06-the-technology-behind-githubs...
| SomaticPirate wrote:
| Would also recommend the related video
| https://www.youtube.com/watch?v=CqZA_KmygKw
| boyter wrote:
| Yes, although the lack of detail about the sparse grams is
| frustrating.
| skybrian wrote:
| It seems like some of their gists have documentation attached and
| maybe that's enough? I'm not sure I'm all that interested in
| seeing undocumented gists in search results.
| peter_l_downs wrote:
| Surprised not to see Livegrep [0] on the list of options. Very
| well-engineered technology; the codebase is clean (if a little
| underdocumented on the architecture side) and you should be able
| to index your code without much difficulty. Built with Bazel
| (~meh, but useful if you don't have an existing cpp toolchain all
| set up) and there are prebuilt containers you can run. Try that
| first.
|
| By the way, there's a demo running here for the linux kernel, you
| can try it out and see what you think:
| https://livegrep.com/search/linux
|
| EDIT: by the way, "code search" is deeply underspecified. Before
| trying to compare all these different options, you really would
| benefit from writing down all the different types of queries you
| think your users will want to ask, including _why they want to
| run that query_ and _what results they'd expect_.
| Building/tuning search is almost as difficult a product problem
| as it is an engineering problem.
|
| [0] https://github.com/livegrep/livegrep
| isker wrote:
| When I investigated using livegrep for code search at work, it
| really struggled to scale to a large number of repositories. At
| least at the time (a few years ago) indexing in livegrep was a
| monolithic operation: you index all repos at once, which
| produces one giant index. This does not work well once you're
| past a certain threshold.
|
| I also recall that the indexes it produces are pretty
| heavyweight in terms of memory requirements, but I don't have
| any numbers on hand to justify that claim.
|
| Zoekt (also mentioned in TFA) has the correct properties in
| this regard. Except in niche configurations that are probably
| only employed at Sourcegraph, each repo is (re)indexed
| independently and produces a separate set of index files.
|
| But its builtin web UI left much to be desired (especially
| compared to livegrep), so I built one:
| https://github.com/isker/neogrok.
| peter_l_downs wrote:
| I like this better than livegrep. I haven't actually operated
| either zoekt OR livegrep before, but I'll probably start with
| zoekt+neogrok next time I want to stand up a codesearch page.
| Thanks for building and sharing this!
| ayberk wrote:
| It indeed is hard, and a good code search platform makes life so
| much easier. If I ever leave Google, the internal code search is
| for sure going to be the thing I miss the most. It's so well
| integrated into how everything else works (Blaze target finding,
| Guice bindings, etc.) that I can't imagine my life without it.
|
| I'm reminded to appreciate it even more every time I use GitHub's
| search. Not that GitHub's is bad; it's just inherently so much
| harder to build a generalized code search platform.
| peter_l_downs wrote:
| If you ever leave you can use Livegrep, which was based on
| code-search work done at Google. I personally don't use it
| right now but it's great and will probably meet all your needs.
|
| [0] https://github.com/livegrep/livegrep
| init wrote:
| I've used both Code Search and Livegrep. No, Livegrep does
| not even come close to what Code Search can do.
|
| Sourcegraph is the closest thing I know of.
| isker wrote:
| Agreed. There are some public building blocks available
| (e.g. Kythe or Meta's Glean) but having something generic
| that produces the kind of experience you can get on
| cs.chromium.org seems impossible. You need such bespoke
| build integration across an entire organization to get
| there.
|
| Basic text search, as opposed to navigation, is all you'll
| get from anything out of the box.
| init wrote:
| In a past job I built a code search clone on top of
| Kythe, Zoekt and LSP (for languages that didn't have
| Bazel integration). I got help from another colleague to
| make the UI based on Monaco. We created a demo that many
| people loved, but we didn't productionize it for a few
| reasons (it was an unfunded hackathon project, and the
| company was considering another solution when they
| already had Livegrep).
|
| Producing the Kythe graph from the bazel artifacts was
| the most expensive part.
|
| Working with Kythe is also not easy as there is no
| documentation on how to run it at scale.
| isker wrote:
| Very cool. I tried to do things with Kythe at $JOB in the
| past, but gave up because the build (really, the many
| many independent builds) precluded any really useful
| integration.
|
| I did end up making a nice UI for vanilla Zoekt, as I
| mentioned elsewhere: https://github.com/isker/neogrok.
| tayo42 wrote:
| Is there like a summary of what's missing from public
| attempts and what makes it so much better?
| sdesol wrote:
| The short answer is context. The reason Google's
| internal code search is so good is that it's tied into their
| build system. This means that when you search, you know
| exactly which files to consider. Without that context, you are
| making an educated guess about which files to
| consider.
| riku_iki wrote:
| How exactly does integration with the build system help
| Google? Maybe you could give a specific example?
| sdesol wrote:
| If you want to build a product with a build system, you
| need to tell it what source to include. With this
| information, you know what files to consider and if you
| are dealing with a statically typed language like C or
| C++, you have build artifacts that can tell you where the
| implementation was defined. All of this takes the guesswork
| out of answering questions like "Which foo()
| implementation was used?".
|
| If all you know are repo branches, the best you can do is
| return matches from different repo branches in the
| hope that one of them is right.
|
| Edit: I should also add that with a build system, you
| know what version of a file to use.
| isker wrote:
| Try clicking around
| https://source.chromium.org/chromium/chromium/src, which
| is built with Kythe (I believe, or perhaps it's using
| something internal to Google that Kythe is the open
| source version of).
|
| By hooking into C++ compilation, Kythe is giving you
| things like _macro-aware_ navigation. Instead of trying
| to process raw source text off to the side, it's using
| the same data the compiler used to compile the code in
| the first place. So things like cross-references are
| "perfect", with no false positives in the results: Kythe
| knows the difference between two symbols in two different
| source files with the same name, whereas a search engine
| naively indexing source text, or even something with
| limited semantic knowledge like tree-sitter, cannot
| perfectly make the distinction.
| j2kun wrote:
| Google builds all the code in its monorepo continuously,
| and the built artifacts are available for the search.
| Open source tools are never going to incur the cost of
| actually building all the code they index.
| birktj wrote:
| I see most replies here are mentioning that the build
| integration is what is mainly missing in the public tools.
| I wonder if Nix and nixpkgs could be used here? Nix is a
| language-agnostic build system, and with nixpkgs it has
| build instructions for a massive number of packages.
| Artifacts for all packages are also available via Hydra.
|
| Nix should also have enough context so that for any project
| it can get the source code of all dependencies and
| (optionally) all build-time dependencies.
| jeffbee wrote:
| Build integration is not the main thing that is missing
| between Livegrep and Code Search. The main thing that is
| missing is the semantic index. Kythe knows the difference
| between this::fn(int) and this::fn(double) and
| that::fn(double) and so on. So you can find all the
| callers of the nullary constructor of some class, without
| false positives of the callers of the copy constructor or
| the move constructor. Livegrep simply doesn't have that
| ability at all. Livegrep is what it says it is on the
| box: grep.
| humanrebar wrote:
| The build system coherence provided by a monorepo with a
| single build system is what makes you understand
| this::fn(double) as a single thing. Otherwise, you will
| get N different mostly compatible but subtly different
| flavors of entities depending on the build flavor,
| combinations of versioned dependencies, and other things.
| jeffbee wrote:
| Sure. Also, if you eat a bunch of glass, you will get a
| stomach ache. I have no idea why anyone uses a polyrepo.
| humanrebar wrote:
| The problem with monorepos is that they're so great that
| everyone has a few.
| keybored wrote:
| > If you ever leave you can use Livegrep, which was based on
| code-search work done at Google.
|
| If I've learned anything from the fainting spells that
| I-work-at-X folks have over their internal tools on HN: no,
| the public/OSS variant is always a mere _shadow_ of
| _the real thing_.
| jeffbee wrote:
| It's not just that. Livegrep isn't just a pale imitation of
| something inside Google. It's totally unrelated in
| implementation, capabilities, and use case.
| scubbo wrote:
| I suspect you're being sarcastic - but can confirm that
| being nearly two years out of Amazon, I still miss its in-
| house CD system nearly every day. I've actively looked
| around for OSS replacements and very few come anywhere
| close.
|
| (I would be _delighted_ for someone to "Umm actually" me by
| providing a great product!)
| mdaniel wrote:
| My experience has been that any of these in-house things
| do not adapt well to the _high chaos_ of external
| environments: if there are 3 companies, one will find 9
| systems and processes in use, thus making "one size fits
| all" a fantasy.
|
| But, I'll bite: what made the CD system so dreamy, and
| what have you evaluated thus far that fall short?
| jeffbee wrote:
| Just want to note that Livegrep, its antecedent "codesearch",
| and other things that are basically grep bear no resemblance
| to that which a person working at Google calls "Code Search".
| chasil wrote:
| Oracle has USER/ALL/DBA_SOURCE views, and all of the PL/SQL
| (SQL/PSM) code that has been loaded into the database is
| presented there. These are all visible in cleartext unless they
| have been purposefully obfuscated.
|
| These views have columns for the owner, object name, LINE
| [NUMBER], and TEXT [VARCHAR2(4000)], and you can use LIKE or
| regexp_like() on any of the retained source code.
|
| I wonder if EnterpriseDB implements these inside of Postgres,
| and/or if they are otherwise available as an extension.
|
| Since most of SQL/PSM came from Oracle anyway, these would be an
| obvious desired feature.
|
| https://en.wikipedia.org/wiki/SQL/PSM
| worldsayshi wrote:
| I suppose using something like tree-sitter to get a consistent
| abstract syntax tree to work with would be a good starting point,
| and then try building a custom analyzer (if using Elasticsearch
| lingo) with that?
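|
| For what it's worth, a minimal sketch of that starting point
| with the Node bindings (assuming the tree-sitter and
| tree-sitter-javascript packages; the snippet parsed is
| arbitrary):
|
|     import Parser from "tree-sitter";
|     import JavaScript from "tree-sitter-javascript";
|
|     // Parse once, then pull out only the node types worth
|     // indexing.
|     const parser = new Parser();
|     parser.setLanguage(JavaScript);
|     const tree = parser.parse("const idx = items.indexOf(needle);");
|     const identifiers = tree.rootNode
|       .descendantsOfType("identifier")
|       .map((node) => node.text);
|     // => ["idx", "items", "needle"]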
| azornathogron wrote:
| Another option is to start with Kythe, which is Google's own
| open source framework for getting a uniform cross-language
| semantic layer: https://kythe.io/
|
| Worth looking at as a source of inspiration and design ideas
| even if you don't want to use it itself.
| samatman wrote:
| Might be overkill unless you're looking to do semantic search.
| I've thought about what a search DSL for code would look like;
| it's challenging to embody a query like "method which takes an
| Int64 and has a variable idx in it" into something compact and
| memorable.
|
| But a tokenizer seems like a good place to start, I think
| that's the right granularity for this kind of application.
| You'd want to do some chunking so that foo.bar doesn't find
| every foo and every bar, that sort of thing. Code search is, as
| the title says, a hard problem. But a language-aware token
| stream, the one you'd get from the lexer, is probably where one
| should start in building the database.
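|
| As a rough illustration of that chunking (a made-up splitter,
| not any particular lexer):
|
|     // Index the compound token and its parts, so a search for
|     // foo.bar matches exactly, while a bare foo query is not
|     // forced to rank every unrelated foo.
|     function tokenize(code: string): string[] {
|       const tokens: string[] = [];
|       for (const t of code.match(/[\w$]+(?:\.[\w$]+)*/g) ?? []) {
|         tokens.push(t);
|         if (t.includes(".")) tokens.push(...t.split("."));
|       }
|       return tokens;
|     }
|
|     tokenize("foo.bar(baz)");
|     // => ["foo.bar", "foo", "bar", "baz"]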
| worldsayshi wrote:
| Sure, you should definitely not try to do the overkill use
| case first, but I would assume that tree-sitter can emit
| "just" tokens as well? Getting the flexibility and control of
| a tool like tree-sitter should let you quickly throw
| away stuff like comments and keywords if you want, since you
| can do syntax-aware filtering.
|
| Then again, I haven't used tree-sitter; I can just imagine
| that this is a strength of it.
| IshKebab wrote:
| I would use Hound.
| jackbravo wrote:
| Would LLM vector embeddings work in this context? I'm guessing
| they should since they are very good at understanding code.
| CityOfThrowaway wrote:
| Yes, but generating that index would be expensive.
| anonymousDan wrote:
| Why exactly? You mean to construct the embeddings or to embed
| the queries?
| sqs wrote:
| I'm at Sourcegraph (mentioned in the blog post). We obviously
| have to deal with massive scale, but for anyone starting out
| adding code search to their product, I'd recommend not starting
| with an index and just doing on-the-fly searching until that does
| not scale. It actually will scale well for longer than you think
| if you just need to find the first N matches (because that result
| buffer can be filled without needing to search everything
| exhaustively). Happy to chat with anyone who's building this kind
| of thing, including with folks at Val Town, which is awesome.
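|
| A toy version of that early-exit property (hypothetical file
| list; not Sourcegraph's actual code):
|
|     import { readFile } from "node:fs/promises";
|
|     // Brute-force scan, stopping once the result buffer is
|     // full. No index needed for the first N matches.
|     async function search(files: string[], re: RegExp, limit = 50) {
|       const hits: { file: string; line: number }[] = [];
|       for (const file of files) {
|         const lines = (await readFile(file, "utf8")).split("\n");
|         for (let i = 0; i < lines.length; i++) {
|           if (re.test(lines[i])) {
|             hits.push({ file, line: i + 1 });
|             if (hits.length >= limit) return hits; // stop early
|           }
|         }
|       }
|       return hits;
|     }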
| isker wrote:
| And when you're ready to do indexed search, Zoekt (over which
| Sourcegraph graciously took maintainership a while ago) is the
| best way to do it that I've found. After discounting both
| Livegrep and Hound (they both struggled to perform in various
| dimensions with the amount of stuff we wanted indexed, Hound
| moreso than Livegrep), we migrated to Zoekt from a
| (necessarily) very old and creaky deployment of OpenGrok and
| it's night and day, both in terms of indexing performance and
| search performance/ergonomics.
|
| Sourcegraph of course adds many more sophisticated features on
| top of just the code search that Zoekt provides.
| morgante wrote:
| I've been surprised at how far you can get without indexing.
|
| Ex. I always assume we'll need to add an index to speed up
| GritQL (https://github.com/getgrit/gritql), but we've gotten
| pretty far with doing search entirely on the fly.
| worldsayshi wrote:
| What does 'on the fly' entail here?
| simonw wrote:
| I'm going to guess brute force - scan everything for the
| search term, rather than trying to use an index.
|
| I'm always amazed at how fast ripgrep (rg) can brute force
| its way through hundreds of MBs of source code.
| morgante wrote:
| Yes, exactly. When doing a search, we parse and search
| every file without any indexing.
|
| Of course, it could still be sped up considerably with an
| index but brute force is surprisingly effective (we use
| some of the same techniques/crates as ripgrep).
| klysm wrote:
| I apply this thinking to lots of problems. Do the dumb thing
| that involves the least state and prove we need to lean more
| towards memory for speed. It's much simpler to keep things
| correct when nothing is cached.
| hinkley wrote:
| There was someone doing temporal databases who was compressing
| blocks on disk and doing streaming decompress-and-search on
| them. Things in L2 cache go very, very fast.
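|
| A minimal sketch of that pattern with Node's zlib (illustrative
| only; not the temporal database in question):
|
|     import { createReadStream } from "node:fs";
|     import { createGunzip } from "node:zlib";
|     import { createInterface } from "node:readline";
|
|     // Stream-decompress a gzipped corpus and grep it line by
|     // line; the working set stays small enough to sit in cache.
|     async function searchGzip(path: string, needle: string) {
|       const lines = createInterface({
|         input: createReadStream(path).pipe(createGunzip()),
|       });
|       let lineNo = 0;
|       for await (const line of lines) {
|         lineNo++;
|         if (line.includes(needle)) {
|           console.log(`${path}:${lineNo}: ${line}`);
|         }
|       }
|     }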
| bawolff wrote:
| Surprised that hound https://github.com/hound-search/hound isn't
| mentioned. I thought it was the leader of open source solutions
| in this space.
|
| I've been using Wikimedia's instance (
| https://codesearch.wmcloud.org/search/ ) and have generally been
| pretty happy with what it provides.
| isker wrote:
| Hound has made an interesting choice to not bound searches.
| https://codesearch.wmcloud.org/search/?q=test&files=&exclude...
| produces an ajax request that (for me) took 13s to produce a
| 55MB JSON response, and then takes many more seconds to render
| into the DOM.
|
| Properly bounding search response sizes was one of the things I
| had to ensure Zoekt could do in its JSON APIs that I use in
| neogrok: https://github.com/sourcegraph/zoekt/pull/615
| bawolff wrote:
| Yeah, I agree, that is weird. Especially if you search for
| something super common like "function" you basically DoS it.
| sdesol wrote:
| > It's hard to find any accounts of code-search using FTS
|
| I'm actually going to be doing this soon. I've thought about code
| search for close to a decade, but I walked away from it, because
| there really isn't a business for it. However, now with AI, I'm
| more interested in using it to help find relevant context and I
| have no reason to believe FTS won't work. In the past I used
| Lucene, but I'm planning on going all in with Postgres.
|
| The magic of fast code search (and search in general) is keeping
| things small. As long as your search solution is context-aware,
| you can easily leverage Postgres sharding to reduce index sizes.
| I'm a strong believer in "disk space is cheap, time isn't", which
| means I'm not afraid to create as many indexes as required to
| shave hundreds of milliseconds off searches.
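|
| For the curious, a sketch of the kind of setup meant here, via
| node-postgres and a hypothetical vals(id, path, body) table; the
| 'simple' config avoids the stemming and stop-word behaviour that
| hurts code search:
|
|     import { Client } from "pg";
|
|     const db = new Client();
|     await db.connect();
|
|     // 'simple' skips stemming and stop words, both usually wrong
|     // for code. GENERATED keeps the tsvector in sync with body.
|     await db.query(`
|       ALTER TABLE vals ADD COLUMN IF NOT EXISTS body_tsv tsvector
|         GENERATED ALWAYS AS (to_tsvector('simple', body)) STORED
|     `);
|     await db.query(
|       "CREATE INDEX IF NOT EXISTS vals_fts ON vals USING GIN (body_tsv)"
|     );
|
|     const { rows } = await db.query(
|       `SELECT id, path
|          FROM vals
|         WHERE body_tsv @@ plainto_tsquery('simple', $1)
|         LIMIT 50`,
|       ["toString"]
|     );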
| bevekspldnw wrote:
| Mmm, it's not that straightforward: indexes can vastly slow
| down large-scale ingest, so it's really about _when_ to index
| as well.
|
| I work with a lot of multi billion row datasets and a lot of my
| recent focus has been on developing strategies to avoid the
| slow down with ingest, and then enjoying the speed up for
| indexed on search.
|
| I've also gotten some mind-boggling speed increases by
| summarizing key searchable data in smaller tables, some with
| JSONB columns that are abstractions of other data, indexing
| those, and using pg_prewarm to serve those tables purely from
| memory. I literally went from queries taking actual days to < 1
| sec.
| jessemhan wrote:
| Good scalable codebase search is tough. We built a scalable,
| fast, and super simple solution for codebase semantic search:
| https://phorm.ai
| Macha wrote:
| > This is a pretty bad index: it has words that should be stop
| words, like function, and won't split a.toString() into two
| tokens because . is not a default word boundary.
|
| So GitHub used to (and maybe still does) "fix" this one, and it's
| annoying. Although GitHub is ramping up its IDE-like find-
| usages, it's still not perfect, so sometimes you just want a
| text-search equivalent for "foo.bar()" for all the uses it misses,
| and this stemming behaviour then finds every file where foo and
| bar are mentioned, which bloats the results.
| ricardobeat wrote:
| I don't understand their hand-waving dismissal of Zoekt. It was built
| exactly for this purpose, and is not a "new infrastructure
| commitment" any more than the other options. The server is a
| single binary, the indexer is also a single binary, can't get any
| simpler than that.
|
| To me it doesn't make sense to be more scared of it than
| Elasticsearch...
| philippemnoel wrote:
| ParadeDB founder here. We'd love to be supported on Render, if
| the Render folks are open to it...
| 727564797069706 wrote:
| If you're serious about scaling up, definitely consider Vespa
| (https://vespa.ai).
|
| At serious scale, Vespa will likely knock all the other options
| out of the park.
| herrington_d wrote:
| Is it possible to combine n-grams and the AST to produce a
| better index?
|
| Take `sourceCode.toString()` as an example: the AST can split it
| into `sourceCode` and `toString`. A further indexer can break
| `sourceCode` into `source` and `code`.
|
| For AST dumping, a project like
| https://github.com/ast-grep/ast-grep can help.
| boyter wrote:
| You could, but I don't know what you gain out of it. The
| underlying index would be almost the same size, and n-grams
| would also allow you to search for e.t, for example, which you
| lose in this process.
| campbel wrote:
| > Sourcegraph's maintained fork of Zoekt is pretty cool, but is
| pretty fearfully niche and would be a big, new infrastructure
| commitment.
|
| I don't think Zoekt is as scary as this article makes it out to
| be. I set this up at my current company after getting experience
| with it at Shopify, and it's really great.
| boyter wrote:
| Code search is indeed hard. Stop words, stemming, and such do rule
| out most off-the-shelf indexing solutions, but you can usually
| turn them off. You can even get around the splitting issues of
| things like a.toString() with some pre-processing of the content.
|
| However, where you really get into a world of pain is allowing
| someone to search for ring in that example. You can use partial
| term search (prefix, infix, or suffix), but this massively bloats
| the index and is slow to run.
|
| The next thing you try is trigrams, and suddenly you have to deal
| with false positive matches. So you add a positional portion to
| your index, and all of a sudden the underlying index is larger
| than the content you are indexing.
|
| It's good fun though. For those curious about it, I would also
| suggest reading posts by Michael Stapelberg
| https://michael.stapelberg.ch/posts/ who writes about Debian Code
| Search (which I believe he started) in addition to the other
| posts mentioned here. Shameless plug, I also write about this
| https://boyter.org/posts/how-i-built-my-own-index-for-search...
| where I go into some of the issues when building a custom index
| for searchcode.com
|
| Oddly enough, I think you can go a long way brute-forcing the
| search if you don't do anything obviously wrong. For situations
| where you are only allowed to search a small portion of the
| content, say just your own (which looks applicable in this
| situation), that's what I would do. Adding an index is really only
| useful when you start searching at scale or you are getting
| semantic search out of it. For keywords, which is what the article
| appears to be talking about, brute force is what I would be
| inclined to do.
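|
| To make the positional part concrete, here is a toy version
| (nothing like searchcode's real index; `Posting` and the
| helper names are made up):
|
|     // Trigram -> postings of (doc, position). Storing positions
|     // is exactly what makes the index rival the corpus in size.
|     type Posting = { doc: number; pos: number };
|     const index = new Map<string, Posting[]>();
|
|     function addDoc(doc: number, text: string) {
|       for (let i = 0; i + 3 <= text.length; i++) {
|         const gram = text.slice(i, i + 3);
|         let list = index.get(gram);
|         if (!list) index.set(gram, (list = []));
|         list.push({ doc, pos: i });
|       }
|     }
|
|     // Candidates: places where the query's trigrams line up at
|     // the right relative offsets. A final check against the raw
|     // text must still confirm each one, since the unchecked
|     // tail of the query can produce false positives.
|     function candidates(query: string): Posting[] {
|       return (index.get(query.slice(0, 3)) ?? []).filter((s) => {
|         for (let i = 3; i + 3 <= query.length; i += 3) {
|           const hits = index.get(query.slice(i, i + 3)) ?? [];
|           if (!hits.some((p) => p.doc === s.doc && p.pos === s.pos + i))
|             return false;
|         }
|         return true;
|       });
|     }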
| nbenitezl wrote:
| Also https://github.com/Debian/dcs
___________________________________________________________________
(page generated 2024-04-10 23:00 UTC)