[HN Gopher] Code Search at Google: Han-Wen and Zoekt
___________________________________________________________________
Code Search at Google: Han-Wen and Zoekt
Author : intrepidsoldier
Score : 94 points
Date : 2023-11-21 17:10 UTC (5 hours ago)
(HTM) web link (sourcegraph.com)
(TXT) w3m dump (sourcegraph.com)
| jeffbee wrote:
| Is Zoekt actually in use at Google and if so how does it relate
| to Kythe? I know the Zoekt instance for Bazel exists, but the
| Kythe index also exists
| (https://cs.opensource.google/bazel/bazel)
| dmoy wrote:
| It has nothing to do with Kythe.
|
| I'm on the Kythe team, and I don't know off the top of my head
| what Zoekt is. Looking it up, I see it's some sort of trigram
| search, which means if it's used at all (I have no idea), it's
| codesearch proper, not Kythe.
|
| The Kythe index is the semantic index of the codebase,
| Codesearch does all of the text/regex/etc searching.
| sluongng wrote:
| Are you sure? There is find definition and references in
| https://cs.opensource.google/bazel/bazel and I'm quite sure
| it's thanks to the Kythe indexing job Bazel team is running
| in CI.
| dmoy wrote:
| The refs & jump to def in bazel/bazel are using Kythe, yes.
| But that is Kythe's semantic index from running (also it's
| Kythe team running it, not bazel team). It's not the
| Codesearch trigram/text search (which again, I have no idea
| if it uses zoekt).
| hanwenn wrote:
| Not in use that I know
| frutiger wrote:
| I'm a bit confused as to how
| https://swtch.com/~rsc/regexp/regexp4.html isn't mentioned at
| all.
| beyang wrote:
| Zoekt was heavily inspired by Google's internal code search, as
| mentioned in the blog post. The original version of the
| internal code search is described in the rsc post. Zoekt keeps
| some of the foundational ideas (e.g., trigram index), but was a
| from-scratch implementation. We probably should link to the rsc
| post for completeness, will update.
| hanwenn wrote:
| At the time that I started Zoekt (2016), Google's internal
| codesearch used suffix arrays for the string matching, which
| the team wasn't happy with, presumably because of the
| algorithmic complexity and indexing slowness. The Codesearch
| team was exploring alternatives, one of them the technique
| described in
| https://link.springer.com/article/10.1007/s11390-016-1618-6.
| The positional trigrams were a simplification of this, that
| they didn't mind me open sourcing.
|
| So, in terms of algorithms, Zoekt wasn't actually inspired by
| Google's internal code search.
|
| The precise query syntax of zoekt is mostly copied from
| google's internal syntax, though.
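
A minimal sketch of the positional-trigram idea mentioned above, not
Zoekt's actual code. It assumes the index maps each trigram to the
offsets where it starts, and that a substring query is filtered by
checking that its first and last trigram occur the right distance
apart before confirming against the text. The names
build_positional_index and find_substring and the sample corpus are
hypothetical.

```python
from collections import defaultdict


def build_positional_index(text):
    """Map each trigram to the ascending list of offsets where it starts."""
    index = defaultdict(list)
    for offset in range(len(text) - 2):
        index[text[offset:offset + 3]].append(offset)
    return index


def find_substring(text, index, query):
    """Return every offset where query occurs, using two trigrams as a filter."""
    if len(query) < 3:
        raise ValueError("query must have at least three characters")
    gap = len(query) - 3  # distance between the first and last trigram
    last_offsets = set(index.get(query[-3:], []))
    hits = []
    for start in index.get(query[:3], []):
        # Keep only starts whose last trigram sits at the right distance,
        # then confirm against the text to rule out false positives.
        if start + gap in last_offsets and text.startswith(query, start):
            hits.append(start)
    return hits


# Hypothetical sample data: "quick brown" passes the distance filter for
# the query below but is rejected by the final text comparison.
corpus = "the quick brown fox; the quiet brown fox"
positions = find_substring(corpus, build_positional_index(corpus), "quiet brown")
```

Because the offsets are stored, most non-matches are discarded by
arithmetic on the posting lists alone, at the cost of a larger index.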
| IshKebab wrote:
| It is mentioned.
| frutiger wrote:
| It is now, but wasn't earlier.
| hanwenn wrote:
| Russ Cox' trigram approach uses document IDs for the posting
| list, which makes the index much smaller, but gives less
| precise (i.e. slower) matching. This is mentioned in the design
| doc at https://github.com/sourcegraph/zoekt/blob/main/doc/design.md....
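
For contrast, a rough sketch of a trigram index whose posting lists
hold only document IDs, as the comment above describes. This is an
illustration under assumptions, not Russ Cox's implementation: the
names build_doc_index and search_docs and the two sample documents
are hypothetical. The index is smaller because it stores no offsets,
but every candidate document has to be scanned to confirm a match.

```python
from collections import defaultdict


def build_doc_index(docs):
    """Map each trigram to the set of document IDs that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for offset in range(len(text) - 2):
            index[text[offset:offset + 3]].add(doc_id)
    return index


def search_docs(docs, index, query):
    """Return the sorted IDs of documents that contain query."""
    if len(query) < 3:
        raise ValueError("query must have at least three characters")
    trigrams = {query[i:i + 3] for i in range(len(query) - 2)}
    candidates = set.intersection(*(index.get(t, set()) for t in trigrams))
    # The posting lists only say that a document holds every trigram
    # somewhere, so each candidate still needs a full scan.
    return sorted(doc_id for doc_id in candidates if query in docs[doc_id])


# Hypothetical sample data: both documents contain the trigrams "abc"
# and "bcd", but only one contains the string "abcd".
docs = {
    "a.txt": "abcd",
    "b.txt": "bcd abc",
}
index = build_doc_index(docs)
candidate_ids = index["abc"] & index["bcd"]
matches = search_docs(docs, index, "abcd")
```

Here the candidate set is larger than the final result, which is the
extra verification work that positional posting lists avoid.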
| j2kun wrote:
| IIUC, the main thing that Google's internal codesearch does that
| makes it superior to external systems (outside of an IDE, like
| GitHub code search) is that Google actually builds everything,
| and so it can incorporate that information into its index.
| There's only so much text search can do when you have macros
| generating code.
| dmoy wrote:
| Yea that would be Kythe. We build almost everything, across
| 44-45 different programming languages, and postprocess that
| into a giant semantic graph.
|
| Most major parts are open sourced at kythe.io, and there's a
| somewhat dated talk given by Luke here:
| https://youtu.be/VYI3ji8aSM0
| dmoy wrote:
| > macros
|
| Corollary: while we can do a lot with indexing generated code
| (even cross language) in Kythe, there are limits. Macros may
| be one, I forget atm
| sa46 wrote:
| Do you have any case studies or success stories for non-Google
| repos? I miss code search but I'm not sure how close
| Kythe is to code-search-in-a-box.
| dmoy wrote:
| Internally, we use variants of our pipeline to index a
| variety of open source repos, and some non-blaze/bazel
| internal repos. Those are often non-Google repos. But we're
| using some internal postprocessing and serving logic to
| actually create and host the final index.
|
| Unfortunately I don't know if there's any significant use
| of Kythe outside of Google. We get a handful of questions
| on the open source repo from time to time, but that's all I
| know about.
| beyang wrote:
| Great call out! We've built this code navigation infra on top
| of Zoekt into Sourcegraph. Example:
| https://sourcegraph.com/github.com/golang/go/-/blob/src/net/...
|
| Docs:
| https://docs.sourcegraph.com/code_navigation/explanations/pr...
| tromp wrote:
| Wondering how this tool got named after the Dutch verb for seek,
| I found this quote on its github page [1].
|
| > "Zoekt, en gij zult spinazie eten" - Jan Eertink
|
| > ("seek, and ye shall eat spinach" - My primary school teacher)
|
| [1] https://github.com/sourcegraph/zoekt
| JohnMakin wrote:
| I didn't start my tech journey til late 00's, so it's constantly
| surprising to me that something as ubiquitous as git only came
| out in _2005_.
|
| Is it possible at all this story helped spur the widespread
| adoption of git (the early implementation of this tool)?
| jeffbee wrote:
| I think it is odd that the story mentions git at all. Git5, the
| mentioned wrapper around piper, had only a niche audience when
| I last used it 5 years ago, and it was a demonstrated fact that
| the users of it were less productive than perforce users.
| Whether that was causal or not was unknown.
| hanwenn wrote:
| Hi, I'm the Han-Wen from the title.
|
| The story mentions git because git5 got me into developer
| tooling. More in particular, it put me in touch with Shawn
| Pearce who ran the Git/Gerrit team at Google. When I went to
| work for him, Shawn wanted to have codesearch support in
| Gerrit, and Zoekt was ultimately the outcome of my
| explorations in this space.
|
| IIRC, Git5 was deleted approximately 5 years ago because Fig
| (the Hg based replacement) had taken over all the use cases.
| billllll wrote:
| I agree there doesn't seem to be a good connection between
| work on version control and work on code search.
|
| However, I don't think it makes sense to downplay git5.
| Anecdotally, basically everybody knew about it, and I'd
| constantly run into people using it (which is by itself
| noteworthy since nobody was exactly talking about version
| control all the time).
|
| Git5 was at the time the most robust solution to chain
| commits, which was tedious bordering on impossible without
| some tool. Without definitive data, I wouldn't say users were
| less productive with git5: it definitely was a useful tool
| that people at least recommended for chain commits. I was
| definitely more productive with it.
|
| There were a lot of footguns though, and I do think the hg
| wrapper that superseded it was way better.
| ajross wrote:
| Git stepped into a source control ecosystem that was well-
| served (albeit contentious). People knew (or at least thought
| they knew) what they wanted from bk/svn/CVS/p4/rcs/sccs.
|
| So git essentially was the "final form" that integrated all the
| various workflows and topped it off with a maximally-scaled use
| case (linux) that proved out the tool, drove innovation in
| integration/scripting/gadgetry, and provided a clear beacon for
| everyone else to adopt it. So it won.
|
| But in 2004, if you asked around, everyone would have told you
| that a tool like this was coming at some point (even if they
| probably wouldn't have described it as very git-like!).
| pgeorgi wrote:
| If you squint a little,
| https://web.archive.org/web/20030629114010/http://www.venge.... is a
| fair approximation of some of
| the core ideas behind git (and Linus played with it and wrote
| a critique of its short-comings before starting git)
| mettamage wrote:
| That is so cool to see where Linus got some of his
| inspiration from. It made a few things more clear to me as
| to why git uses certain things.
| justrealist wrote:
| Oh yeah. I remember merging SVN branches into production in
| 2010 or so.
|
| It was a... special time. Let's not reminisce.
| dekhn wrote:
| Before git, most people in my larger circle used RCS, a UNIX
| version control system from the early 80's. It was very limited
| (basically each file had its own side-file that contained
| revision data, and there was no project-wide file) but did its
| job. Many people moved over to VCS, which used RCS files but
| added project-wide files so you could manage a dir tree.
|
| After that, I think many people moved to subversion, which had
| a lot more functionality for distributed VC, for example there
| was a server. svn was popular for a while but building it was
| painful (due to berkeley db) and it sort of never grew. I
| invested a lot of time in it (specifically apache with mod_dav and
| mod_dav_svn) but lost interest in VC after fighting with
| subversion.
|
| git came along and from what I can tell it mainly had "it's by
| linus, and the kernel uses it" and "it's fast" and "something
| about reflogs". I use git day-to-day but I still can't explain
| how git became so ubiquitous; I find using it outright painful.
| dws wrote:
| Lightweight branches were a huge selling point. If you didn't
| do them often enough that they were rote, branches in
| RCS/CVS/SVN required ritual sacrifice.
| reportingsjr wrote:
| Mercurial (aka hg) was also gaining popularity at the same
| time as git. The interface was a lot nicer and more sane than
| git, but it had some serious performance limitations that
| hampered it.
|
| Both were definitely way better than SVN/CVS/etc.
| dekhn wrote:
| Yes, after using git for a few years I was introduced to
| Mercurial and it was like a breath of fresh air, although
| I'm also told hg added a number of things that made it much
| more usable, "right before I started using it".
|
| Since I have limited brain capacity I focus my efforts on
| being able to use git, not hg, merely because it has so
| much marketshare.
| cpach wrote:
| Nit-pick: Did you mean CVS?
| dekhn wrote:
| Yes, CVS.
| IshKebab wrote:
| The same algorithm is also used in Hound
| (https://github.com/hound-search/hound) though I have to say the
| best implementation of code search by far that I've seen is
| https://grep.app
|
| You really should check it out if you haven't already. It's
| incredibly useful; I used it all the time. Not open source
| though.
___________________________________________________________________
(page generated 2023-11-21 23:00 UTC)