[HN Gopher] Glean - System for collecting, deriving and querying...
       ___________________________________________________________________
        
       Glean - System for collecting, deriving and querying facts about
       source code
        
       Author : dons
       Score  : 195 points
       Date   : 2021-08-31 09:56 UTC (13 hours ago)
        
 (HTM) web link (glean.software)
 (TXT) w3m dump (glean.software)
        
       | aabaker99 wrote:
       | Cool! I would love to play around with this.
       | 
       | How do I write a schema and indexer for my favorite programming
       | language that isn't currently (and won't be) supported with
       | official releases?
       | 
       | For Schemas, [1] says to modify (or base new ones off) these:
       | https://github.com/facebookincubator/Glean/tree/main/glean/s...
       | 
       | For Indexers, it's a little less clear but it looks like I need
       | to write my own type checker?
       | 
       | [1] https://glean.software/docs/schema/workflow
        
       | avinassh wrote:
       | How does this actually work? Where can I learn more about the
       | indexing and searching?
        
       | metalliqaz wrote:
       | We have used SciTools Understand to do this on local source code.
       | What is the use of putting this in the cloud? The website doesn't
       | really explain that.
        
       | tclancy wrote:
       | Getting a 401 when trying `docker pull
       | ghcr.io/facebookincubator/glean/demo:latest` -- is that true for
       | anyone else?
        
         | simonmar wrote:
         | Sorry about that, the package was still set to private. Try
         | again now?
        
           | tclancy wrote:
           | All set, thanks!
        
       | ing33k wrote:
       | 7GB docker image !
        
       | log101 wrote:
       | I didn't understand what it does
        
         | sealeck wrote:
         | It allows you to use a query language (think SQL) to analyze
         | source code.
        
           | tyingq wrote:
           | The docs seem to get to queries here:
           | https://glean.software/docs/angle/guide
        
       | balddenimhero wrote:
       | Datalog-ish query languages sure are a fun area to be working in.
       | Such DSLs exist for various domains and, like Semmle's CodeQL or
       | the more academic Souffle, Glean focuses on the domain of
       | programming languages.
       | 
       | Glean seems to still be work in progress, e.g. no support for
       | recursive queries yet, but I wonder where they're heading. I'll
       | certainly keep an eye on the project but I wonder how exactly
       | Glean aims to -- or maybe it already does -- improve upon the
       | alternatives? From the talk linked in another comment I guess the
       | distinctive feature may be the planned integration with IDEs.
       | Correct me if I'm wrong. Other contenders provide great querying
       | technology but there is indeed no strong focus on making such
       | tech really convenient and integrated yet.
        
         | dons wrote:
         | I think the point in the space Glean hits well is
         | efficiency/latency (enough to power real time editing, like in
         | IDE autocomplete or navigation), while having a schema and
         | query language generic enough to do multiple languages and
         | code-like things. You can accurately query JavaScript or Rust
         | or PHP or Python or C++ with a common interface, which is a bit
         | nuts :D
        
       | soonnow wrote:
       | I had a look at the site and it seems to be parsing source code
       | in multiple languages and storing the parsed "syntax trees" into
       | a database for querying.
       | 
       | I would love to know what the use case for this tool is aside from
       | maybe being a source for presentations? (We have 5 million if
       | statements).
       | 
       | How can this be used to improve code quality or any other aspect
       | of the code lifecycle?
       | 
       | Or is it solving problems in a completely different problem area?
        
         | lazamar wrote:
         | Glean is focused on storing and querying data about the code.
         | The idea is that you have your own program to collect that
         | data, then you use Glean to store that compactly and to have
         | snappy queries.
         | 
         | You would create entries like "this is a declaration of X",
         | "this is a use of X". Then you can query things like "give me
         | all uses of X" in sub-millisecond time. Hook that up to an LSP
         | server and you get almost zero-cost find-references,
         | jump-to-definition, etc. The snappy queries also mean it becomes
         | possible to perform whole codebase (and cross-language)
         | analysis. That is, answering questions like "what code is not
         | referenced from this root?", "does this Haskell function use
         | anything that calls malloc?" (analysis through the ffi
         | barrier).
         | 
         | One can also attach all kinds of information from different
         | sources to code entities, not only things derived from the
         | source itself. You add things like run-time costs, frequency of
         | use, common errors, etc, and an LSP server could make all of it
         | available right in your editor.
         | 
         | For very large or complex codebases, where it is just too
         | expensive or too complicated to calculate this information
         | locally a system like this becomes very useful.
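As a rough illustration of the declaration/use facts described above, here is a minimal in-memory sketch in Python. The `FactStore` class, its method names, and the fact shapes are assumptions made up for this example; they are not Glean's schema, storage format, or query language. The sketch only shows why pre-indexed facts make "all uses of X" and "what is not referenced from this root?" cheap lookups.

```python
from collections import defaultdict


class FactStore:
    """Toy store for 'declaration of X' and 'use of X' facts (hypothetical shapes)."""

    def __init__(self):
        self.declarations = {}          # symbol -> (file, line)
        self.uses = defaultdict(list)   # symbol -> [(file, line), ...]
        self.calls = defaultdict(set)   # using symbol -> {used symbols}

    def add_declaration(self, symbol, file, line):
        self.declarations[symbol] = (file, line)

    def add_use(self, symbol, file, line, used_by=None):
        self.uses[symbol].append((file, line))
        if used_by is not None:
            # Record the edge so reachability queries can follow it later.
            self.calls[used_by].add(symbol)

    def find_references(self, symbol):
        # A dictionary lookup: cost does not depend on codebase size.
        return list(self.uses.get(symbol, []))

    def jump_to_definition(self, symbol):
        return self.declarations.get(symbol)

    def unreachable_from(self, root):
        """Declared symbols that are not transitively used by `root`."""
        seen = {root}
        stack = [root]
        while stack:
            current = stack.pop()
            for callee in self.calls.get(current, ()):
                if callee not in seen:
                    seen.add(callee)
                    stack.append(callee)
        return sorted(set(self.declarations) - seen)


# Example facts, as an indexer might emit them for a tiny project.
store = FactStore()
store.add_declaration("main", "main.py", 1)
store.add_declaration("helper", "util.py", 4)
store.add_declaration("dead_code", "util.py", 20)
store.add_use("helper", "main.py", 3, used_by="main")

print(store.find_references("helper"))
print(store.jump_to_definition("helper"))
print(store.unreachable_from("main"))
```

An LSP server built on something like this would answer find-references and jump-to-definition from the index instead of re-parsing files on each request.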
        
           | gigatexal wrote:
           | Thank you for this summary. I was unsure of how this is really
           | useful. That before step is missing I think.
        
           | scns wrote:
           | Oh wow, mindblowing stuff. Glad to see tech like this being
           | open sourced, fuels the imagination about possible future
           | scenarios. Do you use it on the Linux Kernel?
        
             | minxomat wrote:
             | A comparable, powerful system (CodeQL) was used recently on
             | the kernel[1] and Chrome. You can learn more about it here:
             | https://codeql.github.com/docs/codeql-overview/about-codeql/
             | 
             | (disclosure: I work at GH on CQL)
             | 
             | [1] https://pwning.systems/posts/sequoia-variant-analysis/
        
               | X6S1x6Okd1st wrote:
               | Oof on the terms & conditions:
               | 
               | https://securitylab.github.com/tools/codeql/license/
        
           | soonnow wrote:
           | > For very large or complex codebases, where it is just too
           | expensive or too complicated to calculate this information
           | locally a system like this becomes very useful.
           | 
           | Thanks I guess I get it now. But to enable this functionality
           | you'd need to have some form of frontend or integration into
           | the existing build lifecycle?
           | 
           | Or IDE integration I guess.
        
       | rognjen wrote:
       | Meta: should the ?open tracking part of the URL be removed?
        
       | dons wrote:
       | We use this to power things like find-references or jump-to-def,
       | "symbol search" and autocomplete, or more complicated code
       | queries and analysis (even across languages). Imagine rich LSPs
       | without a local checkout, web-based code queries, or seeding
       | fuzzers and static analyzers with entry points in code.
       | 
       | Our focus has been on very large scale, multi-language code
       | indexing, and then low latency (e.g. hundreds of micros) query
       | times, to drive highly interactive developer workflows.
        
         | progval wrote:
         | How would it perform for, say, 500TB of source code?
         | 
         | And what would be the disk and memory requirements for this?
         | Could they be distributed across a handful of servers?
        
           | gricardo99 wrote:
           | What on earth has this much source code? Every open source
           | project ever?
        
             | gurleen_s wrote:
             | I mean, yeah. Imagine being able to do more rich queries
             | against GitHub.
        
             | progval wrote:
             | Yes, good guess! That's the size we have after
             | deduplication across projects at
             | https://www.softwareheritage.org/ . We archive all the
             | source code we can find; and would like to support some
             | sort of full-text search on it at some point, so Glean
             | looks interesting
        
           | dmos62 wrote:
           | I'd be surprised if this question could have an off hand
           | answer. Doesn't sound like something that could have
           | scalability predictable enough to do back of the envelope
           | calculations on.
        
         | pdpi wrote:
         | Been away from Fb for a few years. How does this relate to
         | tbgs?
        
           | gaogao wrote:
           | Jump to def is nice when biggrepping a piece of code a la
           | what you can do with codesearch, cs.android.com
        
         | gwbas1c wrote:
         | I'm really struggling to understand what Glean does, and why I
         | would use it. Most important: Your landing page should quickly
         | show what Glean does that a typical IDE (Visual Studio, Visual
         | Studio Code, Eclipse, etc.) doesn't.
         | 
         | Specifically, things like "Go to definition," and tab
         | completion have been in industry-leading IDEs for at least 20
         | years.
         | 
         | What's novel about Glean? It seems like a lot of hoops to jump
         | through when Visual Studio (and Visual Studio Code) can index a
         | very large codebase in a few seconds. (And don't require a
         | server and database to do it.)
         | 
         | Perhaps a 20-second video (no sound) showing what Glean does
         | that other IDEs don't will help get the message across?
        
           | fnord77 wrote:
           | "Go to definition" has been around even longer, since at
           | least the early 90s
        
             | jhayward wrote:
             | I don't recall which version of Emacs first had "go to
             | definition", but it was well before the 90's.
        
           | n_jd wrote:
           | I don't know what Glean is used for, but here are some
           | guesses for this kind of technology:
           | 
           | - find references / go to definition for web tools, like when
           | reviewing pull requests
           | 
           | - multi-language refactoring, e.g. modifying C bindings
           | 
           | - building structural static analysis tools like coccinelle,
           | or semgrep, but better
        
           | maccard wrote:
           | What size codebases do you have that Visual Studio fully indexes
           | in a few seconds? My experience with VS on large projects is that
           | it takes however long the project takes to compile before it's
           | usable, but many functions (go to definition) can occasionally
           | hit a file that needs to be reparsed and can stall for minutes
           | on end. I use VS2019 on a 32-core workstation with 128GB RAM,
           | fwiw.
        
           | conradev wrote:
           | This makes a lot of sense to me through an efficiency lens.
           | 
           | Facebook could spend a lot of money to get engineers beefy
           | workstations, and then have each of these workstations clone
           | the same repository and build the same index locally.
           | 
           | Or, they could leverage the custom built servers in their
           | data centers (which are already more energy-efficient than
           | the laptops), build a single index of the repo, and serve
           | queries on-demand from IDEs throughout the company.
           | 
           | I could also see an analytics angle to this if it could
           | incorporate history and track engineering trends over time.
           | In my experience, decision making in engineering around
           | codebase maintenance is usually rooted in "experience" or
           | "educated guessing" rather than identifying areas of high
           | churn in the codebase or what not.
        
           | masukomi wrote:
           | 100% same take.
           | 
           | I'd add that I didn't want to click "get started" because i
           | didn't know if it was a thing i wanted, and then "get
           | started" actually took me to documentation, which is not what
           | i expect from a "get started" button. The Documentation had
           | the presumption that i wanted to use it, and thus the
           | implication that i knew wtf "it" was.
           | 
           | I don't care about its efficiency, or declarative language,
           | or any of that when i still don't know what we're talking
           | about.
        
           | WastingMyTime89 wrote:
           | > It seems like a lot of hoops to jump through when Visual
           | Studio (and Visual Studio Code) can index a very large
           | codebase in a few seconds.
           | 
           | I think you are not thinking large enough. An IDE absolutely
           | cannot index a very large codebase and allow users to make
           | complex queries on it. Think multiple millions of lines of code
           | here. The use case is closer to "find me all the variables of
           | this type or a type derived from it in all the projects at
           | Facebook" than "go to this definition in the project I'm
           | currently editing".
        
             | Syzygies wrote:
             | There's large, and there's scope. I use VSCode to dabble in
             | dozens of projects across a dozen languages at a time,
             | often coming back to fix things after years. VSCode is
             | great at telling me what I did in the current project, but
             | I can't remember library calls or even syntax without
             | looking at something I wrote before. My efficiency is
             | perhaps 50% at recalling where to look; a tool that kept my
             | entire corpus at my fingertips would be extremely welcome.
             | But I'm failing to see how this is that.
        
               | fragmede wrote:
               | If you've not had to deal with a codebase that takes
               | VSCode longer than a few minutes to index, then you're
               | probably outside their initial target market. If you've
               | not had to set up a hosted code search tool (e.g. livegrep
               | https://github.com/livegrep/livegrep ) because there's
               | just too much code, you've been lucky. If your projects
               | can be scoped, and not pull in code from dozens of
               | libraries, across dozens of teams, many of which are on
               | different continents, you're doing a better job of
               | organizing code than I've been able to manage.
        
         | the_duke wrote:
         | This is really cool.
         | 
         | Seems like there are only indexers for Flow and Hack though.
         | 
         | Will there be more indexers built by Facebook, or will it rely
         | on community contributions?
        
           | dons wrote:
           | A bit of both I think.
        
           | simonmar wrote:
           | There will be more indexers: we have Python, C++/Objective C,
           | Rust, Java and Haskell. It's just a case of getting them
           | ready to open source. You can see the schemas for most of
           | these already in the repo:
           | https://github.com/facebookincubator/Glean/tree/main/glean/s...
        
         | mhitza wrote:
         | Briefly skimmed the docs and it noted that it doesn't store
         | expressions from the parsed AST. That means it's mostly a
         | symbol lookup system?
         | 
         | When doing large system refactoring searching by code patterns
         | is the number one thing I'd like to have a tool for. For
         | example being able to query for all for loops in a codebase
         | that have a call to function X within their body.
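The kind of pattern query asked about here can be sketched, for Python source only, with the standard-library `ast` module. This is not Glean and says nothing about what Glean can or cannot express; `for_loops_calling` and `SAMPLE` are hypothetical names for this example. It works on a single source string rather than a whole-codebase index.

```python
import ast


def _calls(node, func_name):
    """True if `node` contains a call to a plain name `func_name`.

    Only direct calls like `process(x)` are matched; method calls such as
    `obj.process(x)` are deliberately ignored to keep the sketch small.
    """
    return any(
        isinstance(n, ast.Call)
        and isinstance(n.func, ast.Name)
        and n.func.id == func_name
        for n in ast.walk(node)
    )


def for_loops_calling(source, func_name):
    """Return sorted line numbers of `for` loops whose body calls `func_name`.

    Nested loops are each reported: if an inner loop calls the function,
    the enclosing loop is reported as well, since the call is in its body.
    """
    tree = ast.parse(source)
    hits = set()
    for node in ast.walk(tree):
        if isinstance(node, ast.For) and any(
            _calls(stmt, func_name) for stmt in node.body
        ):
            hits.add(node.lineno)
    return sorted(hits)


SAMPLE = """\
for i in range(3):
    process(i)

for j in range(3):
    print(j)

process(0)
"""

print(for_loops_calling(SAMPLE, "process"))
```

Answering this requires the expression structure of loop bodies, which is why a store that keeps only symbol-level facts would not be enough on its own.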
        
         | soonnow wrote:
         | Does that mean you are using the shell or how is it used to
         | enable these functionalities?
        
           | dons wrote:
           | Most clients hit the Glean server via the network
           | (thrift/JSON) and then mostly via language bindings to the
           | Glean query language, Angle. The shell is more for
           | debugging/exploration.
           | 
           | Imagine an IDE plugin that queries Glean over the network for
           | symbol information about the current file, then shows that on
           | hover. That sort of thing.
        
             | soonnow wrote:
             | Alright gotcha. Thanks for the clarification.
        
         | zerr wrote:
         | Since this is HN, could you please share more technical/impl
         | details, e.g. what makes it more scalable and faster in general
         | and also compared to other similar engines?
        
         | gravypod wrote:
         | I see you support Thrift and Buck. Would you also be interested
         | in adding Proto and Bazel support? Being able to query the code
         | based on the build graph (sort of) would be very cool.
        
       | z3t4 wrote:
       | This seems very interesting, would love to see more alternatives
       | to TreeSitter and Microsoft LSP - what makes those hard to use is
       | lack of examples and tutorials. So I hope there will be examples
       | and tutorials. For example: How do you find all variables in
       | scope when the text cursor is on line x and col y in
       | /file/path/file.js
        
       | marcodiego wrote:
       | Is it a modern cscope?
        
       | booleandilemma wrote:
       | The very first page of the site should have examples of what you
       | can do with it.
        
       | enjikaka wrote:
       | Ew, Facebook.
        
       | conductor wrote:
       | To prevent any confusion, this is a different product than
       | Mozilla's Glean [0][1].
       | 
       | [0] https://docs.telemetry.mozilla.org/concepts/glean/glean.html
       | 
       | [1] https://github.com/mozilla/glean/
        
         | senden9 wrote:
         | I was also confused at first, wondering whether (Mozilla) Glean
         | had gained some out-of-scope features.
        
         | [deleted]
        
       | coderdd wrote:
       | Great to see this space moving! Any pointers on diff vs Kythe?
       | Kythe has a mostly fixed schema, for one.
       | 
       | One of the pain points using Kythe is wiring up the indexer to
       | the build system. Would Glean indexers be easier to wire up for
       | the common cases?
       | 
       | Another is the index post-processing, which is not very scalable
       | in the open source version (due to go-beam having rough Flink
       | support, for example).
       | 
       | Third, how does it link up references across compilation units?
       | Is it heuristic, or does it rely on unique keys from indexers matching?
       | Or across languages?
        
         | simonmar wrote:
         | Kythe has one schema, whereas with Glean each language has its
         | own schema with arbitrary amounts of language-specific detail.
         | You can get a language-agnostic view by defining an abstraction
         | layer as a schema. Our current (work in progress)
         | language-agnostic layer is called "codemarkup"
         | https://github.com/facebookincubator/Glean/blob/main/glean/s...
         | 
         | For wiring up the indexer, there are various methods; it tends
         | to depend very much on the language and the build system. For
         | Flow, for example, Glean output is built into the typechecker;
         | you just run it with some flags to spit out the
         | Glean data. For C++, you need to get the compiler flags from
         | the build system to pass to the Clang frontend. For Java the
         | indexer is a compiler plugin; for Python it's built on libCST.
         | Some indexers send their data directly to a Glean server,
         | others generate files of JSON that get sent using a separate
         | command-line tool.
         | 
         | References use different methods depending on the language. For
         | Flow for example there is a fact for an import that matches up
         | with a fact for the export in the other file. For C++ there are
         | facts that connect declarations with definitions, and
         | references with declarations.
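The import/export matching described for Flow can be pictured as a join between two kinds of facts. The sketch below is a guess at the general shape only: the `Export` and `Import` tuples, their fields, and `resolve_imports` are hypothetical and are not taken from Glean's Flow schema.

```python
from typing import NamedTuple


class Export(NamedTuple):
    module: str   # file that exports the name
    name: str
    line: int


class Import(NamedTuple):
    importer: str  # file containing the import statement
    module: str    # file the name is imported from
    name: str
    line: int


def resolve_imports(imports, exports):
    """Match each import fact to the export fact with the same (module, name).

    Returns a dict mapping every import to its export, or to None when no
    matching export fact was indexed.
    """
    by_key = {(e.module, e.name): e for e in exports}
    return {imp: by_key.get((imp.module, imp.name)) for imp in imports}


exports = [Export("util.js", "parse", 10)]
imports = [
    Import("app.js", "util.js", "parse", 1),
    Import("app.js", "util.js", "missing", 2),
]

resolved = resolve_imports(imports, exports)
for imp, exp in resolved.items():
    print(imp.name, "->", exp)
```

Because the match is on exact keys produced by the indexer, there is no heuristic step in this sketch; an unresolved import simply has no matching fact.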
        
           | mrazomor wrote:
           | In case using Kythe was an option, what was the rationale for
           | not using it?
           | 
           | One major limitation of Kythe is handling different versions.
           | For example, Kythe can produce a well connected index of
           | Stackage, but a Hackage would have many holes (not all
           | references would be found, as the unique reference name needs
           | the library version). How does Glean handle different library
           | versions?
           | 
           | EDIT: the language agnostic view is already mentioned.
        
           | Game_Ender wrote:
           | Is there an example of using the C++ indexer? I saw hack and
           | JS on your site but missed C++ (Python would also be
           | amazing!).
        
             | simonmar wrote:
             | We want to open-source the C++ and Python indexers but
             | they're not ready yet - we have to separate them from
             | internal build-system-specific bits.
        
       | ctvo wrote:
       | Great job with this. What's your roadmap for releasing some of
       | the tooling for editor integration? Really, the question is
       | should I build something or wait a few weeks?
        
       | erlich wrote:
       | I can't believe Facebook hasn't canned Flowtype yet and moved to
       | TypeScript. They will have to do it eventually.
        
         | muglug wrote:
         | I'm not sure you understand the scale at which Facebook
         | operates. They don't have to do anything.
         | 
         | As long as billions of people keep using Facebook they can
         | maintain their own static analysis tooling for JavaScript for
         | as long as they want.
        
           | scns wrote:
           | You do have a point, a rewrite on that scale would be a
           | colossal waste of man-years/$$. Your delivery could be nicer
           | though.
        
         | wingspan wrote:
         | The problem is that TypeScript does not scale to the size of
         | the giant monorepo at Facebook, with hundreds of thousands, if
         | not millions of files. Since they aren't organized into
         | packages, it is just one giant flat namespace (any JS file can
         | import any other JS file by the filename). It is pretty amazing
         | to change a core file and see type errors across the entire
         | codebase in a few seconds. The main way to scale in TypeScript
         | is Project References, which don't work when you haven't
         | separated your code into packages. (Worked at Facebook until
         | June 2021).
        
         | ctvo wrote:
         | Doesn't look like they're stopping their use of Hack either.
         | Eventually is a long time so you're right.
        
       | simonw wrote:
       | Feature request: a live demo! I would love to try out the web
       | interface described at https://glean.software/docs/trying without
       | pulling down a 7GB Docker image first.
        
         | [deleted]
        
         | jamessb wrote:
         | Even just a short video of someone using the web interface
         | would be helpful.
        
       | _jezell_ wrote:
       | Is this basically Facebook's version of SourceGraph?
        
       | doddsiedodds wrote:
       | An excellent talk by Simon Marlow on Glean here:
       | https://youtu.be/-OPN7QPsYKE
        
         | simonmar wrote:
         | I should point out that Glean has evolved quite a bit since
         | that talk!
        
       | justinmchase wrote:
       | But what's an example of a fact? Looks cool but I have no idea
       | what it's for.
        
       | da39a3ee wrote:
       | I was recently looking for a library that takes a few lines of
       | source code as input, and predicts the programming language as
       | output.
       | 
       | That seems like a very tractable machine learning problem, yet
       | all I could find was a single python library which looks nice,
       | but doesn't have much adoption, and requires installing the
       | entirety of tensorflow despite the fact that users just want a
       | trained model and a predict() function.
       | 
       | Why doesn't a popular library like this exist?
        
         | jamessb wrote:
         | GitHub's linguist library can be used to identify the
         | programming language of a single file (edit: or of a whole
         | project): https://github.com/github/linguist#single-file
        
           | da39a3ee wrote:
           | Thanks! My searches completely failed to find that. I can't
           | use it as a ruby library, but perhaps I can pull out the
           | heuristics.yml and the naive bayes classifier weights to use
           | in another language.
        
       | ExtraE wrote:
       | What, uh, is this? This is a space that I'm not familiar with and
       | the linked site doesn't make it super clear.
        
       | Grimm1 wrote:
       | Very cool! How does this differ algorithmically from the trigram
       | based search that everything uses from Google Code Search from
       | like 20 years ago?
       | 
       | And continuing off of that theme in practical terms how does it
       | stand up against zoekt?
       | 
       | I'm curious because zoekt is kind of slow when it comes to
       | ingesting large amounts of code like all of the publicly
       | available code on GitHub
       | 
       | The few people using that commercially have basically had to
       | spend a lot of time rewriting parts of it to make their goal of
       | public codesearch for all attainable.
       | 
       | I and a few people I know are pretty convinced that there are
       | better and easier ways / technologies to make that happen.
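For readers unfamiliar with the trigram approach mentioned here, a minimal sketch follows. It illustrates the general technique only (index every three-character substring, intersect posting lists, then verify candidates); `TrigramIndex` is a hypothetical name, and this is not how zoekt, Google Code Search, or Glean is implemented. Elsewhere in the thread Glean is described as storing structured facts about code rather than indexing raw text.

```python
from collections import defaultdict


def trigrams(text):
    """Set of all three-character substrings of `text`."""
    return {text[i:i + 3] for i in range(len(text) - 2)}


class TrigramIndex:
    def __init__(self):
        self.postings = defaultdict(set)  # trigram -> {doc ids}
        self.docs = {}                    # doc id -> full text

    def add(self, doc_id, text):
        self.docs[doc_id] = text
        for gram in trigrams(text):
            self.postings[gram].add(doc_id)

    def search(self, query):
        """Doc ids whose text contains `query` as a substring."""
        grams = trigrams(query)
        if not grams:
            # Queries shorter than three characters cannot use the index,
            # so every document is a candidate.
            candidates = set(self.docs)
        else:
            # A document can only match if it contains every query trigram.
            candidates = set.intersection(
                *(self.postings.get(g, set()) for g in grams)
            )
        # Trigram overlap is necessary but not sufficient; verify each hit.
        return sorted(d for d in candidates if query in self.docs[d])


index = TrigramIndex()
index.add("a.py", "def parse_args(): pass")
index.add("b.py", "def render(): pass")

print(index.search("parse_args"))
print(index.search("def"))
```

The index narrows the set of files to scan, but the final substring check still reads the candidate text, which is part of why ingesting and serving very large corpora this way gets expensive.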
        
       ___________________________________________________________________
       (page generated 2021-08-31 23:02 UTC)