[HN Gopher] Glean - System for collecting, deriving and querying...
___________________________________________________________________
Glean - System for collecting, deriving and querying facts about
source code
Author : dons
Score : 195 points
Date : 2021-08-31 09:56 UTC (13 hours ago)
(HTM) web link (glean.software)
(TXT) w3m dump (glean.software)
| aabaker99 wrote:
| Cool! I would love to play around with this.
|
| How do I write a schema and indexer for my favorite programming
| language that isn't currently (and won't be) supported with
| official releases?
|
| For Schemas, [1] says to modify (or base new ones off) these:
| https://github.com/facebookincubator/Glean/tree/main/glean/s...
|
| For Indexers, it's a little less clear but it looks like I need
| to write my own type checker?
|
| [1] https://glean.software/docs/schema/workflow
| avinassh wrote:
| How does this actually work? Where can I learn more about the
| indexing and searching?
| metalliqaz wrote:
| We have used SciTools Understand to do this on local source code.
| What is the use of putting this in the cloud? The website doesn't
| really explain that.
| tclancy wrote:
| Getting a 401 when trying `docker pull
| ghcr.io/facebookincubator/glean/demo:latest` -- is that true for
| anyone else?
| simonmar wrote:
| Sorry about that, the package was still set to private. Try
| again now?
| tclancy wrote:
| All set, thanks!
| ing33k wrote:
| 7GB docker image !
| log101 wrote:
| I didn't understand what it does
| sealeck wrote:
| It allows you to use a query language (think SQL) to analyze
| source code.
| tyingq wrote:
| The docs seem to get to queries here:
| https://glean.software/docs/angle/guide
| balddenimhero wrote:
| Datalog-ish query languages sure are a fun area to be working in.
| Such DSLs exist for various domains and, like Semmle's codeQL or
| the more academic Souffle, Glean focuses on the domain of
| programming languages.
|
| Glean seems to still be work in progress, e.g. no support for
| recursive queries yet, but I wonder where they're heading. I'll
| certainly keep an eye on the project but I wonder how exactly
| Glean aims to -- or maybe it already does -- improve upon the
| alternatives? From the talk linked in another comment I guess the
| distinctive feature may be the planned integration with IDEs.
| Correct me if I'm wrong. Other contenders provide great querying
| technology but there is indeed no strong focus on making such
| tech really convenient and integrated yet.
| dons wrote:
| I think the point in the space Glean hits well is
| efficiency/latency (enough to power real time editing, like in
| IDE autocomplete or navigation), while having a schema and
| query language generic enough to do multiple languages and
| code-like things. You can accurately query JavaScript or Rust
| or PHP or Python or C++ with a common interface, which is a bit
| nuts :D
| soonnow wrote:
| I had a look at the site and it seems to be parsing source code
| in multiple languages and storing the parsed "syntax trees" into
| a database for querying.
|
| I would love to know what the use case for this tool is aside from
| maybe being a source for presentations? (We have 5 million if
| statements).
|
| How can this be used to improve code quality or any other aspect
| of the code lifecycle?
|
| Or is it solving problems in a completely different problem area?
| lazamar wrote:
| Glean is focused on storing and querying data about the code.
| The idea is that you have your own program to collect that
| data, then you use Glean to store that compactly and to have
| snappy queries.
|
| You would create entries like "this is a declaration of X",
| "this is a use of X". Then you can query things like "give me
| all uses of X" in sub-millisecond time. You hook that up to an
| LSP server then you get almost zero-cost find-references, jump-
| to-definition, etc. The snappy queries also mean it becomes
| possible to perform whole codebase (and cross-language)
| analysis. That is, answering questions like "what code is not
| referenced from this root?", "does this Haskell function use
| anything that calls malloc?" (analysis through the ffi
| barrier).
|
| One can also attach all kinds of information from different
| sources to code entities, not only things derived from the
| source itself. You add things like run-time costs, frequency of
| use, common errors, etc, and an LSP server could make all of it
| available right in your editor.
|
| For very large or complex codebases, where it is just too
| expensive or too complicated to calculate this information
| locally a system like this becomes very useful.
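The declaration/use idea described above can be sketched in a few lines of Python. This is only an illustration of the concept, not Glean's API, schema, or storage format: `FactStore`, `declare`, `use`, `find_references` and `unreferenced_from` are hypothetical names, and the "inside" argument is an assumed way of recording which declaration a use occurs in.

```python
from collections import defaultdict


class FactStore:
    """Toy in-memory store of 'declaration' and 'use' facts (hypothetical)."""

    def __init__(self):
        self.decls = {}                 # symbol -> location of its declaration
        self.uses = defaultdict(list)   # symbol -> locations where it is used
        self.edges = defaultdict(set)   # enclosing symbol -> symbols it uses

    def declare(self, symbol, location):
        self.decls[symbol] = location

    def use(self, symbol, location, inside=None):
        # Record a use of `symbol`; `inside` names the enclosing declaration.
        self.uses[symbol].append(location)
        if inside is not None:
            self.edges[inside].add(symbol)

    def find_references(self, symbol):
        # "Give me all uses of X": a single dictionary lookup.
        return list(self.uses.get(symbol, []))

    def unreferenced_from(self, root):
        # "What code is not referenced from this root?": graph reachability.
        seen, stack = set(), [root]
        while stack:
            current = stack.pop()
            if current in seen:
                continue
            seen.add(current)
            stack.extend(self.edges.get(current, ()))
        return sorted(set(self.decls) - seen)


store = FactStore()
store.declare("main", "app.py:1")
store.declare("helper", "util.py:3")
store.declare("unused", "util.py:9")
store.use("helper", "app.py:2", inside="main")

print(store.find_references("helper"))   # ['app.py:2']
print(store.unreferenced_from("main"))   # ['unused']
```

The point of the sketch is that once facts are precomputed by an indexer, find-references is a lookup and dead-code detection is a graph walk over stored edges; neither needs to re-parse source code at query time.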
| gigatexal wrote:
| Thank you for this summary; I was unsure of how this is really
| useful. That before step is missing I think.
| scns wrote:
| Oh wow, mindblowing stuff. Glad to see tech like this being
| open sourced, fuels the imagination about possible future
| scenarios. Do you use it on the Linux Kernel?
| minxomat wrote:
| A comparable, powerful system (CodeQL) was used recently on
| the kernel[1] and Chrome. You can learn more about it here:
| https://codeql.github.com/docs/codeql-overview/about-codeql/
|
| (disclosure: I work at GH on CQL)
|
| [1] https://pwning.systems/posts/sequoia-variant-analysis/
| X6S1x6Okd1st wrote:
| Oof on the terms & conditions:
|
| https://securitylab.github.com/tools/codeql/license/
| soonnow wrote:
| > For very large or complex codebases, where it is just too
| expensive or too complicated to calculate this information
| locally a system like this becomes very useful.
|
| Thanks I guess I get it now. But to enable this functionality
| you'd need to have some form of frontend or integration into
| the existing build lifecycle?
|
| Or IDE integration I guess.
| rognjen wrote:
| Meta: should the ?open tracking part of the URL be removed?
| dons wrote:
| We use this to power things like find-references or jump-to-def,
| "symbol search" and autocomplete, or more complicated code
| queries and analysis (even across languages). Imagine rich LSPs
| without a local checkout, web-based code queries, or seeding
| fuzzers and static analyzers with entry points in code.
|
| Our focus has been on very large scale, multi-language code
| indexing, and then low latency (e.g. hundreds of micros) query
| times, to drive highly interactive developer workflows.
| progval wrote:
| How would it perform for, say, 500TB of source code?
|
| And what would be the disk and memory requirements for this?
| Could they be distributed across a handful of servers?
| gricardo99 wrote:
| What on earth has this much source code? Every open source
| project ever?
| gurleen_s wrote:
| I mean, yeah. Imagine being able to do more rich queries
| against GitHub.
| progval wrote:
| Yes, good guess! That's the size we have after
| deduplication across projects at
| https://www.softwareheritage.org/ . We archive all the
| source code we can find; and would like to support some
| sort of full-text search on it at some point, so Glean
| looks interesting
| dmos62 wrote:
| I'd be surprised if this question could have an off hand
| answer. Doesn't sound like something that could have
| scalability predictable enough to do back of the envelope
| calculations on.
| pdpi wrote:
| Been away from Fb for a few years. How does this relate to
| tbgs?
| gaogao wrote:
| Jump to def is nice when biggrepping a piece of code a la
| what you can do with codesearch, cs.android.com
| gwbas1c wrote:
| I'm really struggling to understand what Glean does, and why I
| would use it. Most important: Your landing page should quickly
| show what Glean does that a typical IDE (Visual Studio, Visual
| Studio Code, Eclipse, etc.) doesn't.
|
| Specifically, things like "Go to definition," and tab
| completion have been in industry-leading IDEs for at least 20
| years.
|
| What's novel about Glean? It seems like a lot of hoops to jump
| through when Visual Studio (and Visual Studio Code) can index a
| very large codebase in a few seconds. (And don't require a
| server and database to do it.)
|
| Perhaps a 20-second video (no sound) showing what Glean does
| that other IDEs don't will help get the message across?
| fnord77 wrote:
| "Go to definition" has been around even longer, since at
| least the early 90s
| jhayward wrote:
| I don't recall which version of Emacs first had "go to
| definition", but it was well before the 90's.
| n_jd wrote:
| I don't know what Glean is used for, but here are some
| guesses for this kind of technology:
|
| - find references / go to definition for web tools, like when
| reviewing pull requests
|
| - multi-language refactoring, e.g. modifying C bindings
|
| - building structural static analysis tools like coccinelle,
| or semgrep, but better
| maccard wrote:
| What size codebases do you have that a few seconds has visual
| studio fully indexing it? My experience with VS on large
| projects is that it takes however long the project takes to
| compile before it's usable, but many functions (go to
| definition) can occasionally hit a file that needs to be
| reparsed and can stall for minutes on end. I use VS2019 on a
| 32 core workstation with 128GB ram, fwiw.
| conradev wrote:
| This makes a lot of sense to me through an efficiency lens.
|
| Facebook could spend a lot of money to get engineers beefy
| workstations, and then have each of these workstations clone
| the same repository and build the same index locally.
|
| Or, they could leverage the custom built servers in their
| data centers (which are already more energy-efficient than
| the laptops), build a single index of the repo, and serve
| queries on-demand from IDEs throughout the company.
|
| I could also see an analytics angle to this if it could
| incorporate history and track engineering trends over time.
| In my experience, decision making in engineering around
| codebase maintenance is usually rooted in "experience" or
| "educated guessing" rather than identifying areas of high
| churn in the codebase or what not.
| masukomi wrote:
| 100% same take.
|
| I'd add that I didn't want to click "get started" because i
| didn't know if it was a thing i wanted, and then "get
| started" actually took me to documentation, which is not what
| i expect from a "get started" button. The Documentation had
| the presumption that i wanted to use it, and thus the
| implication that i knew wtf "it" was.
|
| I don't care about its efficiency, or declarative language,
| or any of that when i still don't know what we're talking
| about.
| WastingMyTime89 wrote:
| > It seems like a lot of hoops to jump through when Visual
| Studio (and Visual Studio Code) can index a very large
| codebase in a few seconds.
|
| I think you are not thinking large enough. An IDE absolutely
| can not index a very large codebase and allow users to make
| complex queries on it. Think multiple millions lines of code
| here. The use case is closer to "find me all the variables of
| this type or a type derived from it in all the projects at
| Facebook" than "go to this definition in the project I'm
| currently editing".
| Syzygies wrote:
| There's large, and there's scope. I use VSCode to dabble in
| dozens of projects across a dozen languages at a time,
| often coming back to fix things after years. VSCode is
| great at telling me what I did in the current project, but
| I can't remember library calls or even syntax without
| looking at something I wrote before. My efficiency is
| perhaps 50% at recalling where to look; a tool that kept my
| entire corpus at my fingertips would be extremely welcome.
| But I'm failing to see how this is that.
| fragmede wrote:
| If you've not had to deal with a codebase that takes
| VSCode longer than a few minutes to index, then you're
| probably outside their initial target market. If you've
| not had to setup a hosted code search tool (eg livegrep
| https://github.com/livegrep/livegrep ) because there's
| just too much code, you've been lucky. If your projects
| can be scoped, and not pull in code from dozens of
| libraries, across dozens of teams, many of which are on
| different continents, you're doing a better job of
| organizing code than I've been able to manage.
| the_duke wrote:
| This is really cool.
|
| Seems like there are only indexers for Flow and Hack though.
|
| Will there be more indexers built by Facebook, or will it rely
| on community contributions?
| dons wrote:
| A bit of both I think.
| simonmar wrote:
| There will be more indexers: we have Python, C++/Objective C,
| Rust, Java and Haskell. It's just a case of getting them
| ready to open source. You can see the schemas for most of
| these already in the repo:
| https://github.com/facebookincubator/Glean/tree/main/glean/s...
| mhitza wrote:
| Briefly skimmed the docs and it noted that it doesn't store
| expressions from the parsed AST. That means it's mostly a
| symbol lookup system?
|
| When doing large system refactoring searching by code patterns
| is the number one thing I'd like to have a tool for. For
| example being able to query for all for loops in a codebase
| that have a call to function X within their body.
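The kind of pattern query asked for here (all `for` loops whose body calls function X) can be illustrated with Python's standard `ast` module. This is a stand-in technique for a single Python source string, not something Glean is stated to support; the helper names `callee_name` and `for_loops_calling` are made up for this sketch.

```python
import ast


def callee_name(call):
    """Return the simple name of the called function, or None."""
    func = call.func
    if isinstance(func, ast.Name):        # plain call: process(x)
        return func.id
    if isinstance(func, ast.Attribute):   # method-style call: obj.process(x)
        return func.attr
    return None


def for_loops_calling(source, func_name):
    """Return line numbers of `for` loops whose body calls `func_name`.

    Nested loops are each reported, so an outer loop that contains a
    matching inner loop is listed as well.
    """
    hits = []
    for node in ast.walk(ast.parse(source)):
        if not isinstance(node, (ast.For, ast.AsyncFor)):
            continue
        found = any(
            isinstance(inner, ast.Call) and callee_name(inner) == func_name
            for stmt in node.body
            for inner in ast.walk(stmt)
        )
        if found:
            hits.append(node.lineno)
    return sorted(hits)


SAMPLE = """
for item in items:
    process(item)

for row in rows:
    print(row)

while True:
    process(None)
"""

print(for_loops_calling(SAMPLE, "process"))   # [2]
```

Answering this needs the expression-level tree, which is why a store that keeps only declarations and references would not be enough for it.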
| soonnow wrote:
| Does that mean you are using the shell or how is it used to
| enable these functionalities?
| dons wrote:
| Most clients hit the Glean server via the network
| (thrift/JSON) and then mostly via language bindings to the
| Glean query language, Angle. The shell is more for
| debugging/exploration.
|
| Imagine an IDE plugin that queries Glean over the network for
| symbol information about the current file, then shows that on
| hover. That sort of thing.
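The hover scenario above can be sketched as a lookup from a cursor position to precomputed symbol information. The network call to a Glean server is deliberately left out; `FileSymbols` and its span tuples are hypothetical stand-ins for whatever a real client would receive for the current file, and spans are assumed not to overlap.

```python
from bisect import bisect_right


class FileSymbols:
    """Symbol spans for one file, as an editor plugin might cache them.

    Each span is (line, start_col, end_col, symbol, info), end_col exclusive.
    """

    def __init__(self, spans):
        self.spans = sorted(spans)
        # Parallel list of sort keys so lookups can use binary search.
        self.keys = [(line, start) for line, start, _, _, _ in self.spans]

    def hover(self, line, col):
        """Return (symbol, info) under the cursor, or None."""
        i = bisect_right(self.keys, (line, col)) - 1
        if i < 0:
            return None
        span_line, start, end, symbol, info = self.spans[i]
        if span_line == line and start <= col < end:
            return symbol, info
        return None


symbols = FileSymbols([
    (7, 0, 4, "main", "def main() -> None"),
    (3, 4, 10, "helper", "def helper(x: int) -> int"),
])

print(symbols.hover(3, 5))    # ('helper', 'def helper(x: int) -> int')
print(symbols.hover(3, 10))   # None
```

In this arrangement the plugin fetches the spans once per file and answers each hover locally, so only the initial fetch pays network latency.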
| soonnow wrote:
| Alright gotcha. Thanks for the clarification.
| zerr wrote:
| Since this is HN, could you please share more technical/impl
| details, e.g. what makes it more scalable and faster in general
| and also compared to other similar engines?
| gravypod wrote:
| I see you support Thrift and Buck. Would you also be interested
| in adding Proto and Bazel support? Being able to query the code
| based on the build graph (sort of) would be very cool.
| z3t4 wrote:
| This seems very interesting, would love to see more alternatives
| to TreeSitter and microsoft LSP - what makes those hard to use is
| lack of examples and tutorials. So I hope there will be examples
| and tutorials. For example: How do you find all variables in
| scope when the text cursor is on line x and col y in
| /file/path/file.js
| marcodiego wrote:
| Is it a modern cscope?
| booleandilemma wrote:
| The very first page of the site should have examples of what you
| can do with it.
| enjikaka wrote:
| Ew, Facebook.
| conductor wrote:
| To prevent any confusion, this is a different product than
| Mozilla's Glean [0][1].
|
| [0] https://docs.telemetry.mozilla.org/concepts/glean/glean.html
|
| [1] https://github.com/mozilla/glean/
| senden9 wrote:
| I was also confused first if (Mozilla) glean gained some out-
| of-scope features.
| [deleted]
| coderdd wrote:
| Great to see this space moving! Any pointers on diff vs Kythe?
| Kythe has a mostly fixed schema, for one.
|
| One of the pain points using Kythe is wiring up the indexer to
| the build system. Would Glean indexers be easier to wire up for
| the common cases?
|
| Other is the index post-processing, which is not very scalable in
| the open source version (due to go-beam having rough Flink
| support, for example).
|
| Third, how does it link up references across compilation units?
| Is it heuristic, or relies on unique keys from indexers matching?
| Or across languages?
| simonmar wrote:
| Kythe has one schema, whereas with Glean each language has its
| own schema with arbitrary amounts of language-specific detail.
| You can get a language-agnostic view by defining an abstraction
| layer as a schema. Our current (work in progress) language-
| agnostic layer is called "codemarkup"
| https://github.com/facebookincubator/Glean/blob/main/glean/s...
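The idea of a language-agnostic layer over language-specific schemas can be sketched as a derived view. The record shapes below are invented for illustration and are not the actual Flow, C++, or codemarkup schemas; `unified_entities` is a hypothetical name.

```python
def unified_entities(flow_decls, cxx_decls):
    """Map two differently shaped fact sets onto one common record shape."""
    entities = []
    for decl in flow_decls:
        entities.append({
            "language": "flow",
            "name": decl["identifier"],
            "file": decl["file"],
            "line": decl["line"],
        })
    for decl in cxx_decls:
        entities.append({
            "language": "cpp",
            "name": decl["qualified_name"],
            "file": decl["path"],
            "line": decl["span"][0],   # keep only the starting line
        })
    return entities


# Assumed, made-up per-language facts with different fields.
flow_facts = [{"identifier": "render", "file": "ui.js", "line": 10}]
cxx_facts = [{"qualified_name": "ns::render", "path": "ui.cpp", "span": (42, 48)}]

for entity in unified_entities(flow_facts, cxx_facts):
    print(entity["language"], entity["name"], entity["file"], entity["line"])
```

Clients that only need "name, file, line" can query the common shape, while language-specific tools can still read the richer per-language facts.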
|
| For wiring up the indexer, there are various methods, it tends
| to depend very much on the language and the build system. For
| Flow for example, Glean output is just built into the
| typechecker, you just run it with some flags to spit out the
| Glean data. For C++, you need to get the compiler flags from
| the build system to pass to the Clang frontend. For Java the
| indexer is a compiler plugin; for Python it's built on libCST.
| Some indexers send their data directly to a Glean server,
| others generate files of JSON that get sent using a separate
| command-line tool.
|
| References use different methods depending on the language. For
| Flow for example there is a fact for an import that matches up
| with a fact for the export in the other file. For C++ there are
| facts that connect declarations with definitions, and
| references with declarations.
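Matching an import fact with the export fact in another file, as described for Flow above, amounts to a join on a shared key. A minimal sketch, assuming the key is (module, name) and using made-up tuple layouts rather than Glean's real fact format:

```python
def link_imports(exports, imports):
    """Join import facts to export facts on (module, name).

    exports: iterable of (module, name, export_location)
    imports: iterable of (importing_file, module, name, import_location)
    Returns (links, unresolved), where links pairs each import location
    with the location of the export it refers to.
    """
    export_index = {(module, name): loc for module, name, loc in exports}
    links, unresolved = [], []
    for _importing_file, module, name, import_loc in imports:
        target = export_index.get((module, name))
        if target is None:
            unresolved.append((import_loc, module, name))
        else:
            links.append((import_loc, target))
    return links, unresolved


exports = [("utils.js", "formatDate", "utils.js:12")]
imports = [
    ("app.js", "utils.js", "formatDate", "app.js:1"),
    ("app.js", "utils.js", "missing", "app.js:2"),
]

links, unresolved = link_imports(exports, imports)
print(links)        # [('app.js:1', 'utils.js:12')]
print(unresolved)   # [('app.js:2', 'utils.js', 'missing')]
```

Because the join relies on exact key matches rather than heuristics, an import with no matching export shows up explicitly as unresolved.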
| mrazomor wrote:
| In case using Kythe was an option, what was the rationale for
| not using it?
|
| One major limitation of Kythe is handling different versions.
| For example, Kythe can produce a well connected index of
| Stackage, but a Hackage would have many holes (not all
| references would be found, as the unique reference name needs
| the library version). How does Glean handle different library
| versions?
|
| EDIT: the language agnostic view is already mentioned.
| Game_Ender wrote:
| Is there an example of using the C++ indexer? I saw hack and
| JS on your site but missed C++ (Python would also be
| amazing!).
| simonmar wrote:
| We want to open-source the C++ and Python indexers but
| they're not ready yet - we have to separate them from
| internal build-system-specific bits.
| ctvo wrote:
| Great job with this. What's your roadmap for releasing some of
| the tooling for editor integration? Really, the question is
| should I build something or wait a few weeks?
| erlich wrote:
| I can't believe Facebook hasn't canned Flowtype yet and moved to
| TypeScript. They will have to do it eventually.
| muglug wrote:
| I'm not sure you understand the scale at which Facebook
| operates. They don't have to do anything.
|
| As long as billions of people keep using Facebook they can
| maintain their own static analysis tooling for Javascript for
| as long as they want.
| scns wrote:
| You do have a point, a rewrite on that scale would be a
| colossal waste of man-years/$$. Your delivery could be nicer
| though.
| wingspan wrote:
| The problem is that TypeScript does not scale to the size of
| the giant monorepo at Facebook, with hundreds of thousands, if
| not millions of files. Since they aren't organized into
| packages, it is just one giant flat namespace (any JS file can
| import any other JS file by the filename). It is pretty amazing
| to change a core file and see type errors across the entire
| codebase in a few seconds. The main way to scale in TypeScript
| is Project References, which don't work when you haven't
| separated your code into packages. (Worked at Facebook until
| June 2021).
| ctvo wrote:
| Doesn't look like they're stopping their use of Hack either.
| Eventually is a long time so you're right.
| simonw wrote:
| Feature request: a live demo! I would love to try out the web
| interface described at https://glean.software/docs/trying without
| pulling down a 7GB Docker image first.
| [deleted]
| jamessb wrote:
| Even just a short video of someone using the web interface
| would be helpful.
| _jezell_ wrote:
| Is this basically Facebook's version of SourceGraph?
| doddsiedodds wrote:
| An excellent talk by Simon Marlow on Glean here:
| https://youtu.be/-OPN7QPsYKE
| simonmar wrote:
| I should point out that Glean has evolved quite a bit since
| that talk!
| justinmchase wrote:
| But what's an example of a fact? Looks cool but I have no idea
| what it's for.
| da39a3ee wrote:
| I was recently looking for a library that takes a few lines of
| source code as input, and predicts the programming language as
| output.
|
| That seems like a very tractable machine learning problem, yet
| all I could find was a single python library which looks nice,
| but doesn't have much adoption, and requires installing the
| entirety of tensorflow despite the fact that users just want a
| trained model and a predict() function.
|
| Why doesn't a popular library like this exist?
| jamessb wrote:
| GitHub's linguist library can be used to identify the
| programming language of a single file (edit: or of a whole
| project): https://github.com/github/linguist#single-file
| da39a3ee wrote:
| Thanks! My searches completely failed to find that. I can't
| use it as a ruby library, but perhaps I can pull out the
| heuristics.yml and the naive bayes classifier weights to use
| in another language.
| ExtraE wrote:
| What, uh, is this? This is a space that I'm not familiar with and
| the linked site doesn't make it super clear.
| Grimm1 wrote:
| Very cool! How does this differ algorithmically from the trigram
| based search that everything uses from google code search from
| like 20 years ago?
|
| And continuing off of that theme in practical terms how does it
| stand up against zoekt?
|
| I'm curious because zoekt is kind of slow when it comes to
| ingesting large amounts of code like all of the publicly
| available code on GitHub
|
| The few people using that commercially have basically had to
| spend a lot of time rewriting parts of it to make their goal of
| public codesearch for all attainable.
|
| I and a few people I know are pretty convinced that there are
| better and easier ways / technologies to make that happen.
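For readers unfamiliar with the trigram-based search mentioned above, here is a minimal sketch of the general technique: index every three-character substring, intersect posting lists to get candidates, then confirm with a real substring check. It is a toy illustration, not the implementation used by Google Code Search or zoekt, and `TrigramIndex` is a hypothetical name.

```python
from collections import defaultdict


def trigrams(text):
    """Return the set of all three-character substrings of `text`."""
    return {text[i:i + 3] for i in range(len(text) - 2)}


class TrigramIndex:
    def __init__(self):
        self.postings = defaultdict(set)   # trigram -> ids of docs containing it
        self.docs = {}                     # doc id -> full text

    def add(self, doc_id, text):
        self.docs[doc_id] = text
        for gram in trigrams(text):
            self.postings[gram].add(doc_id)

    def search(self, query):
        grams = trigrams(query)
        if grams:
            # A match must contain every trigram of the query.
            candidates = set.intersection(
                *(self.postings.get(gram, set()) for gram in grams)
            )
        else:
            # Queries shorter than three characters cannot use the index.
            candidates = set(self.docs)
        # Trigram overlap is necessary but not sufficient, so verify.
        return sorted(d for d in candidates if query in self.docs[d])


index = TrigramIndex()
index.add("a.c", "int main() { return malloc_wrapper(); }")
index.add("b.c", "void free_all() {}")

print(index.search("malloc"))   # ['a.c']
print(index.search("xyz"))      # []
```

This kind of index answers "where does this text appear?", which is a different question from the declaration and reference facts discussed elsewhere in the thread.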
___________________________________________________________________
(page generated 2021-08-31 23:02 UTC)