[HN Gopher] Stack Graphs
___________________________________________________________________
Stack Graphs
Author : todsacerdoti
Score : 220 points
Date : 2021-12-09 17:51 UTC (5 hours ago)
(HTM) web link (github.blog)
(TXT) w3m dump (github.blog)
| spinningslate wrote:
| Cool to see this is inspired by Eelco Visser's [1] group at TU-
| Delft, who originated the concept of scope graphs [2].
|
| The back story of their language design research and tools is
| worth reading [3]. Not least because Nix originated as their
| build system!
|
| [1] https://eelcovisser.org/
|
| [2] https://github.blog/2021-12-09-introducing-stack-graphs/
|
| [3] https://eelcovisser.org/blog/2021/02/08/spoofax-mip/
|
| --
|
| EDIT: Fixed formatting.
| dcreager wrote:
| Hard agree, this is very much a "standing on the shoulders of
| giants" situation. Eelco's group has done amazing work on scope
| graphs, and we would not have been able to make this without
| that work as a foundation.
| spinningslate wrote:
| ...and I should have said: great that you acknowledged it in
| your blog too.
| gorgoiler wrote:
| How do I plumb this precise code jumping into vim?
|
| My tags files just don't seem to cut it any more.
| dcreager wrote:
| The most straightforward option would probably be to write an
| LSP wrapper around the stack graphs code. One of the people on
| my team wrote a very "contrib directory" version of that for
| internal testing, which lets us test our stack graph rules for
| a new language in something like VS Code before deploying to
| production. If someone were to write a more polished version of
| that, that would be a great way to get code nav into any LSP-
| compatible editor for any language that supports stack graphs.
| bbkane wrote:
| NeoVim has Tree-sitter and Language Server integration. I
| couldn't make it work for me, bit lots of people really love
| it!
| botdan wrote:
| If you haven't tried again recently the neovim team has done
| a ton of work updating the documentation on [nvim-
| lspconfig](https://github.com/neovim/nvim-lspconfig). There's
| also projects like [kickstart.nvim](https://github.com/nvim-
| lua/kickstart.nvim) which aim to provide a very simple
| starting point for new users. It's "batteries-included"
| neovim which notably includes LSP, TreeSitter, completion
| engines, and some basic git functionality.
| jandinter wrote:
| Looks neat!
|
| I wish this was available for legal texts, making it easy to jump
| from one law to the referenced next legal provision. Many legal
| provisions, especially in very regulated areas, make use of
| "functions" "imported" from other, totally different laws.
|
| Sorry for being off-topic, but if anyone knows a resource for
| that, I am super interested!
| earth_walker wrote:
| I started doing this for a niche area: US and European
| regulations and guidance documents for Good Laboratory
| Practice, and later for Canadian Cannabis regulations.
| Basically I created a standard XML schema for regulations and
| parsed them into XML [1]. This allowed for e.g. presenting
| tables of contents and section folding, pulling and linking
| definitions into their own search engine, etc. [2]
|
| I thought that I could easily write a parser for each
| jurisdiction's formats, and then get predicate rules and
| related regulations for free.
|
| I was wrong. a) there are many jurisdictions and sub-groups all
| doing their own thing; and b) most don't have any standard
| document formatting or tagging, let alone a defined structure.
| Even in the most structured formats (like the US eCFR's XML)
| the focus is on display rather than content. In the worst cases
| it was just whoever wrote up the Word document chose how they
| numbered and formatted chapters and sections etc.
|
| There were so many special cases that it was a huge amount of
| work to add or update each document, and I ended up doing a lot
| of categorization and fixing by hand.
|
| [1] I know people hate XML on HN, but I did my research and had
| specific reasons for choosing it at the time, including human
| readable, nesting sections, being able to easily publish and
| validate a schema, etc.
|
| [2] See ReadtheRegs.com. You can browse the definitions page
| without an account.
| jandinter wrote:
| This looks great! I share your sentiment: I looked into the
| XML files for the published German legal texts[1], and they
| seem to be made for display purposes only.
|
| [1] Table of contents for XML files: https://www.gesetze-im-
| internet.de/gii-toc.xml
| masklinn wrote:
| > I wish this was available for legal texts, making it easy to
| jump from one law to the referenced next legal provision. Many
| legal provisions, especially in very regulated areas, make use
| of "functions" "imported" from other, totally different laws.
|
| I mean, it "is", to the extent that if you put in the work of
| hyperlinking all the things during the digitizing process they
| can be.
|
| Legifrance is fairly highly (though nowhere near completely)
| hyperlinked for instance, here's one of the laws I selected
| from the front page:
| https://www.legifrance.gouv.fr/jorf/id/JORFTEXT000044446848
|
| Many (though nowhere near all) the legal texts being
| referenced, modified, or inserted (as references, into other
| texts) are hyperlinked.
| ova-throwaway wrote:
| ? baby lawyer and former dev here: don't we have that anyways?
| E.g. on Casetext, Lexis, all the usual legal research sites.
|
| I personally haven't encountered a situation where it was
| _totally_ lacking.
| jandinter wrote:
| Many available options seem to be based on manual annotation
| and, therefore, cover a limited range of all legal texts.
| Especially with regard to regulatory topics, those research
| sites usually fall short.
| ss108 wrote:
| I would be interested to know where you're encountering
| these issues, specifically. I'm interested in legal tech,
| would like to know where the gaps are
| cloogshicer wrote:
| I just want to say: This is an absolutely fantastic blog post. So
| well written, and really well done graphics that help
| understanding.
| dcreager wrote:
| Thanks very much, I appreciate the kind words!
| armchairhacker wrote:
| is this from Github semantic
| (https://github.com/github/semantic)?
|
| Seems very suspicious since it's the same goal using the same
| technologies. The latest commit is 4mo ago but i assume they have
| a closed-source version they've been working on.
| dcreager wrote:
| It's from the same team (which I am the manager of), but it's
| not using that same codebase. In Semantic, you would have to
| write Haskell code to add support for a new language, and we've
| found that the declarative DSLs that tree-sitter provides are a
| lower barrier to entry. (Semantic also uses tree-sitter for the
| parsing step, btw.) We do still have plans for Semantic, but
| our stack graph code does not live there.
|
| _[Edit]_ Also, the stack graph implementation is also open-
| source, just like Semantic, and we do our development on the
| core algorithms directly there. The Python extraction rules
| have not yet been moved over to the public tree-sitter-python
| language repo, but that 's on the docket. Future language
| support would happen directly in each language's public open-
| source tree-sitter repo.
|
| https://github.com/github/stack-graphs/
| parhamn wrote:
| Are there any editors that let you edit code in a structure
| closer to the way the language handles it (e.g. graphs/stacks)?
|
| You could achieve a lot more than the conventional file-per-
| module approach. Off the top of my head benefits could include:
| much easier refactoring, comments bound to specific tokens,
| function level versioning, much smarter git diffs, and so much
| more.
|
| The mapping graphs to flat-files thing feels especially silly
| when I do FE dev. Manipulating JSX doms on top of the JS function
| stack is a constant reminder of how the flattening step feels
| unnecessary.
| sweetsocks21 wrote:
| I don't know of any generic editor that can take a current
| (popular) language and do that. There is however a lot going on
| in this space, and I'll list some examples. Most are about
| editing the tree structure better, but some do branch out a bit
| more to the graph idea.
|
| https://docs.darklang.com/structured-editing
|
| https://hazel.org/
|
| http://lighttable.com/
|
| Emacs has some structural editing plugins like ParEdit
|
| Many graph/node based editors like Blender shaders (3D),
| Reaktor (music)
|
| https://www.unisonweb.org/ (this has some of the function
| versioning ideas)
|
| https://enso.org/
|
| Smalltalk
|
| There's many many more, these are just some I remembered off
| the top of my head.
| parhamn wrote:
| I appreciate the links!
| hiaux0 wrote:
| The Dion editor [1] is trying to tackling this problem, if I
| understand correctly. [2] has some sweet GIFs for you.
|
| [1]: https://dion.systems/dion_format.html
|
| [2]: https://dion.systems/gallery.html
| ball_of_lint wrote:
| Lisp + SLIME is probably the closest thing that exists to what
| you're describing.
| dcreager wrote:
| OP author here. I also gave a talk about this at Strange Loop
| back in October if folks want to watch/listen instead of read:
| https://dcreager.net/talks/2021-strange-loop/
| dsanchez97 wrote:
| I haven't had time to watch the full talk yet, so sorry if this
| is answered there.
|
| When python resolves 'import' statements, it looks for the
| modules based on the PYTHONPATH. Although not done that often,
| it is possible to modify the PYTHONPATH at runtime, changing
| what an imported symbol will resolve to. How do you handle
| situations like that?
|
| Just from a hypothetical stand point, someone could take
| advantage of this to make it seem like the library is linking
| to a safe implementation of a function such that when using
| this feature people are directed to the safe implementation.
| Then at runtime without the user knowing, they could
| dynamically change the PYTHONPATH so a malicious version of the
| function is loaded.
| dcreager wrote:
| Ooh that's a good one. Right now, our lookups are only within
| the single repository. And so if you had two files that
| _could_ provide the same (fully qualified) symbol, we aren't
| doing any PYTHONPATH analysis to determine which one it is.
| We'll show you both.
|
| We do eventually want to support cross-repository use cases,
| and there, the answer boils down to needing to find the set
| of dependencies in which to do the search. One we have that,
| it's no different than an in-repo case -- we look for any
| file in _any_ of the repos (yours and your dependencies) that
| could provide the symbol that we 're currently looking for.
|
| So, short version, we'd be aiming for a solution where we'd
| be able to show you both the "good" and "bad" definitions,
| and let you the user decide how to use that information.
| dsanchez97 wrote:
| I have been doing some stuff where I analyze python code
| via the AST to try to figure out symbol reference so it was
| top of mind when I read the article. My tool works at
| runtime by importing the users code as module, which means
| all the symbols are evaluated by the python interpreter and
| then I can inspect the loaded module to determine the
| references. This is all part of a larger framework that has
| lifecycle rules for how/when it will load user defined
| code, which allows me some flexibility and information.
|
| Even with that flexibility, there are still some things
| that just weren't possible because of how configurable
| python is at runtime. For example, someone could write a
| factory style class that dynamically creates python object
| instances based on a passed in string that represents the
| class the object will be of. Then they could pass user
| input into this factory making the created objects
| completely dependent on runtime input.
|
| I would wage 99% of python written doesn't use these kinds
| of runtime abilities, and it probably isn't a great
| practice to use them in general from a maintainability
| point of view but they do exist. My solution to this is
| that if you are sophisticated enough to be using these
| features then you should be able to understand why my tool
| can't capture that information from the AST.
|
| Not sure if that solution would work for what you are
| working on, but I figured I'd let you know about my
| experience because it can get gnarly quickly once you start
| thinking about all the things that are possible in python.
| dcreager wrote:
| Yes! Those are exactly the kinds of examples that mess up
| this kind of analysis. Anything where the _structure_ of
| your program depends on arbitrary computation:
| https://twitter.com/dcreager/status/1467654252516589571
| dsanchez97 wrote:
| Yep I feel you on that making life harder for these kinds
| of tools!
| jkaptur wrote:
| I think the (unstated) expectation with code navigation tools
| is that they are best-effort. Beyond your example, plenty of
| languages allow exotic and dynamic runtime behavior -
| eval(..)ing user/network input, monkeypatching, etc. - that
| makes it impossible to know a priori exactly what a call site
| might invoke.
| one_off_comment wrote:
| For people like me who don't have time to watch the talk,
| what's the answer to the question posed on the blog post? "Why
| aren't we using the Language Server Protocol (LSP) or Language
| Server Index Format (LSIF)?"
| dcreager wrote:
| I go into some amount of detail in a talk I gave at last
| year's FOSDEM: https://dcreager.net/talks/2020-fosdem/
|
| For LSP, the short version is that running separate sidecar
| services in production for every language that we want to
| support is a complete non-starter. That would completely eat
| up my team's time budget handling operational duties.
|
| LSIF is a great technology that lets you run LSP servers in a
| "batch" mode. But we _really_ need our analysis to be
| incremental, where we can reuse results for unchanged files
| when new commits come in. Language servers tend to do
| monolithic analyses, where _every_ file needs to be
| reanalyzed whenever _any_ new commit comes in. If you want to
| analyze your dependencies, as well, that exacerbates the
| problem. _LSIF_ (the data format) has recently grown the
| ability to produce incremental data, but that requires
| language servers to work in an incremental mode as well. Very
| few (if any?) do, and because language servers tend to piggy-
| back on existing compiler technology (which is also not
| typically incremental), it will be a heavy lift to get
| incrementality into the LSP /LSIF world.
|
| Whereas stack graphs have incrementality out of the box.
| (This was the primary thing that we added to "scope graphs",
| the academic framework that stack graphs are built on.) It's
| the core algorithm (which is implemented once for all
| languages) where the incrementality happens. The only
| language-specific parts are figuring out which graph
| structures you need to create to mimic the name binding rules
| of your language.
| billconan wrote:
| Thank you very much for this and for open sourcing it!
|
| For production, is there a good database system that can index
| this graph structure?
|
| for incremental update, how do you prune deprecated part of the
| graph (for example, removed/renamed files/functions?)
|
| and for this example
|
| (function_definition name: (identifier) @name) @function {
| node @function.def attr (@function.def) kind =
| "definition" attr (@function.def) symbol = @name
| edge @function.containing_scope -> @function.def
|
| }
|
| how can it guarantee the python shadowing rule? it doesn't seem
| to encode any order preference. does the code traverse the
| source file in the reverse order basically?
|
| And, probably not closely related to stack graph, but about
| using tree-sitter for c/c++ understanding, how to handle the
| preprocessor?
|
| because the c preprocessor can make the code look like a
| completely different language and mess up the parser.
|
| And how to prune and simplify CST to AST at scale (supporting
| many languages)?
| dcreager wrote:
| > And, probably not closely related to stack graph, but about
| using tree-sitter for c/c++ understanding, how to handle the
| preprocessor?
|
| Ha yeah that's a good question. Some uses of the preprocessor
| won't be problematic -- it would require deep token mangling,
| for instance, to really start to cause a problem. You can
| treat more basic `#ifdef` style conditional compilation as
| parsing/analyzing both sides and showing both as potential
| definitions. (And from there you could extend it further to
| try to identify (or define) "profiles" that have different
| preprocessor symbols defined, and use that to actually prune
| some of the results.)
| billconan wrote:
| Thank you very much for the answers! This is a great work!
|
| I'm thinking maybe stack graph can be used to understand
| the preprocessor. finding the original toggle/condition
| that turns on/off a #ifdef block.
|
| I heard a simple c++ hello world contains 5000 #defines
| introduced by standard libs. if stack graph can improve
| exhaustive search somehow, that would be awesome.
| dcreager wrote:
| > And how to prune and simplify CST to AST at scale
| (supporting many languages)?
|
| We're not doing any pruning or CST-AST translation, we just
| operate directly on the CST. With the new graph DSL you
| should be able to implement something like that, since an AST
| is a tree, and a tree is one shape of graph that you could
| create. For our purposes, that isn't a meaningfully useful
| step, since we can just as easily generate the stack graph
| structures that we need directly from the CST we get from the
| tree-sitter grammar.
| dcreager wrote:
| > For production, is there a good database system that can
| index this graph structure?
|
| For awhile, we were storing this in a (very large) MySQL
| database, sharded with Vitess. The sharding behavior worked
| great (since repo ID gives you a nice sharding key), but we
| found that it wasn't elastic enough for our needs, since we
| quickly filled up the available capacity of the machines that
| we had reserved.
|
| Since then we've switched over to storing this data in Azure
| Blob Storage, basically using it as a glorified key/value
| store. We had to write custom logic for deciding how to
| structure our data so that we can efficiently write it at
| index time and read it at query time, but so far it's been
| working quite nicely!
|
| > for incremental update, how do you prune deprecated part of
| the graph
|
| Short version is that we're storing everything on a per-file
| basis. So whenever a file is changed, we generate a new stack
| graph snippet for that file. There might be lots of content
| in that stack graph that is identical to the stack graph of
| the previous version of the file, but we don't try to do any
| structural sharing more fine-grained than the file.
|
| Right now we aren't going in any pruning old files that
| aren't being touched by any active queries, but we could. Or
| move it to a colder storage tier in Blob Storage, something
| like that. At least for now, the marginal costs of storing
| the data for longer aren't our cost bottleneck.
| dcreager wrote:
| > how can it guarantee the python shadowing rule? it doesn't
| seem to encode any order preference. does the code traverse
| the source file in the reverse order basically?
|
| That snippet of graph DSL does not show the precedences being
| applied, but if you look at the diagram a bit earlier in the
| post, you'll see that some of edges do have precedence values
| applied. In the graph DSL, that would appear as an additional
| statement in the stanza: attr
| (@function.containing_scope -> @function.def) precedence = 1
| dahart wrote:
| This is way cool!! I don't have any deep questions yet, but to
| start I'm a bit curious about some very minor things like the
| name and description. I'm curious why "stack graph" as opposed
| to "call graph" or "callstack graph" or something like that,
| I'm guessing you do have some thoughts there. Also curious
| about the way you described it, the process overall certainly
| sounds a lot like parsing, compiling, and linking at the end,
| but you haven't really used that analogy. I guess I'm just
| wondering if you're framing the name and description carefully
| and if there are specific reasons you'd be willing to discuss.
| dcreager wrote:
| Great questions! The framework is based on some great
| existing academic work from Eelco Visser's group at TU Delft.
| Their framework is called "scope graphs":
| https://pl.ewi.tudelft.nl/research/projects/scope-graphs/
|
| We extended scope graphs to have the symbol stack (described
| in OP) and also a "scope stack", which allows us to support
| the more advanced examples that I alluded to at the end. So
| we chose the name "stack graphs" because it was "scope graphs
| but using stacks".
| ZeroCool2u wrote:
| Just out of curiosity, what's the timeline look like for adding
| precise code navigation to other supported languages and what
| languages do you think will get support first?
|
| No need for precise answers, just wondering which ones we're
| likely to see next after Python :)
|
| Also, I'm looking at the list of supported languages here[1].
| Maybe you're not the right person to ask, but are there any
| plans to add support for one of the lower level / systems
| programming languages like C, C++, or Rust, etc?
|
| Finally, thank you so much for you and your teams hard work.
| This feature is _incredibly_ helpful, especially in Python!
|
| [1]: https://docs.github.com/en/repositories/working-with-
| files/u...
| dcreager wrote:
| It's not "easy", but we've found that because it's based on a
| declarative DSL it's less effort than you might expect. And
| we're finding that there are common patterns that you use in
| your graph construction rules, because there are many aspects
| of name binding that end up working the same way in different
| languages. So, hand-wavily (not out of secrecy but because of
| not having rigorous data yet), we're finding that it's
| O(months) to get a new language out the door.
|
| We do have a couple of other languages in the pipeline that
| my team has been working on, both in terms of writing stack
| graph rules to get precise support, and also to write "fuzzy"
| tagging rules to get search-based support. And we definitely
| do plan to include lower level languages like the ones you
| mentioned.
|
| Lastly, one major reason that we're doing all of this in
| open-source projects is that we want to ensure that language
| communities can self-serve support for their languages,
| should they wish to. That will be especially useful for the
| long tail of languages that my team will honestly never be
| able to get to ourselves. We have some work to do to get the
| documentation written to properly support self-serve stack
| graph rules, but it's definitely a goal that we're aiming
| for.
| ZeroCool2u wrote:
| Got it, that's really helpful in terms of estimating the
| work involved. Thanks for taking the time to answer all
| these questions and congrats again on the release!
| chefandy wrote:
| As an aside to the technical conversation-- I appreciate the
| use of cooking as an analogy in your code samples in the posted
| article.
|
| It's not just because I was a chef! Using nouns, adjectives and
| verbs from (familiar) concrete hierarchical analogies makes
| technical writing much more accessible. It reduces cognitive
| load by implying more about the structure of the relationship
| than variables like "intVar" and functions like "printVar" tied
| together in entirely contrived, abstract ways. In particular,
| newer developers, or ones unfamiliar with your language
| paradigms will benefit heartily.
|
| I implore my fellow developers to follow suit.
___________________________________________________________________
(page generated 2021-12-09 23:00 UTC)