[HN Gopher] AST-grep(sg) is a CLI tool for code structural searc...
___________________________________________________________________
AST-grep(sg) is a CLI tool for code structural search, lint, and
rewriting
Author : methou
Score : 213 points
Date : 2023-12-10 12:03 UTC (10 hours ago)
(HTM) web link (github.com)
(TXT) w3m dump (github.com)
| beardedwizard wrote:
| Is this meant to complement or compete with semgrep?
| andrewshadura wrote:
| Well, it _is_ semgrep (hence sg).
| beardedwizard wrote:
| yeah I had this feeling a bit, I guess I'm curious what
| problems they solve differently (if any). My sense is that
| semgrep is an enterprise managed solution of the same kind
| (and btw, is still itself OSS)
| ekidd wrote:
| Well, when I search for "semgrep", I get a very nice corporate
| landing page with a "Book Demo" button. Which is a level of
| hassle that just isn't worth it for smaller teams, because
| "Book Demo" usually means "We're going to do a dance to see how
| much money we can extract from you." Which smaller teams may
| only want to do for a handful of key tools.
|
| (4 years ago, I was more willing to put up with enterprise
| licensing. But in the last two years, I've seen way too many
| enterprise vendors try to squeeze every penny they can get from
| existing clients. An enterprise sales process now often means
| "Expect 30% annual price hikes once you're in too deep to back
| out." The lack of easy VC money seems to have made some
| enterprise vendors pretty desperate.)
|
| There's also an open source "semgrep" project here:
| https://github.com/semgrep/semgrep. But this seems to be
| basically a vulnerability scanner, going by the README.
|
| Whereas AST-grep seems to focus heavily on things like:
|
| 1. One-off searching: "Search my tree for this pattern."
|
| 2. Refactoring: "Replace this pattern with this other pattern."
|
| AST-grep also includes a vulnerability scanning mode like
| semgrep.
|
| It's possible that semgrep also has nice support for (1) and
| (2), but it isn't clearly visible on their corporate landing
| page or the first open source README I found.
| icholy wrote:
| Semgrep is capable of one-off searching and refactoring. I
| agree that the docs are a little hard to navigate.
| herrington_d wrote:
| Thanks, ekidd, for your kind words! ast-grep author here. This
| is a hobby project and mainly focuses on developers' daily
| jobs like search and linting. I appreciate that you like it!
|
| Semgrep's vulnerability scanning is much more advanced,
| mostly for enterprise security usage.
| icholy wrote:
| Looks like a competitor to me.
| herrington_d wrote:
| Hi, ast-grep author here. This is a great question and I asked
| this in the first place before I started the hobby project.
|
| TLDR; I designed ast-grep to be on different tracks than
| semgrep.
|
| Semgrep is for security and ast-grep is for development.
|
| First and foremost, I have always been in awe of semgrep.
| Semgrep's documentation, product sites and Padioleau's podcast
| all gave me a lot of inspiration. Using code to find code is
| such a cool idea that I never need to craft an intricate regex
| or write a lengthy AST program. sgrep and patch from
| https://github.com/facebookarchive/pfff/wiki/Sgrep have helped
| me a lot in real large codebases.
|
| When I used semgrep as a software engineer, instead of a
| security researcher, I found semgrep has not touched too much
| on routine development work. I can use `semgrep -e PATTERN`
| but the Python wrapper is not too fast compared to grep. While
| pattern is cool, it cannot precisely match some syntax nodes.
| (For example, selecting a generator expression in Semgrep is
| very hard.) It also does not have an API to find code
| programmatically.
|
| I also have a short summary for tool comparison:
| https://ast-grep.github.io/advanced/tool-comparison.html
| herrington_d wrote:
| Why I think semgrep is a security tool different from ast-grep:
|
| * Semgrep is security focused. It has many advanced static
|   analysis features in its core product, such as dataflow
|   analysis, symbolic propagation, and semantic equivalence, all
|   of which are useful for security analysis. They are not
|   available in ast-grep.
|
| * Semgrep's pattern syntax also prefers matching more
|   potentially vulnerable semantics than matching precise
|   syntax. Semantic level information is the better level of
|   abstraction for security model. ast-grep, on the other hand,
|   sticks to faithfully translating users' queries
|   syntactically.
|
| * Semgrep has a one-off search and rewrite feature, but it is
|   not its primary focus. The CLI is a bit slow compared to
|   other tools. ast-grep strives to be a fast CLI tool.
|
| * Semgrep has a product matrix for vulnerability detection:
|   detecting secrets, supply chain vulnerabilities, and
|   cross-file detection. It also has a plethora of security
|   rules in the registry. These features will not be included
|   in ast-grep.
| hprotagonist wrote:
| Nice to see treesitter showing up in tools that aren't just
| syntax highlighting.
| herrington_d wrote:
| treesitter gives us a uniform interface to parse and manipulate
| code, which is awe-inspiring work. I wish tree-sitter could
| have more contributors to the core library. It still has a lot
| of room for improvement.
|
| Say, like performance. tree-sitter's initial parsing speed can
| be easily beaten by a carefully hand-crafted parser. Tree-
| sitter, written in C, has a similar JavaScript parsing speed as
| Babel, a JS-based parser. See the benchmark
| https://dev.to/herrington_darkholme/benchmark-typescript-par...
| teo_zero wrote:
| Besides, it doesn't shine at syntax highlighting, either! In
| the sense that it doesn't add anything that the traditional
| text-based algorithms embedded in practically any text editor
| can't already do. For example, if I declare a variable called
| "something", it should highlight all successive occurrences of
| "something" in a remarkably different style than "somethink".
| And the "a" in "sizeof(a)" should be rendered differently when
| it's a variable and when it's a type.
| gpuhacker wrote:
| Does anyone happen to know of a similar tool that can compare two
| codes for semantic similarity?
| LelouBil wrote:
| Maybe look here (never used it though)
|
| https://github.com/Wilfred/difftastic
| dorian-graph wrote:
| Or https://github.com/afnanenayet/diffsitter. I've tried both
| and I like them. No preference or notable opinions on them
| yet!
| _a_a_a_ wrote:
| define 'semantic similarity'
|
| would your hoped-for tool recognise that 1
|
| and sin(x)^2 + cos(x)^2
|
| are the same? (I think that identity holds, but if not you get
| the picture)
| _a_a_a_ wrote:
| to the downvoter: I thought this was a reasonable question?
| Semantic equivalence is IIRC undecidable in general. Some
| languages (Backus' FL?) try to deal with that but I dunno.
| tyingq wrote:
| > Semantic equivalence is IIRC undecidable in general.
|
| They did mention code, and said "similarity" rather than
| equivalence.
|
| But, as a trivial example, two different pieces of code can
| compile down to the same AST, or bytecode, or assembler.
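
A minimal sketch of the point above, using Python's standard `ast` module. The snippets and the helper name `same_ast` are made up for illustration. Two source strings that differ only in whitespace, parentheses, and comments parse to the same tree, while a genuinely different expression does not.

```python
import ast


def same_ast(a: str, b: str) -> bool:
    """Return True if two Python snippets parse to structurally identical ASTs."""
    # ast.dump() leaves out line/column attributes by default,
    # so differences in layout and comments do not show up.
    return ast.dump(ast.parse(a)) == ast.dump(ast.parse(b))


print(same_ast("x = 1+2", "x = (1 + 2)  # sum"))  # True
print(same_ast("x = 1 + 2", "x = 2 + 1"))         # False
```

This only covers textual variation that the parser already discards. Anything closer to real "similarity" (reordered operands, renamed variables) needs extra normalisation on top.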
| mst wrote:
| That looks like a case where "analyse the AST after constant
| folding" might be a theoretical path if you had a language
| frontend that could emit the AST at that point.
|
| I suspect that things like "these two functions both start
| with the same conditional+early return" would be more useful
| to -me- given the sort of things I tend to be working on *.
| Also a 'fuzzy possible copy+paste detector' in general to
| help identify refactoring targets.
|
| It also strikes me that something that was mostly 'just' a
| structure-aware diff so e.g. you got diffs within-if-body and
| similar but I'm now into vigorous hand waving because it's
| been ages since I've thought about this and I probably need
| more coffee.
|
| * I -did- do a pure maths degree many years ago but I don't
| generally seem to end up working on computational code
| thfuran wrote:
| Not with floats it isn't.
| _a_a_a_ wrote:
| umm, touche
| benmanns wrote:
| You could try embedding the two codes with an LLM and run any
| number of similarity measures on the output vectors.
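
As a rough sketch of the "similarity measures" half of that suggestion: cosine similarity over two embedding vectors, in plain Python. The vectors below are invented placeholders, not output from any real model, and producing the embeddings is left out entirely.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors (1.0 = same direction)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Hypothetical embeddings for three code snippets (made-up numbers).
emb_a = [0.9, 0.1, 0.3]
emb_b = [0.8, 0.2, 0.4]
emb_c = [-0.1, 0.9, -0.5]

print(cosine_similarity(emb_a, emb_b))  # close to 1: "similar"
print(cosine_similarity(emb_a, emb_c))  # negative: "dissimilar"
```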
| alexpovel wrote:
| Wow! What a coincidence. Just the other day I finished "v1" of a
| similar tool: https://github.com/alexpovel/srgn , calling it a
| combination of tr/sed, ripgrep and tree-sitter. It's more about
| editing code in-place, not finding matches.
|
| I've spent a lot of time trying to find similar tools, and even
| list them in the README, but `AST-grep` did not come up! I was a
| bit confused, as I was sure such a thing _must_ exist already.
| AST-grep looks much more capable and dynamic, great work,
| especially around the variable syntax.
| tekacs wrote:
| This looks really interesting, thank you for putting this
| together! I'll likely give it a go today. I say that as someone
| who has explored quite a few of these and found them mostly
| quite basic. srgn looks like more than the usual.
|
| One minor comment: I personally found the first Python example
| involving a docstring a little hard to parse (ha). It may show
| a variety of features, but in particular I found that it was
| hard to spot at a glance what had changed.
|
| If you could use diff formatting or a screenshot with color to
| show the differences it would make it much easier to follow. If
| I get around to using it later today, I might submit a PR for
| that. :)
| alexpovel wrote:
| > diff formatting
|
| Thank you for the feedback! That sounds good, I'll add that.
| alchemist1e9 wrote:
| Such an awesome idea and useful tool!
|
| Do you use tree-sitter for the AST part also?
| alexpovel wrote:
| Exactly, all the parsing is done by tree-sitter. The Rust
| bindings to the tree-sitter C lib are a "first-class
| consumer".
| eloh wrote:
| There is also a neovim plugin doing structural search/replace,
| also based on treesitter: https://github.com/cshuaimin/ssr.nvim
| wslh wrote:
| ELI5: should you specify the target language? The example is in
| TS, how do we expand it to other programming languages?
| lyjackal wrote:
| I see an -l ts
|
| And an -l rs
|
| In the examples. Those target typescript and rust. Looks like
| it's built in tree-sitter, so presumably any language that
| supports that should work
| wslh wrote:
| I understand this approach is different from Semmle [1] (has
| queries and states). Do you know if there are modern
| alternatives to it?
|
| [1] https://en.wikipedia.org/wiki/Semmle
| simonw wrote:
| There is a list of supported languages here:
| https://ast-grep.github.io/guide/introduction.html#supported...
|
| If you leave off the language command line option it detects
| the language from the extension on your files.
| gushogg-blake wrote:
| I came up with a similar concept for in-editor SSR as an
| extension to existing find/replace functionality:
| https://codepatterns.io/
|
| It worked great for the use case I built it around initially but
| I think it would need a scripting/logic component to generalise
| to any conceivable refactoring.
| elric wrote:
| If you're into this sort of thing, there's OpenRewrite[1] for the
| Java ecosystem.
|
| [1] https://docs.openrewrite.org/
| anotherpaulg wrote:
| I'll share my similarly named tool `grep-ast` [0], which sort of
| does the opposite of the OP's `ast-grep`. The OP's tool lets you
| specify your search as a chunk of code/AST (and then do AST
| transforms on matches).
|
| My tool lets you grep a regex as usual, but shows you the
| matches in a helpful AST aware way. It works with most popular
| languages, thanks to tree-sitter.
|
| It uses the abstract syntax tree (AST) of the source code to show
| how the matching lines fit into the code structure. It shows
| relevant code from every layer of the AST, above and below the
| matches. It's useful when you're grepping to understand how
| functions, classes, variables etc are used within a non-trivial
| codebase.
|
| Here's a snippet that shows grep-ast searching the django repo.
| Notice that it finds `ROOT_URLCONF` and then shows you the method
| and class that contain the matching line, including a helpful
| part of the docstring. If you ran this in the terminal, it would
| also colorize the matches.
|
|     django$ grep-ast ROOT_URLCONF
|
|     middleware/locale.py:
|     |from django.conf import settings
|     |from django.conf.urls.i18n import is_language_prefix_patterns_used
|     |from django.http import HttpResponseRedirect
|     ⋮...
|     |class LocaleMiddleware(MiddlewareMixin):
|     |    """
|     |    Parse a request and decide what translation
|     |    object to install in the current thread context.
|     ⋮...
|     |    def process_request(self, request):
|     >        urlconf = getattr(request, "urlconf", settings.ROOT_URLCONF)
|
| [0] https://github.com/paul-gauthier/grep-ast
| herrington_d wrote:
| Hey paulg, ast-grep author here! This is something I also want
| to do in ast-grep! ast-grep prints the surrounding lines around
| matches but they are not aware of which function/scope the
| matches are in. May I ask how you do the scope detection in a
| general fashion? (say language agnostic)
| https://github.com/ast-grep/ast-grep/issues/155
| anotherpaulg wrote:
| Nice, thanks for checking out grep-ast.
|
| The command line tool is a thin wrapper around the
| `TreeContext` class, whose purpose is to show you a set of
| "lines of interest" in the context of the entire AST. This
| all exists because my other project aider [0] uses
| TreeContext to display a repository map [1] so that GPT-4 can
| understand how the most important classes, methods,
| functions, etc fit into the entire code base of a git
| repository.
|
| But it was easy to make a CLI interface to grep lines of
| interest and display them with TreeContext, and it turned out
| to be quite useful.
|
| The TreeContext class is line-oriented, and is mainly
| interested in tracking language constructs whose scope spans
| multiple lines. Typically these are things like classes,
| methods, functions, loops, if/else constructs, etc. Given a
| line of interest, we look at all the multi-line scopes which
| contain it. For each such multi-line scope, we want to
| display some "header" lines to provide context.
|
| In this example, the match for "two" is contained in the
| multi-line scopes of a method and a class. So we print their
| headers.
|
|     $ grep-ast two example.py
|
|     ⋮...
|     |class MyClass:
|     |    "MyClass is great"
|     ⋮...
|     |    def print2(self):
|     >        print("two")
|     ⋮...
|
| The trick is how to determine the header for each multi-line
| scope? It's not ideal to just use the first line. For
| example, it's nice that the header for the class included the
| docstring as well as the bare `class MyClass:` line.
|
| For any multi-line scope, we look at all the other AST scopes
| which start on the same line. We take the smallest such co-
| occurring scope, and declare the header to be the lines that
| it spans. For a simple method like `def print2(self):`,
| that's all that gets picked up.
|
| But a complex method like `print1()` below picks up all the
| lines which are part of its full function signature:
|
|     $ grep-ast one example.py
|
|     ⋮...
|     |class MyClass:
|     |    "MyClass is great"
|     ⋮...
|     |    def print1(
|     |        self,
|     |        prefix,
|     |        suffix,
|     |    ):
|     ⋮...
|     >        print(f"{prefix} one {suffix}")
|     ⋮...
|
| It's a heuristic, but it seems to work well in practice.
|
| [0] https://github.com/paul-gauthier/aider
|
| [1] https://aider.chat/docs/repomap.html
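
The header heuristic described above can be sketched in a few lines of Python. This is not grep-ast's code. It works on hand-written `(start, end)` line spans standing in for what a tree-sitter parse might report, and it assumes that only multi-line spans count as "scopes" and that a scope with no other multi-line scope starting on its line gets a one-line header. The names `SOURCE`, `SPANS`, `lines_to_show`, and `render` are all hypothetical.

```python
from collections import defaultdict

# Hypothetical example file; line numbers are 0-based.
SOURCE = '''class MyClass:
    "MyClass is great"

    def print2(self):
        print("two")

    def print1(
        self,
        prefix,
        suffix,
    ):
        print(f"{prefix} one {suffix}")'''.splitlines()

# Hand-written (first_line, last_line) spans, standing in for parser output.
SPANS = [
    (0, 11),   # class definition
    (1, 11),   # class body
    (3, 4),    # print2 definition
    (4, 4),    # print2 body (single line, ignored below)
    (6, 11),   # print1 definition
    (6, 10),   # print1 parameter list
    (11, 11),  # print1 body (single line, ignored below)
]


def lines_to_show(spans, line_of_interest):
    """Return the sorted lines to print: the match plus a header per enclosing scope."""
    # Group multi-line spans by the line they start on.
    starts = defaultdict(list)
    for start, end in spans:
        if end > start:
            starts[start].append((end - start, start, end))

    shown = {line_of_interest}
    for start, candidates in starts.items():
        # Skip start lines where no scope encloses the line of interest.
        if not any(s <= line_of_interest <= e for _, s, e in candidates):
            continue
        if len(candidates) > 1:
            # Several scopes start here: the smallest one is the header.
            _, head_start, head_end = min(candidates)
        else:
            head_start = head_end = start
        shown.update(range(head_start, head_end + 1))
    return sorted(shown)


def render(line_of_interest):
    """Format the selected lines, marking gaps with '...' and the match with '>'."""
    out, previous = [], -1
    for n in lines_to_show(SPANS, line_of_interest):
        if n > previous + 1:
            out.append("...")
        out.append((">" if n == line_of_interest else "|") + SOURCE[n])
        previous = n
    if previous < len(SOURCE) - 1:
        out.append("...")
    return "\n".join(out)


print(render(4))
print(render(11))
```

With these spans, the class body starts on the docstring line, which is why the docstring shows up as context, and the `print1` parameter list is the smallest scope starting on the `def` line, which is why the whole signature is kept.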
| svilen_dobrev wrote:
| hey.. are these tools (or combination thereof) capable of
| converting parts of code in one language to another? Given no (or
| minimum) idiosyncrasies... e.g. python to javascript or other way
| around? (And no, ML is not the answer, I need provable
| correctness)
| morgante wrote:
| I've done a lot of work in this space, and unfortunately the
| answer is largely no.
|
| These provide a nice frontend for writing simple rules, but I
| would not want to (essentially) write an entire transpiler in
| yaml.
|
| For Python->JavaScript, you likely want a transpiler focused
| specifically on that.
|
| Unfortunately, every such effort eventually hits serious limits
| in the emergent complexity for languages. There's a reason most
| of the SOTA techniques are ML-based.
| herrington_d wrote:
| Provable correctness means you have to model your source and
| target languages. And then translate the source model to the
| target model. It is theoretically possible, but in practice,
| modeling an industry language is way too much work. Some
| languages do not even have a spec :/
| norir wrote:
| The problem with any tree-sitter based tool is that there will
| typically be edge cases where the tree-sitter parser is wrong.
| Probably not a big deal most of the time, but it makes me wary of
| using it for security.
| Noumenon72 wrote:
| What does it mean to use grep "for security"?
| richbell wrote:
| E.g., "I just read about CVE-2007-4559 being exploited in the
| wild. Are we using this vulnerable method?"
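
For a Python codebase, that kind of question can be approximated with the standard `ast` module alone. The sketch below is an illustration, not how ast-grep or semgrep do it: it flags every call to a method named `extract` or `extractall` (the `tarfile` methods associated with CVE-2007-4559) while ignoring the same word inside strings and comments. It matches by method name only, so calls on other archive types are flagged too. `RISKY_METHODS`, `find_risky_calls`, and `SAMPLE` are made-up names.

```python
import ast

RISKY_METHODS = {"extract", "extractall"}


def find_risky_calls(source: str):
    """Return sorted (line, method_name) pairs for calls like `obj.extractall(...)`."""
    hits = []
    for node in ast.walk(ast.parse(source)):
        if (
            isinstance(node, ast.Call)
            and isinstance(node.func, ast.Attribute)
            and node.func.attr in RISKY_METHODS
        ):
            hits.append((node.lineno, node.func.attr))
    return sorted(hits)


SAMPLE = '''import tarfile

def unpack(path, dest):
    with tarfile.open(path) as archive:
        archive.extractall(dest)  # flagged

def describe():
    return "call extractall() later"  # a string, not a call
'''

print(find_risky_calls(SAMPLE))  # [(5, 'extractall')]
```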
| Phelinofist wrote:
| So this is like a more general Coccinelle?
| morgante wrote:
| AST-grep is well done - the speed is particularly impressive and
| it's quite easy to get started with.
|
| One of the downsides of the simplicity is that rules are written
| in yaml. This works nicely for simple rules, but if you try to
| save a complex migration as a rule, you end up programming in
| YAML (which is very hard).
|
| For my similar tool we decided to build a full query language for
| matching code, called GritQL:
| https://docs.grit.io/tutorials/gritql
| herrington_d wrote:
| Hey morgante, nice to meet you again! Indeed YAML is a
| compromise between expressiveness and ease of learning. Grit did
| a great job providing advanced code manipulation!
| da39a3ee wrote:
| This looks exciting. One thing I've always wanted to do is search
| Rust code but excluding code in tests (marked by a #[cfg(test)]
| annotation). Can it do that?
|
| I certainly hope some excellent AST-based CLI code search tools
| come to exist; hopefully this is one of them.
| herrington_d wrote:
| Of course, it has you covered.
|
| https://ast-grep.github.io/playground.html#eyJtb2RlIjoiQ29uZ...
|
| I have the same problem also, haha,
| https://x.com/hd_nvim/status/1667059966111547392
| da39a3ee wrote:
| Thanks! How would you do that for a #[cfg(test)] attribute in
| Rust? (I believe that's the true identifier of test code; `mod
| test {}` is just a convention.) I assume Rust attributes
| "wrap" the AST node rooted at the node that follows them?
| simonw wrote:
| Something I find really interesting about this is the way the
| tool is packaged.
|
| You can install the CLI utility in four different ways:
| https://ast-grep.github.io/guide/quick-start.html#installati...
|
|     # via Homebrew
|     brew install ast-grep
|
|     # via Cargo
|     cargo install ast-grep
|
|     # via npm
|     npm i @ast-grep/cli -g
|
|     # via pip
|     pip install ast-grep-cli
|
|     # I tested and pipx works too:
|     pipx install ast-grep-cli
|
| I really like this - it means the tool is available to people
| familiar with any of those four distribution mechanisms.
|
| You can also download pre-built binaries from their releases
| page: https://github.com/ast-grep/ast-grep/releases/tag/0.14.2
|
| On top of that, they offer API bindings for it in three different
| languages:
|
| - Rust (not yet stable):
|   https://docs.rs/ast-grep-core/latest/ast_grep_core/
|
| - JavaScript/TypeScript:
|   https://ast-grep.github.io/guide/api-usage/js-api.html
|
| - Python: https://ast-grep.github.io/guide/api-usage/py-api.html
|
| It's rare to see a tool/library offer this depth of language
| support out of the box.
| simonw wrote:
| I was curious so I had a look at how the "pip install
| ast-grep-cli" command works. It downloads a wheel for the correct
| platform from https://pypi.org/project/ast-grep-cli/#files
|
| The wheel just contains the two binaries (sg and ast-grep) and
| no Python code:
|
|     $ unzip -l ast_grep_cli-0.14.2-py3-none-macosx_10_7_x86_64.whl
|     Archive:  ast_grep_cli-0.14.2-py3-none-macosx_10_7_x86_64.whl
|       Length      Date    Time    Name
|     ---------  ---------- -----   ----
|          6207  12-03-2023 07:34   ast_grep_cli-0.14.2.dist-info/METADATA
|           102  12-03-2023 07:34   ast_grep_cli-0.14.2.dist-info/WHEEL
|          1077  12-03-2023 07:34   ast_grep_cli-0.14.2.dist-info/license_files/LICENSE
|          1077  12-03-2023 07:34   ast_grep_cli-0.14.2.dist-info/license_files/LICENSE
|      32865880  12-03-2023 07:34   ast_grep_cli-0.14.2.data/scripts/sg
|      32865880  12-03-2023 07:34   ast_grep_cli-0.14.2.data/scripts/ast-grep
|           639  12-03-2023 07:34   ast_grep_cli-0.14.2.dist-info/RECORD
|     ---------                     -------
|      65740862                     7 files
|
| I haven't seen pip and wheels used to distribute a purely
| binary tool like this before.
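
Since a wheel is an ordinary zip archive, the same inspection can be done from Python with the standard `zipfile` module. A small sketch follows. The archive is a tiny in-memory stand-in with invented file names, not the real ast-grep wheel, and `list_archive` is a hypothetical helper.

```python
import io
import zipfile


def list_archive(file_obj):
    """Return (size_in_bytes, name) pairs for each member, roughly like `unzip -l`."""
    with zipfile.ZipFile(file_obj) as archive:
        return [(info.file_size, info.filename) for info in archive.infolist()]


# Build a small stand-in "wheel" in memory (file names are made up).
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("demo-0.0.1.data/scripts/demo", b"\x7fELF fake binary")
    archive.writestr("demo-0.0.1.dist-info/METADATA", "Name: demo\n")
buffer.seek(0)

for size, name in list_archive(buffer):
    print(f"{size:>9}  {name}")
```

To inspect a real download, pass the path of the `.whl` file to `list_archive` instead of the in-memory buffer.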
| charliermarsh wrote:
| This is how Ruff works too! (Ruff is also a standalone binary
| with no Python dependency.) If you're interested, I recommend
| checking out Maturin, which makes this pretty easy -- you can
| ship any standalone Rust binary as a Python package by
| zipping it into a wheel.
| herrington_d wrote:
| I confess I stole the pip recipe from Charlie :D
|
| https://github.com/astral-sh/ruff/blob/main/.github/workflow...
| tedunangst wrote:
| A looping gif is an unfortunate choice for a demo. It looks cool
| to start, but then I'm trying to see what it's done when it
| restarts and I have to sit through it again. Some before and
| after still screenshots would help.
| eviks wrote:
| indeed, this is a purely text demo, and it wastes too much time
| with slow typing in the video while also preventing you from
| using search
| Conscat wrote:
| I've tried using this, but the documentation and learning
| resources weren't very good (at least at the time ~6 months ago)
| and structuring refactors with YAML made it very cumbersome for
| me to write and edit one-off commands.
|
| Tree Sitter also leaves a lot to be desired for C++ editing, but
| that's a special problem.
| simonw wrote:
| Looks like the project is only about 12 months old, so if you
| last checked it out 6 months ago it's worth taking another
| look.
|
| Was it possible to use it entirely as a CLI tool without any
| YAML 6 months ago?
| Conscat wrote:
| Unless the search/replace is super simple, you need the YAML
| as far as I can tell. The refactor I gave up on automating
| had to do with changing variadic C++ macros into arithmetic
| expressions, which wasn't conceptually very complicated, but
| felt almost impossible while constantly tripping over YAML
| syntax errors.
| simonw wrote:
| The YAML syntax I find most useful for this kind of thing
| is this:
|
|     something:
|       subkey: |
|         I can put any characters I like in here
|         And they "won't be messed up" by anything
|         Because they are part of a multi-line string
| elanning wrote:
| Also plugging my related project:
| https://github.com/Ichigo-Labs/cgrep
|
| From the comments in this thread, it seems a lot of people have
| built or needed an easy way to quickly create static analysis
| checks, without a bunch of hassle. I think extended regex is a
| great way to do this.
| cglong wrote:
| I was hoping this could be a local replacement for Azure DevOps's
| functional code search[1], but this seems lower-level than that.
| Basically, I want a tool where I can write something like
| `class:Logger` and it'll show me which file(s) define a class
| with that name, or `ref:Logger` to find all usages of that/those
| class(es).
|
| [1]: https://learn.microsoft.com/en-us/azure/devops/project/searc...
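
For a single Python file, a toy version of those `class:` and `ref:` queries can be written with the standard `ast` module. This is only a sketch under simple assumptions: it matches by name with no scope or import resolution, and `find_symbol` and `SAMPLE` are hypothetical names, not part of any of the tools discussed.

```python
import ast


def find_symbol(source: str, name: str):
    """Return (definition_lines, reference_lines) for a class name in one file."""
    definitions, references = [], []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.ClassDef) and node.name == name:
            definitions.append(node.lineno)   # like `class:Logger`
        elif isinstance(node, ast.Name) and node.id == name:
            references.append(node.lineno)    # like `ref:Logger`
    return sorted(definitions), sorted(references)


SAMPLE = '''class Logger:
    pass

class FileLogger(Logger):
    pass

log = Logger()
note = "Logger in a string is not a reference"
'''

print(find_symbol(SAMPLE, "Logger"))  # ([1], [4, 7])
```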
___________________________________________________________________
(page generated 2023-12-10 23:00 UTC)