hngopher.com

       [HN Gopher] Semgrep: Semantic grep for code
       ___________________________________________________________________
        
       Semgrep: Semantic grep for code
        
       Author : ievans
       Score  : 253 points
       Date   : 2021-04-22 16:51 UTC (6 hours ago)
        
 (HTM) web link (semgrep.dev)
 (TXT) w3m dump (semgrep.dev)
        
       | pabs3 wrote:
       | Does it come with a standard set of rules that finds bad code
       | without any false positives out of the box? Or is it more of a
       | tool for people doing code security audits & pentesting who know
       | what they are looking for and want to read the surrounding code?
        
       | hyper_reality wrote:
       | This is an excellent tool to have as a security consultant, and
       | it just keeps getting better and better. When approaching a large
       | codebase, it enables you to write custom rules that match on
       | certain antipatterns you've spotted that may be unique to the
       | codebase. That's the real value of the tool, but the repository
       | of per-language rules is also convenient for quickly finding low-
       | hanging fruit (like every use of a potentially injectable
       | function such as exec,system,etc. in PHP).
       | 
       | For example, a webapp may have been designed such that
       | authorisation needs to be explicitly added with a line or two to
       | each controller. A semgrep rule can be written to match all the
       | controllers which are missing this line. Then these controllers
       | can be manually reviewed to assess whether unauthorised access
       | should be allowed. Depending on what you are trying to match,
       | this is something that may be very complex or even impossible to
       | implement accurately in plain grep. Some languages like Ruby have
       | powerful static analysis tools (Brakeman) that can also do this,
       | but the benefit of Semgrep is the flexibility across multiple
       | languages and how readable the rulesets are. [1]
       | 
       | [1] https://blog.includesecurity.com/2021/01/custom-static-
       | analy...
        
         | robertlagrant wrote:
         | In Python you can probably do that with just the standard
         | library: ast.parse and all that.
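
A minimal sketch of the "controllers missing an authorisation line" check described above, using only Python's standard `ast` module. The helper `functions_missing_call`, the `require_login` name, and the sample controllers are hypothetical and exist only for illustration; this is not how Semgrep or Brakeman implement such checks, and it only recognises calls made by bare name.

```python
import ast


def functions_missing_call(source: str, required: str) -> list[str]:
    """Return names of functions whose body never calls `required` by bare name."""
    missing = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            # Collect every bare-name callee that appears anywhere in the function.
            called = {
                n.func.id
                for n in ast.walk(node)
                if isinstance(n, ast.Call) and isinstance(n.func, ast.Name)
            }
            if required not in called:
                missing.append(node.name)
    return sorted(missing)


# Hypothetical controllers: one calls require_login(), one forgets to.
SAMPLE_CONTROLLERS = '''
def show_profile(request):
    require_login(request)
    return render(request)

def delete_account(request):
    return drop_user(request)
'''

print(functions_missing_call(SAMPLE_CONTROLLERS, "require_login"))
```

The flagged functions would still need manual review, as the parent comment says; the point is only that the hand-written version is tied to one language's parser.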
        
           | smithza wrote:
           | Perhaps, but this tool has a syntax and common interface for
           | multiple languages. This is a huge resource for IT/security.
           | I could imagine this being a helpful refactoring tool as well
           | for CLI.
        
             | robertlagrant wrote:
             | No need to overstate. It's fine, but the rules seem to need
             | to be very bespoke, and any language's parser will do a
             | better job of figuring out syntax.
        
         | sytse wrote:
         | I agree it is an excellent tool. At GitLab we released our
         | Semgrep integration today
         | https://news.ycombinator.com/item?id=26903114
         | 
         | "GitLab SAST historically has been powered by over a dozen
         | open-source static analysis security analyzers. These analyzers
         | have proactively identified millions of vulnerabilities for
         | developers using GitLab every month. Each of these analyzers is
         | language-specific and has different technology approaches to
         | scanning. These differences produce overhead for updating,
         | managing, and maintaining additional features we build on top
         | of these tools, and they create confusion for anyone attempting
         | to debug.
         | 
         | The GitLab Static Analysis team is continuously evaluating new
         | security analyzers. We have been impressed by a relatively new
         | tool from the development team at r2c called Semgrep. It's a
         | fast, open-source, static analysis tool for finding bugs and
         | enforcing code standards. Semgrep's rules look like the code
         | you are searching for; this means you can write your own rules
         | without having to understand abstract syntax trees (ASTs) or
         | wrestle with regexes.
         | 
         | Semgrep's flexible rule syntax is ideal for streamlining
         | GitLab's Custom Rulesets feature for extending and modifying
         | detection rules, a popular request from GitLab SAST customers.
         | Semgrep also has a growing open-source registry of 1,000+
         | community rules.
         | 
         | We are in the process of transitioning many of our lint-based
         | SAST analyzers to Semgrep. This transition will help increase
         | stability, performance, rule coverage, and allow GitLab
         | customers access to Semgrep's community rules and additional
         | custom ruleset capabilities that we will be adding in the
         | future. We have enjoyed working with the r2c team and we cannot
         | wait to transition more of our analyzers to Semgrep. You can
         | read more in our transition epic, or try out our first
         | experimental Semgrep analyzers for JavaScript, TypeScript, and
         | Python.
         | 
         | We are excited about what this transition means for the future
         | of GitLab SAST and the larger Semgrep community. GitLab will be
         | contributing to the Semgrep open-source project including
         | additional rules to ensure coverage matches or exceeds our
         | existing analyzers."
        
         | tyingq wrote:
         | I'd be careful with how much of a warm fuzzy the tool gives
         | you. See this example from my other comment in the thread:
         | https://news.ycombinator.com/item?id=26905880 If it were really
         | looking at AST level data, that wouldn't have fooled it.
         | 
         | I suspect there would be similar issues with your example of
         | ensuring no use of eval() in PHP. So it seems okay to keep your
         | own developers informed, but I wouldn't use it, alone, to vet
         | outside code. PHP has eval-like functionality buried in
         | preg_replace(), assert(), and probably other places. This tool
         | also doesn't seem to dig into namespaced "aliases".
        
           | aseipp wrote:
           | > If it were really looking at AST level data, that wouldn't
           | have fooled it.
           | 
           | Semgrep does look at an AST; but that counterexample is not
           | something you can "fix" solely by looking at an AST. You need
           | actual Python-specific semantic analysis that knows that all
           | "open" functions like 'print' come from the builtins module,
           | and thus are bound to the same identifier. They're literally
           | built into the implementation, it's not something you can
           | "discover" from analyzing existing Python source. Even if you
           | had a perfectly accurate python AST it couldn't "tell" you
           | this fact, it's a priori knowledge, and all analysis engines
           | need a base set of facts like this that they work from.
           | 
           | > but I wouldn't use it, alone, to vet outside code.
           | 
           | I mean, nobody seems to be suggesting this though, and the OP
           | quite literally stated the major value of the tool is
           | enforcing domain/codebase-specific rules among a team. Which
           | is a really good use for it! There are tons of little useful
           | patterns you can codify this way.
        
             | tyingq wrote:
             | >Semgrep does look at an AST; but that counterexample is
             | not something you can "fix" solely by looking at an AST
             | 
             | Perhaps I worded it poorly. Dumping the python AST for
             | builtins.print() makes it pretty clear that it's "print"
             | though. So I'm curious why that skirts the rule.
             | 
             | >I mean, nobody seems to be suggesting this though
             | 
             | Not specifically, but the context is using it for security
             | purposes with phrases like _"every use of a potentially
             | injectable function such as exec,system,etc. in PHP"_. Felt
             | like that was worth commenting on.
        
               | [deleted]
        
               | joshuamorton wrote:
               | > Perhaps I worded it poorly. Dumping the python AST for
               | builtins.print() makes it pretty clear that it's "print"
               | though. So I'm curious why that skirts the rule.
               | 
               | To echo the other replies, the AST for builtins.print()
               | is the same as the AST for mymodule.print() and, in fact,
               | if you stick a builtins.py in the right place, you'll be
               | able to prevent the import of the standard library
               | builtins module, while the ASTs would be _identical_.
        
               | dataangel wrote:
               | > Dumping the python AST for builtins.print() makes it
               | pretty clear that it's "print" though.
               | 
               | No, the AST only tells you it's a method call on
               | something called "builtins." You need the separate
               | semantic knowledge of what builtins is in order to figure
               | it out. Parsing + AST just means it sees "method call of
               | `print` on `builtins` object". Regular print calls would
               | come through as "regular function call of `print`".
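
The distinction drawn in this subthread can be checked with Python's standard `ast` module. This is a small sketch; `describe_call` is a hypothetical helper written for illustration. It shows that the parser records `builtins.print(...)` and `mymodule.print(...)` with the same node shape, and a bare `print(...)` with a different one, so anything beyond that has to come from knowledge outside the tree.

```python
import ast


def describe_call(source: str) -> str:
    """Describe the callee of the call in the first expression statement."""
    func = ast.parse(source).body[0].value.func
    if isinstance(func, ast.Name):
        # Plain function call such as print(...)
        return f"Name({func.id})"
    if isinstance(func, ast.Attribute) and isinstance(func.value, ast.Name):
        # Attribute access on a name such as builtins.print(...)
        return f"Attribute({func.value.id}.{func.attr})"
    return type(func).__name__


print(describe_call('print("whee")'))           # Name(print)
print(describe_call('builtins.print("whee")'))  # Attribute(builtins.print)
print(describe_call('mymodule.print("whee")'))  # Attribute(mymodule.print)
```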
        
         | brundolf wrote:
         | I'm honestly surprised we haven't seen complete democratization
         | of linting/grepping/refactoring tools. Imagine being able to
         | script a one-off refactor you want to make that's too project-
         | specific to be included in the IDE itself.
         | 
         | Writing the parser is nontrivial, but once you have it it
         | should be straightforward to expose a programmatic API for
         | doing this stuff instead of trying to hardcode every useful
         | linter rule into a single program.
        
           | cjohansson wrote:
           | Writing a parser for one specific version of a language is
           | one thing (nontrivial), but writing a parser that parses most
           | versions of most programming languages and keeping it up-to-
           | date is an enormous undertaking and cannot be done by one
           | person.
        
           | vbsteven wrote:
           | You can do this with Rust using a combination of rustc, syn
           | and a build.rs file. Then you can execute more Rust code on
           | the parsed AST.
           | 
           | In the JVM world annotation processors or compiler plugins
           | can have access to the AST during compilation. Both in Kotlin
           | and Java.
        
           | lamontcg wrote:
           | We do this with rubocop and its ability to parse the ruby
           | AST. The API to actually write rules in rubocop is not
           | particularly the best, but it has been highly successful in
           | codebase and domain-specific ruby linting and autofixing.
        
       | hn_throwaway_99 wrote:
       | I currently use a highly opinionated ESLint config (based on the
       | airbnb one) together with strict checking in my TypeScript
       | config, and it is configured to run on every commit with husky
       | git hooks. The example given on the Semgrep homepage is an exact
       | match to one that exists in my ESLint config (eslint's no-console
       | rule).
       | 
       | How does Semgrep compare to ESLint+a strict tsconfig?
        
         | RichieAHB wrote:
         | It seems from reading the docs that the "semantic" side is
         | important here. It can track things like redefinitions.
         | 
         | E.g. const output = console.log; output("hello");
         | 
         | This wouldn't be caught by ESLint (to my knowledge) but would
         | be caught by Semgrep. I think you could do it with ESLint but
         | given the interface for an ESLint plugin exposes an AST, you'd
         | have to track this yourself. I'm assuming Semgrep could stretch
         | to things like enforcing APIs are called with certain optional
         | arguments present (even if the TS types don't require it).
         | Again, I think with ESLint you'd have to do more juggling with
         | the raw AST.
         | 
         | That said, this is my understanding from a quick skim!
        
         | avodonosov wrote:
         | How do you deal with false positives?
         | 
         | If the commit hook rejects anything where rules are triggered,
         | a way to force the commit is needed for the cases when the rule
         | finding is not really an issue.
         | 
         |  _Upd:_ I found that the --no-verify option can be used in many
         | cases
        
           | hn_throwaway_99 wrote:
           | You can put comments in the code to ignore eslint rules on a
           | specific line, or for the whole file.
        
       | realquadrant wrote:
       | Hi, this is very cool. I have been building up a suite of tools
       | to roll out across major open source projects to improve
       | security. I like what I have seen so far, this is a great use
       | case. Whom can I connect with to learn more? Also, how does it
       | compare with Sourcegraph, which I also like a lot?
        
       | shuringai wrote:
       | This is a much better alternative to CodeQL used by google and does
       | not use a shameless registration-only model! Thanks for sharing
        
         | shuringai wrote:
         | *github, not google my bad
        
       | unwind wrote:
       | When tools like this use terms like "legacy languages", and don't
       | show that C is supported unless you click "More Languages", it
       | makes me feel old. :)
       | 
       | Still, it seems rather cool, I like the idea of being able to
       | search code at a higher level than just raw source text.
        
       | minusf wrote:
       | probably doing something wrong but running the ci ruleset on a
       | tiny django hobby project made all cores spin at 100% after 33%
       | of the progress bar and made the OS almost unresponsive. ctrl-c
       | after 5 minutes and i still had to pkill every semgrep process...
       | never seen the M1 airbook overheat this much before.
        
       | leafmeal wrote:
       | What does this give you over writing a flake8 plugin (for Python
       | at least)?
       | 
       | I've found the flake8 API and documentation lacking, so perhaps
       | just a cleaner interface?
        
       | thesuperbigfrog wrote:
       | The name "Semantic Grep" does not give a good idea for what this
       | tool is and what it does.
       | 
       | The web page states: "Static analysis at ludicrous speed. Find
       | bugs and enforce code standards"
       | 
       | "grep" is short for "global regular expression print". It finds
       | matches for the given regular expression and prints them.
       | 
       | "Semantic Grep" is a static analyzer with configurable rules,
       | style checks, etc. It does much more than search and print.
       | 
       | Perhaps a better name is needed?
       | 
       | Edit: How about "omnilint" or "omnicritic" since semgrep is more
       | of a "lint" (https://en.wikipedia.org/wiki/Lint_(software)) or
       | "critic" (https://en.wikipedia.org/wiki/Perl::Critic) type of
       | tool that handles multiple languages?
       | 
       | Edit2: "Static analysis at ludicrous speed" ==> "turbolint"?
       | ("ludicrous speed" reminds me of the hilarious Spaceballs scene :)
       | "turbolint, GO!"
        
         | petters wrote:
         | The literal meaning of "grep" is not the only meaning. It also
         | means "find snippet in files."
        
           | thesuperbigfrog wrote:
           | Yes.
           | 
           | But at least to me, semgrep looks a lot more like "lint" than
           | "grep".
        
             | prepend wrote:
             | You have to find stuff before you lint it.
        
         | alpaca128 wrote:
         | I clicked on this thinking it's a grep that can search code
         | snippets based on language-aware syntax matching instead of
         | regular expressions.
         | 
         | Agreed, this project name is misleading about what it does. The
         | name "grep" always indicated some kind of "find a text/pattern
         | and print results to stdout" utility. Like pgrep, which
         | searches running processes by name and then prints their IDs.
        
           | underyx wrote:
           | > it's a grep that can search code snippets based on
           | language-aware syntax matching instead of regular
           | expressions.
           | 
           | Hey, I'm a maintainer of Semgrep, and this sounds like a
           | pretty good description of what the CLI can do, see this
           | example for finding all function/class/method calls:
           | 
           |     $ semgrep -e '$NAME(...)' -l python
           |     flask_todomvc/extensions.py
           |     4:db = SQLAlchemy()
           |     ------------------------------------------------------------
           |     5:security = Security()
           |     flask_todomvc/factory.py
           |     15:    app = Flask(__name__)
           |     ------------------------------------------------------------
           |     17:    app.config.from_object(settings)
           |     ------------------------------------------------------------
           |     18:    app.config.from_envvar('TODO_SETTINGS', silent=True)
        
             | alpaca128 wrote:
             | Oh, that's great to see. The website's presentation made a
             | different impression with its "enforce code standards"
             | angle.
             | 
             | Looks like a pretty useful tool with a couple nice options.
             | A bit strange that the `-e` option is only explained on the
             | website, but to be fair it seems to be a lot to cover.
             | Still, a kind of "cheat sheet" style summary in the help
             | message would be fantastic, just as a little suggestion.
        
         | smithza wrote:
         | There is the common, if informal, definition that grep means
         | "command line text search tool". I read this as
         | "semantic/syntax search tool".
        
           | thesuperbigfrog wrote:
           | But semgrep is much closer to a linter / critic program than
           | a grep program.
        
         | [deleted]
        
       | [deleted]
        
       | rmetzler wrote:
       | Looks like a useful tool for me and I would like to try it.
       | 
       | Go down, see "brew install semgrep" and try to copy paste it. And
       | it's an image :(
        
       | afro88 wrote:
       | No swift support yet. What would be involved in adding it?
        
       | westurner wrote:
       | Is there a more complete example of how to call semgrep from pre-
       | commit (which gets called before every git commit) in order to
       | prevent e.g. Python print calls (print(), print \\\n(), etc.)
       | from being checked in?
       | 
       | https://semgrep.dev/docs/extensions/ describes how to do pre-
       | commit.
       | 
       | Nvm, here's semgrep's own .pre-commit-config.yml for semgrep
       | itself:
       | https://github.com/returntocorp/semgrep/blob/develop/.pre-co...
        
         | theptip wrote:
         | I've never used the `pre-commit` framework, but it's really
         | simple to wire up arbitrary shell scripts; check out the
         | `.git/hooks` directory in your repo for samples, e.g.
         | `.git/hooks/pre-commit.sample`.
         | 
         | You can run any old shell script there, without having to
         | install a python tool.
        
       | more_corn wrote:
       | I used to use SAST-SCAN but that seems abandonware. I like that
       | this exists. Everyone should go from nothing to something in the
       | SAST space. A free/freemium tool/service for that is pretty
       | great. The first couple runs have found useful results.
        
       | layer8 wrote:
       | No Windows support yet:
       | https://github.com/returntocorp/semgrep/issues/1330
        
         | twh270 wrote:
         | From the thread you link it looks like they're getting close,
         | there's been activity in the past few days.
         | 
         | (I'm guessing from your comment that this is important to you
         | (i.e. WSL/Docker is not a solution).)
        
         | IshKebab wrote:
         | Kind of crazy that you can make a tool like this in the modern
         | age that isn't cross platform. Maybe they just can't face the
         | Python packaging nightmare on Windows.
         | 
         | Also kind of surprising it's written in Python given that they
         | advertise its speed.
        
           | dlukeomalley wrote:
           | I was surprised by this too, but as a maintainer of the
           | project we've had comparatively few Windows requests
           | compared to other systems. Qualitatively, for the folks who
           | have asked for Windows support, WSL has been a viable option
           | and reduced the support surface area for the small team
           | behind the tool.
           | 
           | On the speed front, the core of Semgrep is OCaml, with a
           | Python wrapper to add niceties like pattern composition. I'm
           | less close to that part of the codebase, but think more and
           | more logic is being moved to OCaml for performance reasons.
        
           | prepend wrote:
           | I'm always surprised at stuff I take for granted that doesn't
           | work on Windows. So yeah, it seems like cross-platform should
           | be easy but since my dev environment is zsh, it's easy for my
           | stuff to work sort of "everywhere but Windows."
           | 
           | Add to that that the reason things fail on Windows is usually
           | something Windows specific and "their fault." So it's unusual
           | for me to fire up a Windows VM just to sanity check my code.
           | And since Windows CI runners cost 2x or more, I don't usually
           | run cross platform CI.
        
             | IshKebab wrote:
             | This is an open source project on Github. They can use
             | Github Actions which has free runners for Windows, Mac and
             | Linux.
        
               | prepend wrote:
               | Windows runners consume 2x minutes, Mac runners consume
               | 10x minutes [0].
               | 
               | Free projects only get 2,000 minutes per month so running
               | on all three platforms means you only get 1/13th of the
               | minutes of only using Linux.
               | 
               | I rarely use anything but Linux runners even on my paid
               | projects. I like saving money, so unless I really need
               | integration testing on Windows or Mac, I don't do it.
               | 
               | [0] https://docs.github.com/en/github/setting-up-and-
               | managing-bi...
        
       | silasb wrote:
       | Just the tool that I was looking for. We are looking to do
       | Service linting in our organization as a method of making sure
       | our services don't drift too far apart.
       | 
       | Anyone else know of a Service linting tool? OPA/conftest come
       | close but lack syntax parsers for Ruby/Javascript.
        
       | jhgb wrote:
       | Isn't "grep for code" called just "grep"?
        
         | vinceguidry wrote:
         | Tagline appears to be submitter's, not the project's.
        
           | mynegation wrote:
           | ievans works for r2c
        
         | ievans wrote:
         | Semgrep started off as a "syntactic grep" but has increasingly
         | become more semantic. So if you want to find all calls to foo
         | that have 1 as the first argument, you just search for foo(1)
         | and even things like x = 1; foo(x); will match.
         | 
         | Here's an elaborate example:
         | https://semgrep.dev/s/ievans:c-dataflow
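
To illustrate what "x = 1; foo(x); will match" involves, here is a toy constant-propagation sketch over Python's standard `ast` module. It is an assumption-laden illustration, not Semgrep's implementation: `calls_with_constant_first_arg` is a hypothetical helper, it only tracks module-level `name = <literal>` assignments, and it simply forgets a name once that name is reassigned or assigned inside a compound statement.

```python
import ast


def calls_with_constant_first_arg(source: str, func_name: str, value) -> list[int]:
    """Line numbers of calls func_name(value, ...), resolving simple constants."""
    constants = {}  # name -> literal value known at this point in the module
    hits = []
    for stmt in ast.parse(source).body:
        stored = {
            n.id for n in ast.walk(stmt)
            if isinstance(n, ast.Name) and isinstance(n.ctx, ast.Store)
        }
        if hasattr(stmt, "body"):
            # Compound statement (if/for/def/...): be conservative and forget
            # anything it may assign before looking at the calls inside it.
            for name in stored:
                constants.pop(name, None)
        for node in ast.walk(stmt):
            if (isinstance(node, ast.Call) and isinstance(node.func, ast.Name)
                    and node.func.id == func_name and node.args):
                arg = node.args[0]
                if isinstance(arg, ast.Constant):
                    resolved = arg.value
                elif isinstance(arg, ast.Name) and arg.id in constants:
                    resolved = constants[arg.id]
                else:
                    continue
                if type(resolved) is type(value) and resolved == value:
                    hits.append(node.lineno)
        # Apply this statement's assignments only after checking its calls.
        for name in stored:
            constants.pop(name, None)
        if (isinstance(stmt, ast.Assign) and len(stmt.targets) == 1
                and isinstance(stmt.targets[0], ast.Name)
                and isinstance(stmt.value, ast.Constant)):
            constants[stmt.targets[0].id] = stmt.value.value
    return hits


SAMPLE = "x = 1\nfoo(x)\nfoo(1)\ny = 2\nfoo(y)\nx = 3\nfoo(x)\n"
print(calls_with_constant_first_arg(SAMPLE, "foo", 1))  # [2, 3]
```

Even this small version needs rules about when a tracked value stops being trustworthy, which is the kind of work a purely textual search cannot do.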
        
           | tyingq wrote:
           | It does seem potentially good for enforcing standards where
           | the participants are willing. But you can work around it
           | fairly easily. Like the example "python no-prints" rule:
           | https://semgrep.dev/s/sabihb:no-prints
           | 
           | Lots of workarounds it wouldn't find, like:
           |     import builtins
           |     builtins.print("whee")
        
             | sdesol wrote:
             | I'm guessing the value with the rules system is, you can
             | add new rules easily. So during a code review, if you see
             | somebody using your example, you could create a new rule to
             | catch that.
             | 
             | I don't think you can go in with the mindset that it will
             | catch everything, but rather, it's about being able to
             | iterate quickly with your rules.
        
               | tyingq wrote:
               | Sure. I raised it because it keeps using the word "static
               | analysis", which at least wouldn't be fooled by
               | builtins.print(). It's more than grep, for sure, but
               | something less than static analysis tools I've used. And
               | I don't mean that as a knock. It's working across a lot
               | of languages, so I see the tradeoff.
        
           | dang wrote:
           | Ok, I've put the word 'semantic' up there so we can not get
           | hung up on title stuff. (Submitted title said "Like Grep but
           | for Code".)
        
           | jhgb wrote:
           | That's called "a code walker", isn't it?
        
       | vlovich123 wrote:
       | I want the ease of use of their AST specification with the power
       | of clang's refactor tool. Has anyone attempted to do that?
        
       | SavantIdiot wrote:
       | Since the capability has never existed, I don't think in terms of
       | being able to semgrep. If that makes any sense. My brain is not
       | wired this way, yet.
       | 
       | Like, if you've never tasted lychee, it would never occur to you
       | how to cook with it.
       | 
       | I'm going to need to see some useful, real-world examples to
       | jumpstart my brain to think this way.
        
         | saagarjha wrote:
         | You can cook with lychee?
        
         | underyx wrote:
         | Hey, I work on Semgrep. As a real world example, I just noticed
         | today that Hashicorp uses a whole bunch of Semgrep rules on
         | terraform-provider-aws[0]. I'd recommend reading the `message`
         | keys to know what they intend to match, and then the `patterns`
         | lists below to see how that's accomplished.
         | 
         | Alternatively, we curate 1000+ community rules that you can
         | look through as well.[1]
         | 
         | [0]: https://github.com/hashicorp/terraform-provider-
         | aws/blob/mai...
         | 
         | [1]: https://semgrep.dev/r
        
           | SavantIdiot wrote:
           | Nice! Thanks. This will certainly help me start thinking
           | in semantic grep. I can see this being an additional coverage
           | tool and am eager to study it.
        
       | pantuza wrote:
       | Those guardrail rules from semgrep are really outstanding. Useful
       | for enforcing code standards. Thanks for sharing the tool.
        
       | eric_fib wrote:
       | grep grep
        
       | kesterallen wrote:
       | Typo in the "Trying Semgrep" screenshot ("ruleste"):
       | https://semgrep.dev/static/media/Step1.df848497.png
        
       | enriquto wrote:
       | > You need to enable JavaScript to run this app.
       | 
       | Wait, is this a web app? I was expecting a command line tool to
       | navigate my code locally.
        
         | exdsq wrote:
         | There's a demo on the site, I assume that's it?
        
           | enriquto wrote:
           | No, it's just the landing page. Apparently it does not let
           | you see it without running some javascript.
        
             | sahkopoyta wrote:
             | Well the demo is there on the landing page
        
         | thomasahle wrote:
         | The cli is here: https://semgrep.dev/docs/getting-started/
         | 
         | You can write stuff like
         | 
         |     # Check for Python == where the left and right hand sides
         |     # are the same (often a bug)
         |     $ semgrep -e '$X == $X' --lang=py path/to/src
        
           | enriquto wrote:
           | Cool! But this example is a bit simplistic since it can be
           | done just as easily by regular grep:
           | 
           |     grep -E '(.+) == \1' *.py
           | 
           | I have trouble looking at the examples in the project website
           | (many things inside iframes are adblocked). Do you have any
           | example of a search that would be difficult or impossible
           | with grep?
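
One way to see where a syntax-aware match differs from a line-oriented regex is to write the `$X == $X` check against a parsed tree. The sketch below uses Python's standard `ast` module; `self_comparisons` and the sample source are hypothetical, and this is only an illustration of the idea, not Semgrep's matcher. Because it compares parsed operands, spacing and line breaks inside the comparison do not matter, and text inside string literals or comments is never treated as a comparison.

```python
import ast


def self_comparisons(source: str) -> list[int]:
    """Line numbers of `X == X` comparisons whose two operands are structurally equal."""
    hits = []
    for node in ast.walk(ast.parse(source)):
        if (isinstance(node, ast.Compare) and len(node.ops) == 1
                and isinstance(node.ops[0], ast.Eq)):
            # ast.dump omits line/column attributes by default, so equal dumps
            # mean the two operands have the same structure.
            if ast.dump(node.left) == ast.dump(node.comparators[0]):
                hits.append(node.lineno)
    return sorted(hits)


SAMPLE = '''\
if user.id   ==   user.id:
    pass
if (total
        == total):
    pass
note = "a == a"  # b == b
if left == right:
    pass
'''

print(self_comparisons(SAMPLE))  # [1, 3]
```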
        
             | fragmede wrote:
             | If you're that capable with grep, surely you've run into
             | its shortcomings and can appreciate attempts to improve
             | the status quo. Like trying to find usages of a variable
             | that is a substring of a word. Like trying to fix code
             | where somebody else has used the variable name _i_. Sure,
             | you can add matching on word boundaries, but then it starts
             | to become a distraction from the original task at hand.
        
             | SavantIdiot wrote:
             | It can infer (x==y) if x=1 and y=1, which grep cannot
             | do.
        
               | enriquto wrote:
               | This must surely fail... isn't it equivalent to solving
               | the halting problem? Or does it run the whole program
               | like a debugger?
        
               | Spivak wrote:
               | It fails pretty fast.
               | 
               |     import random
               |     y = 0
               | 
               |     def f(x):
               |         print(x)
               | 
               |     if random.randint(0,1) == 2:
               |         y = 1
               |     f(y)
               | 
               | This not only fails but crashes the program.
        
               | Quekid5 wrote:
               | If false negatives are permitted the halting problem
               | isn't an issue.
        
         | [deleted]
        
       ___________________________________________________________________
       (page generated 2021-04-22 23:00 UTC)