[HN Gopher] Semgrep: Semantic grep for code
___________________________________________________________________
Semgrep: Semantic grep for code
Author : ievans
Score : 253 points
Date : 2021-04-22 16:51 UTC (6 hours ago)
(HTM) web link (semgrep.dev)
| pabs3 wrote:
| Does it come with a standard set of rules that finds bad code
| without any false positives out of the box? Or is it more of a
| tool for people doing code security audits & pentesting who know
| what they are looking for and want to read the surrounding code?
| hyper_reality wrote:
| This is an excellent tool to have as a security consultant, and
| it just keeps getting better and better. When approaching a large
| codebase, it enables you to write custom rules that match on
| certain antipatterns you've spotted that may be unique to the
| codebase. That's the real value of the tool, but the repository
| of per-language rules is also convenient for quickly finding low-
| hanging fruit (like every use of a potentially injectable
| function such as exec,system,etc. in PHP).
|
| For example, a webapp may have been designed such that
| authorisation needs to be explicitly added with a line or two to
| each controller. A semgrep rule can be written to match all the
| controllers which are missing this line. Then these controllers
| can be manually reviewed to assess whether unauthorised access
| should be allowed. Depending on what you are trying to match,
| this is something that may be very complex or even impossible to
| implement accurately in plain grep. Some languages like Ruby have
| powerful static analysis tools (Brakeman) that can also do this,
| but the benefit of Semgrep is the flexibility across multiple
| languages and how readable the rulesets are. [1]
|
| [1] https://blog.includesecurity.com/2021/01/custom-static-
| analy...
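|
| As a rough illustration of the "controllers missing an
| authorisation line" idea, here is a minimal sketch using Python's
| standard ast module rather than Semgrep itself. The helper name
| `require_auth` and the `*_controller` naming convention are
| assumptions invented for this example, not something from the
| comment above or from any real codebase.

```python
import ast

# Hypothetical controller source: `require_auth` and the `*_controller`
# naming convention are assumptions made for this sketch.
SOURCE = '''
def list_users_controller(request):
    require_auth(request)
    return "users"

def export_controller(request):
    return "export"
'''


def controllers_missing_auth(source, auth_name="require_auth"):
    """Return names of *_controller functions that never call auth_name."""
    missing = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.FunctionDef) and node.name.endswith("_controller"):
            # Collect the names of every plain function called in the body.
            calls = {
                n.func.id
                for n in ast.walk(node)
                if isinstance(n, ast.Call) and isinstance(n.func, ast.Name)
            }
            if auth_name not in calls:
                missing.append(node.name)
    return missing


print(controllers_missing_auth(SOURCE))
```

| The flagged functions are only candidates for manual review, as
| described above; the sketch does not decide whether unauthorised
| access is acceptable.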
| robertlagrant wrote:
| Python you can probably do that just from the standard library.
| ast.parse and all that.
| smithza wrote:
| Perhaps, but this tool has a syntax and common interface for
| multiple languages. This is a huge resource for IT/security.
| I could imagine this being a helpful refactoring tool as well
| for CLI.
| robertlagrant wrote:
| No need to overstate. It's fine, but the rules seem to need
| to be very bespoke, and any language's parser will do a
| better job of figuring out syntax.
| sytse wrote:
| I agree it is an excellent tool. At GitLab we released our
| Semgrep integration today
| https://news.ycombinator.com/item?id=26903114
|
| "GitLab SAST historically has been powered by over a dozen
| open-source static analysis security analyzers. These analyzers
| have proactively identified millions of vulnerabilities for
| developers using GitLab every month. Each of these analyzers is
| language-specific and has different technology approaches to
| scanning. These differences produce overhead for updating,
| managing, and maintaining additional features we build on top
| of these tools, and they create confusion for anyone attempting
| to debug.
|
| The GitLab Static Analysis team is continuously evaluating new
| security analyzers. We have been impressed by a relatively new
| tool from the development team at r2c called Semgrep. It's a
| fast, open-source, static analysis tool for finding bugs and
| enforcing code standards. Semgrep's rules look like the code
| you are searching for; this means you can write your own rules
| without having to understand abstract syntax trees (ASTs) or
| wrestle with regexes.
|
| Semgrep's flexible rule syntax is ideal for streamlining
| GitLab's Custom Rulesets feature for extending and modifying
| detection rules, a popular request from GitLab SAST customers.
| Semgrep also has a growing open-source registry of 1,000+
| community rules.
|
| We are in the process of transitioning many of our lint-based
| SAST analyzers to Semgrep. This transition will help increase
| stability, performance, rule coverage, and allow GitLab
| customers access to Semgrep's community rules and additional
| custom ruleset capabilities that we will be adding in the
| future. We have enjoyed working with the r2c team and we cannot
| wait to transition more of our analyzers to Semgrep. You can
| read more in our transition epic, or try out our first
| experimental Semgrep analyzers for JavaScript, TypeScript, and
| Python.
|
| We are excited about what this transition means for the future
| of GitLab SAST and the larger Semgrep community. GitLab will be
| contributing to the Semgrep open-source project including
| additional rules to ensure coverage matches or exceeds our
| existing analyzers."
| tyingq wrote:
| I'd be careful with how much of a warm fuzzy the tool gives
| you. See this example from my other comment in the thread:
| https://news.ycombinator.com/item?id=26905880 If it were really
| looking at AST level data, that wouldn't have fooled it.
|
| I suspect there would be similar issues with your example of
| ensuring no use of eval() in PHP. So it seems okay to keep your
| own developers informed, but I wouldn't use it, alone, to vet
| outside code. PHP has eval-like functionality buried in
| preg_replace(), assert(), and probably other places. This tool
| also doesn't seem to dig into namespaced "aliases".
| aseipp wrote:
| > If it were really looking at AST level data, that wouldn't
| have fooled it.
|
| Semgrep does look at an AST; but that counterexample is not
| something you can "fix" solely by looking at an AST. You need
| actual Python-specific semantic analysis that knows that all
| "open" functions like 'print' come from the builtins module,
| and thus are bound to the same identifier. They're literally
| built into the implementation, it's not something you can
| "discover" from analyzing existing Python source. Even if you
| had a perfectly accurate python AST it couldn't "tell" you
| this fact, it's a priori knowledge, and all analysis engines
| need a base set of facts like this that they work from.
|
| > but I wouldn't use it, alone, to vet outside code.
|
| I mean, nobody seems to be suggesting this though, and the OP
| quite literally stated the major value of the tool is
| enforcing domain/codebase-specific rules among a team. Which
| is a really good use for it! There are tons of little useful
| patterns you can codify this way.
| tyingq wrote:
| >Semgrep does look at an AST; but that counterexample is
| not something you can "fix" solely by looking at an AST
|
| Perhaps I worded it poorly. Dumping the python AST for
| builtins.print() makes it pretty clear that it's "print"
| though. So I'm curious why that skirts the rule.
|
| >I mean, nobody seems to be suggesting this though
|
| Not specifically, but the context is using it for security
| purposes with phrases like _" every use of a potentially
| injectable function such as exec,system,etc. in PHP"_. Felt
| like that was worth commenting on.
| joshuamorton wrote:
| > Perhaps I worded it poorly. Dumping the python AST for
| builtins.print() makes it pretty clear that it's "print"
| though. So I'm curious why that skirts the rule.
|
| To echo the other replies, the AST for builtins.print()
| is the same as the ast for mymodule.print() and, in fact,
| if you stick a builtins.py in the right place, you'll be
| able to prevent the import of the standard library
| builtin module, while the ast's would be _identical_.
| dataangel wrote:
| > Dumping the python AST for builtins.print() makes it
| pretty clear that it's "print" though.
|
| No, the AST only tells you it's a method call on
| something called "builtins." You need the separate
| semantic knowledge of what builtins is in order to figure
| it out. Parsing + AST just means it sees "method call of
| `print` on `builtins` object". Regular print calls would
| come through as "regular function call of `print`".
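|
| The distinction can be seen with Python's own ast module. This is
| a small sketch, not Semgrep's implementation; `call_shape` is a
| helper name made up for the example. It shows that the tree
| records only names and attribute access, so `builtins.print` and
| `mymodule.print` have the same shape.

```python
import ast


def call_shape(expr):
    """Describe the callee of a single call expression."""
    call = ast.parse(expr, mode="eval").body
    func = call.func
    if isinstance(func, ast.Name):
        return f"function call of {func.id}"
    if isinstance(func, ast.Attribute) and isinstance(func.value, ast.Name):
        return f"attribute call of {func.attr} on {func.value.id}"
    return "other"


print(call_shape('print("whee")'))
print(call_shape('builtins.print("whee")'))
print(call_shape('mymodule.print("whee")'))
```

| Nothing in the tree says what `builtins` or `mymodule` refers to;
| that has to come from separate knowledge about the language.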
| brundolf wrote:
| I'm honestly surprised we haven't seen complete democratization
| of linting/grepping/refactoring tools. Imagine being able to
| script a one-off refactor you want to make that's too project-
| specific to be included in the IDE itself.
|
| Writing the parser is nontrivial, but once you have it it
| should be straightforward to expose a programmatic API for
| doing this stuff instead of trying to hardcode every useful
| linter rule into a single program.
| cjohansson wrote:
| Writing a parser for one specific version of a language is
| one thing (nontrivial), but writing a parser that parses most
| versions of most programming languages and keeping it up-to-
| date is an enormous undertaking and cannot be done by one
| person
| vbsteven wrote:
| You can do this with Rust using a combination of rustc, syn
| and a build.rs file. Then you can execute more Rust code on
| the parsed AST.
|
| In the JVM world annotation processors or compiler plugins
| can have access to the AST during compilation. Both in Kotlin
| and Java.
| lamontcg wrote:
| We do this with rubocop and its ability to parse the ruby
| AST. The API to actually write rules in rubocop is not
| particularly the best, but it has been highly successful in
| codebase and domain-specific ruby linting and autofixing.
| hn_throwaway_99 wrote:
| I currently use a highly opinionated ESLint config (based on the
| airbnb one) together with strict checking in my TypeScript
| config, and it is configured to run on every commit with husky
| git hooks. The example given on the Semgrep homepage is an exact
| match to one that exists in my ESLint config (eslint's no-console
| rule).
|
| How does Semgrep compare to ESLint+a strict tsconfig?
| RichieAHB wrote:
| It seems from reading the docs that the "semantic" side is
| important here. It can track things like redefinitions.
|
| E.g. const output = console.log; output("hello");
|
| This wouldn't be caught by ESLint (to my knowledge) but would
| be caught by Semgrep. I think you could do it with ESLint but
| given the interface for an ESLint plugin exposes an AST, you'd
| have to track this yourself. I'm assuming Semgrep could stretch
| to things like enforcing APIs are called with certain optional
| arguments present (even if the TS types don't require it).
| Again, I think with ESLint you'd have to do more juggling with
| the raw AST.
|
| That said, this is my understanding from a quick skim!
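|
| To give a feel for the "track this yourself" work on a raw AST,
| here is a Python analogue of the aliasing example (`output =
| print; output("hello")`), written against the standard ast module.
| It is a deliberately naive sketch: `find_print_calls` is a made-up
| helper, it ignores scoping and assignment order, and it is not how
| Semgrep or ESLint work internally.

```python
import ast


def find_print_calls(source):
    """Return line numbers of calls to print, following `a = print` aliases.

    Flow-insensitive: any name that is ever assigned from print (or from
    another alias) is treated as an alias everywhere in the file.
    """
    tree = ast.parse(source)
    aliases = {"print"}
    changed = True
    while changed:  # repeat until no new alias is discovered
        changed = False
        for node in ast.walk(tree):
            if (
                isinstance(node, ast.Assign)
                and isinstance(node.value, ast.Name)
                and node.value.id in aliases
            ):
                for target in node.targets:
                    if isinstance(target, ast.Name) and target.id not in aliases:
                        aliases.add(target.id)
                        changed = True
    return sorted(
        node.lineno
        for node in ast.walk(tree)
        if isinstance(node, ast.Call)
        and isinstance(node.func, ast.Name)
        and node.func.id in aliases
    )


print(find_print_calls('output = print\noutput("hello")\nlen("x")\n'))
```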
| avodonosov wrote:
| How do you deal with false positives?
|
| If the commit hook rejects anything where rules are triggered,
| a way to force the commit is needed for the cases when the rule
| finding is not really an issue.
|
| _Upd:_ I found that the --no-verify option can be used in many
| cases
| hn_throwaway_99 wrote:
| You can put comments in the code to ignore eslint rules on a
| specific line, or for the whole file.
| realquadrant wrote:
| Hi, this is very cool. I have been building up a suite of tools
| to roll out across major open source projects to improve
| security. I like what I have seen so far, this is a great use
| case. Whom can I connect with to learn more? And how does it
| compare with Sourcegraph, which I also like a lot?
| shuringai wrote:
| This is a much better alternative to codeQL used by google and
| does not use a shameless registration-only model! Thanks for sharing
| shuringai wrote:
| *github, not google my bad
| unwind wrote:
| When tools like this use terms like "legacy languages", and don't
| show that C is supported unless you click "More Languages", it
| makes me feel old. :)
|
| Still, it seems rather cool, I like the idea of being able to
| search code at a higher level than just raw source text.
| minusf wrote:
| probably doing something wrong but running the ci ruleset on a
| tiny django hobby project made all cores spin at 100% after 33%
| of the progress bar and made the OS almost unresponsive. ctrl-c
| after 5 minutes and i still had to pkill every semgrep process...
| never seen the M1 airbook overheat this much before.
| leafmeal wrote:
| What does this give you over writing a flake8 plugin (for Python
| at least)?
|
| I've found the flake8 API and documentation lacking, so perhaps
| just a cleaner interface?
| thesuperbigfrog wrote:
| The name "Semantic Grep" does not give a good idea of what this
| tool is and what it does.
|
| The web page states: "Static analysis at ludicrous speed. Find
| bugs and enforce code standards"
|
| "grep" is short for "global regular expression print". It finds
| matches for the given regular expression and prints them.
|
| "Semantic Grep" is a static analyzer with configurable rules,
| style checks, etc. It does much more than search and print.
|
| Perhaps a better name is needed?
|
| Edit: How about "omnilint" or "omnicritic" since semgrep is more
| of a "lint" (https://en.wikipedia.org/wiki/Lint_(software)) or
| "critic" (https://en.wikipedia.org/wiki/Perl::Critic) type of
| tool that handles multiple languages?
|
| Edit2: "Static analysis at ludicrous speed" ==> "turbolint"?
| ("ludicrous speed" reminds me of the hilarious Spaceballs scene :)
| "turbolint, GO!"
| petters wrote:
| The literal meaning of "grep" is not the only meaning. It also
| means "find snippet in files."
| thesuperbigfrog wrote:
| Yes.
|
| But at least to me, semgrep looks a lot more like "lint" than
| "grep".
| prepend wrote:
| You have to find stuff before you lint it.
| alpaca128 wrote:
| I clicked on this thinking it's a grep that can search code
| snippets based on language-aware syntax matching instead of
| regular expressions.
|
| Agreed, this project name is misleading about what it does. The
| name "grep" always indicated some kind of "find a text/pattern
| and print results to stdout" utility. Like pgrep, which
| searches running processes by name and then prints their IDs.
| underyx wrote:
| > it's a grep that can search code snippets based on
| language-aware syntax matching instead of regular
| expressions.
|
| Hey, I'm a maintainer of Semgrep, and this sounds like a
| pretty good description of what the CLI can do, see this
| example for finding all function/class/method calls:
|     $ semgrep -e '$NAME(...)' -l python
|     flask_todomvc/extensions.py
|     4:db = SQLAlchemy()
|     ------------------------------------------------------------
|     5:security = Security()
|     flask_todomvc/factory.py
|     15: app = Flask(__name__)
|     ------------------------------------------------------------
|     17: app.config.from_object(settings)
|     ------------------------------------------------------------
|     18: app.config.from_envvar('TODO_SETTINGS', silent=True)
| alpaca128 wrote:
| Oh, that's great to see. The website's presentation made a
| different impression with its "enforce code standards"
| angle.
|
| Looks like a pretty useful tool with a couple nice options.
| A bit strange that the `-e` option is only explained on the
| website, but to be fair it seems to be a lot to cover.
| Still, a kind of "cheat sheet" style summary in the help
| message would be fantastic, just as a little suggestion.
| smithza wrote:
| There is the common, if informal, definition that grep means
| "command line text search tool". I read this as
| "semantic/syntax search tool".
| thesuperbigfrog wrote:
| But semgrep is much closer to a linter / critic program than
| a grep program.
| rmetzler wrote:
| Looks like a useful tool for me and I would like to try it.
|
| Go down, see "brew install semgrep" and try to copy paste it. And
| it's an image :(
| afro88 wrote:
| No swift support yet. What would be involved in adding it?
| westurner wrote:
| Is there a more complete example of how to call semgrep from pre-
| commit (which gets called before every git commit) in order to
| prevent e.g. Python print calls (print(), print \\\n(), etc.)
| from being checked in?
|
| https://semgrep.dev/docs/extensions/ describes how to do pre-
| commit.
|
| Nvm, here's semgrep's own .pre-commit-config.yml for semgrep
| itself:
| https://github.com/returntocorp/semgrep/blob/develop/.pre-co...
| theptip wrote:
| I've never used the `pre-commit` framework, but it's really
| simple to wire up arbitrary shell scripts; check out the
| `.git/hooks` directory in your repo for samples, e.g.
| `.git/hooks/pre-commit.sample`.
|
| You can run any old shell script there, without having to
| install a python tool.
| more_corn wrote:
| I used to use SAST-SCAN but that seems abandonware. I like that
| this exists. Everyone should go from nothing to something in the
| SAST space. A free/freemium tool/service for that is pretty
| great. The first couple runs have found useful results.
| layer8 wrote:
| No Windows support yet:
| https://github.com/returntocorp/semgrep/issues/1330
| twh270 wrote:
| From the thread you link it looks like they're getting close,
| there's been activity in the past few days.
|
| (I'm guessing from your comment that this is important to you
| (i.e. WSL/Docker is not a solution).)
| IshKebab wrote:
| Kind of crazy that you can make a tool like this in the modern
| age that isn't cross platform. Maybe they just can't face the
| Python packaging nightmare on Windows.
|
| Also kind of surprising it's written in Python given that they
| advertise its speed.
| dlukeomalley wrote:
| I was surprised by this too, but as a maintainer of the
| project I can say we've had relatively few Windows requests
| compared to other systems. Qualitatively, for the folks who
| have asked for Windows support, WSL has been a viable option
| and reduced the support surface area for the small team
| behind the tool.
|
| On the speed front, the core of Semgrep is OCaml, with a
| Python wrapper to add niceties like pattern composition. I'm
| less close to that part of the codebase, but think more and
| more logic is being moved to OCaml for performance reasons.
| prepend wrote:
| I'm always surprised at stuff I take for granted that doesn't
| work on Windows. So yeah, it seems like cross-platform should
| be easy but since my dev environment is zsh, it's easy for my
| stuff to work sort of "everywhere but Windows."
|
| Add to that that the reason things fail on Windows is usually
| something Windows specific and "their fault." So it's unusual
| for me to fire up a Windows VM just to sanity check my code.
| And since Windows CI runners cost 2x or more, I don't usually
| run cross platform CI.
| IshKebab wrote:
| This is an open source project on Github. They can use
| Github Actions which has free runners for Windows, Mac and
| Linux.
| prepend wrote:
| Windows runners consume 2x minutes, Mac runners consume
| 10x minutes [0].
|
| Free projects only get 2,000 minutes per month so running
| on all three platforms means you only get 1/13th of the
| minutes of only using Linux.
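|
| The 1/13th figure follows from adding the multipliers. A small
| worked version, assuming each platform runs the same job for the
| same wall-clock time (the multipliers and the 2,000-minute quota
| are the ones quoted above):

```python
# Billing multipliers and free quota as quoted in the comment above.
MULTIPLIERS = {"linux": 1, "windows": 2, "macos": 10}
FREE_MINUTES = 2000

# One wall-clock minute on every platform costs 1 + 2 + 10 billed minutes.
cost_per_wall_minute = sum(MULTIPLIERS.values())

# Wall-clock minutes available per platform when all three are used.
minutes_per_platform = FREE_MINUTES / cost_per_wall_minute

print(cost_per_wall_minute)            # 13
print(round(minutes_per_platform, 1))  # 153.8
```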
|
| I rarely use anything but Linux runners even on my paid
| projects. I like saving money, so unless I really need
| integration testing on Windows or Mac, I don't do it.
|
| [0] https://docs.github.com/en/github/setting-up-and-
| managing-bi...
| silasb wrote:
| Just the tool that I was looking for. We are looking to do
| Service linting in our organization as a method of making sure
| our services don't drift too far apart.
|
| Anyone else know of a Service linting tool? OPA/conftest come
| close but lack syntax parsers for Ruby/Javascript.
| jhgb wrote:
| Isn't "grep for code" called just "grep"?
| vinceguidry wrote:
| Tagline appears to be submitter's, not the project's.
| mynegation wrote:
| ievans works for r2c
| ievans wrote:
| Semgrep started off as a "syntactic grep" but has increasingly
| become more semantic. So if you want to find all calls to foo
| that have 1 as the first argument, you just search for foo(1)
| and even things like x = 1; foo(x); will match.
|
| Here's an elaborate example:
| https://semgrep.dev/s/ievans:c-dataflow
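|
| For intuition only, the `x = 1; foo(x)` case can be approximated
| in a few lines of Python with the standard ast module. This is a
| sketch of very basic constant propagation, not Semgrep's actual
| algorithm: `calls_with_first_arg` is a made-up helper, and it only
| trusts names that are assigned a literal exactly once via a plain
| `=` assignment.

```python
import ast


def calls_with_first_arg(source, func_name, value):
    """Line numbers of func_name(...) calls whose first argument is `value`,
    either as a literal or via a name assigned that constant exactly once."""
    tree = ast.parse(source)

    # Record every value assigned to each simple name.
    assigned = {}
    for node in ast.walk(tree):
        if isinstance(node, ast.Assign):
            for target in node.targets:
                if isinstance(target, ast.Name):
                    assigned.setdefault(target.id, []).append(node.value)

    # Keep only names with a single assignment of a literal constant.
    constants = {
        name: values[0].value
        for name, values in assigned.items()
        if len(values) == 1 and isinstance(values[0], ast.Constant)
    }

    hits = []
    for node in ast.walk(tree):
        if (
            isinstance(node, ast.Call)
            and isinstance(node.func, ast.Name)
            and node.func.id == func_name
            and node.args
        ):
            arg = node.args[0]
            if isinstance(arg, ast.Constant):
                known = arg.value
            elif isinstance(arg, ast.Name) and arg.id in constants:
                known = constants[arg.id]
            else:
                continue  # value unknown: report nothing rather than guess
            if known == value:
                hits.append(node.lineno)
    return sorted(hits)


SAMPLE = "x = 1\nfoo(x)\nfoo(1)\ny = 2\nfoo(y)\nz = 1\nz = 3\nfoo(z)\n"
print(calls_with_first_arg(SAMPLE, "foo", 1))
```

| Names that are reassigned (like `z` in the sample) are skipped, so
| the sketch prefers missing a match over reporting a wrong one.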
| tyingq wrote:
| It does seem potentially good for enforcing standards where
| the participants are willing. But you can work around it
| fairly easily. Like the example "python no-prints" rule:
| https://semgrep.dev/s/sabihb:no-prints
|
| Lots of workarounds it wouldn't find, like:
|     import builtins
|     builtins.print("whee")
| sdesol wrote:
| I'm guessing the value with the rules system is, you can
| add new rules easily. So during a code review, if you see
| somebody using your example, you could create a new rule to
| catch that.
|
| I don't think you can go in with the mindset that it will
| catch everything, but rather, it's about being able to
| iterate quickly with your rules.
| tyingq wrote:
| Sure. I raised it because it keeps using the term "static
| analysis", which at least wouldn't be fooled by
| builtins.print(). It's more than grep, for sure, but
| something less than static analysis tools I've used. And
| I don't mean that as a knock. It's working across a lot
| of languages, so I see the tradeoff.
| dang wrote:
| Ok, I've put the word 'semantic' up there so we can not get
| hung up on title stuff. (Submitted title said "Like Grep but
| for Code".)
| jhgb wrote:
| That's called "a code walker", isn't it?
| vlovich123 wrote:
| I want the ease of use of their AST specification with the power
| of clang's refactor tool. Has anyone attempted to do that?
| SavantIdiot wrote:
| Since the capability has never existed, I don't think in terms of
| being able to semgrep. If that makes any sense. My brain is not
| wired this way, yet.
|
| Like, if you've never tasted lychee, it would never occur to you
| how to cook with it.
|
| I'm going to need to see some useful, real-world examples to
| jumpstart my brain to think this way.
| saagarjha wrote:
| You can cook with lychee?
| underyx wrote:
| Hey, I work on Semgrep. As a real world example, I just noticed
| today that Hashicorp uses a whole bunch of Semgrep rules on
| terraform-provider-aws[0]. I'd recommend reading the `message`
| keys to know what they intend to match, and then the `patterns`
| lists below to see how that's accomplished.
|
| Alternatively, we curate 1000+ community rules that you can
| look through as well.[1]
|
| [0]: https://github.com/hashicorp/terraform-provider-
| aws/blob/mai...
|
| [1]: https://semgrep.dev/r
| SavantIdiot wrote:
| Nice! Thanks. This will certainly help me start thinking
| in semantic grep. I can see this being an additional coverage
| tool and am eager to study it.
| pantuza wrote:
| Really outstanding those guardrails rules from semgrep. Useful to
| enforce code. Thanks for sharing the tool.
| eric_fib wrote:
| grep grep
| kesterallen wrote:
| Typo in the "Trying Semgrep" screenshot ("ruleste"):
| https://semgrep.dev/static/media/Step1.df848497.png
| enriquto wrote:
| > You need to enable JavaScript to run this app.
|
| Wait, is this a web app? I was expecting a command line tool to
| navigate my code locally.
| exdsq wrote:
| There's a demo on the site, I assume that's it?
| enriquto wrote:
| No, it's just the landing page. Apparently it does not allow
| to see it without running some javascript.
| sahkopoyta wrote:
| Well the demo is there on the landing page
| thomasahle wrote:
| The cli is here: https://semgrep.dev/docs/getting-started/
|
| You can write stuff like:
|     # Check for Python == where the left and right hand sides
|     # are the same (often a bug)
|     $ semgrep -e '$X == $X' --lang=py path/to/src
| enriquto wrote:
| Cool! But this example is a bit simplistic since it can be
| done just as easily by regular grep:
|     grep -E '(.+) = \1' *.py
|
| I have trouble looking at the examples in the project website
| (many things inside iframes are adblocked). Do you have any
| example of a search that would be difficult or impossible
| with grep?
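|
| One way to see where a syntax-aware match differs from a regex is
| to compare expression trees instead of text. The following is a
| sketch with Python's standard ast module, not Semgrep itself, and
| `self_comparisons` is a made-up helper. It ignores whitespace
| differences between the two sides and does not fire on text
| inside string literals.

```python
import ast


def self_comparisons(source):
    """Line numbers of `X == X` comparisons where both sides are the
    same expression, compared structurally rather than textually."""
    hits = []
    for node in ast.walk(ast.parse(source)):
        if (
            isinstance(node, ast.Compare)
            and len(node.ops) == 1
            and isinstance(node.ops[0], ast.Eq)
            # ast.dump omits line/column info by default, so equal dumps
            # mean structurally identical expressions.
            and ast.dump(node.left) == ast.dump(node.comparators[0])
        ):
            hits.append(node.lineno)
    return sorted(hits)


SAMPLE = (
    'if a.b( 1 ) == a.b(1):\n'   # same expression, different spacing
    '    pass\n'
    's = "x == x"\n'             # only text inside a string literal
    'if a == b:\n'
    '    pass\n'
)
print(self_comparisons(SAMPLE))
```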
| fragmede wrote:
| If you're that capable with grep, surely you've run into
| its shortcomings and can appreciate attempts to improve
| the status quo. Like trying to find usages of a variable
| that is a substring of a word. Like trying to fix code
| where somebody else has used the variable name _i_. Sure,
| you can add matching on word boundaries, but then it starts
| to become a distraction from the original task at hand.
| SavantIdiot wrote:
| It can infer (x==y) if x=1 and y=1, which grep cannot
| do.
| enriquto wrote:
| This must surely fail... isn't it equivalent to solving
| the halting problem? Or does it run the whole program
| like a debugger?
| Spivak wrote:
| It fails pretty fast.
|     import random
|     y = 0
|     def f(x): print(x)
|     if random.randint(0,1) == 2: y = 1
|     f(y)
|
| This not only fails but crashes the program.
| Quekid5 wrote:
| If false negatives are permitted the halting problem
| isn't an issue.
___________________________________________________________________
(page generated 2021-04-22 23:00 UTC)