Semgrep: Lightweight static analysis for many languages

ievans · on July 22, 2020

I work on Semgrep; there are a bunch of examples at https://semgrep.live if you're curious about what the syntax looks like.

For context, Semgrep started as a Facebook open-source project inspired by an Inria project named Coccinelle, which has made a couple thousand or so automatic patches to the Linux kernel over the years using a semantic patch language (http://coccinelle.lip6.fr/sp.php).

ipsum2 · on July 22, 2020

> Semgrep started as a Facebook open-source project

Which project was this? I haven't heard of it before.

ievans · on July 22, 2020

https://github.com/facebookarchive/pfff where it was named “sgrep”. pfff is maintained by @aryx, who was the original author and is a Facebook alum; see https://github.com/returntocorp/pfff for the official fork.

_wems · on July 23, 2020

Impressive work!

Are there any plans to include C# or F#?

ievans · on July 24, 2020

C# is high on the list; F# isn't a priority at the moment, though. Behind the scenes, we've recently changed to use tree-sitter as the parser library; if there is a good F# tree-sitter grammar, integration becomes quite easy. I don't see one at https://tree-sitter.github.io/tree-sitter/ but perhaps there's one maintained elsewhere.

carlmr · on July 23, 2020

Also C++ would be very nice.

tabbott · on July 22, 2020

We've been using semgrep for Zulip's python codebase for the last few months; here's our configuration:

https://github.com/zulip/zulip/blob/master/tools/semgrep.yml

I really appreciate the semantic checks. They're especially nice for security-sensitive lint rules, but really it removes the hacky regular expressions feel of adding lint rules to a codebase. It's also been useful for some codebase migrations (semgrep is more precise than e.g. `git grep -w` for finding "All the places we use code pattern X that we want to stop doing").
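To make the precision point concrete, here is a minimal sketch using only Python's standard `ast` module. It is not how Semgrep works internally, and the helper names `grep_word_lines` and `ast_call_lines` are made up for illustration. It contrasts a word-boundary text search (roughly what `git grep -w` does) with a search for actual call nodes.

```python
import ast
import re

# A tiny sample file: "system" appears in a comment, a string, and a real call.
SOURCE = "\n".join([
    "import os",
    "# os.system is banned in this codebase",
    'note = "do not call os.system here"',
    'os.system("ls")',
])

def grep_word_lines(source, word):
    """Line numbers where `word` appears as a whole word (text-only search)."""
    pat = re.compile(r"\b" + re.escape(word) + r"\b")
    return [i for i, line in enumerate(source.splitlines(), 1) if pat.search(line)]

def ast_call_lines(source, module, func):
    """Line numbers of calls shaped like `module.func(...)`, found via the AST."""
    hits = []
    for node in ast.walk(ast.parse(source)):
        if (isinstance(node, ast.Call)
                and isinstance(node.func, ast.Attribute)
                and node.func.attr == func
                and isinstance(node.func.value, ast.Name)
                and node.func.value.id == module):
            hits.append(node.lineno)
    return hits

print(grep_word_lines(SOURCE, "system"))      # text search also hits the comment and the string
print(ast_call_lines(SOURCE, "os", "system")) # AST search reports only the real call
```

The text search flags three lines, while the AST search flags only the actual call, which is the kind of difference that matters when hunting down every use of a pattern during a migration.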

My main complaint about it is performance -- it's too slow per rule for us to replace the regular expression-based system that we run on our whole codebase, so we can't happily convert our other ~100 regular expression-based lint rules (https://github.com/zulip/zulip/blob/master/tools/linter_lib/...) to semgrep.

But performance has been improving a lot over time, and I think there's potential for it to be faster (e.g. mypy, the Python type-checker, has gotten much faster in the last year or two). Because semgrep is getting active investment from a venture-funded company that I imagine will improve the performance, I expect semgrep to be a tool that most projects serious about code quality are using in a few years.

I should add that performance may also be less important to others than it is to us; we run all of our linters (currently 20 distinct linters, including eslint, prettier, pyflakes, isort, shellcheck, etc.) in parallel using https://github.com/zulip/zulint, with the goal of being able to lint the entire codebase in <30s or changed files in under 1s (obviously time depends on number of files changed).
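As a rough illustration of the "run every linter in parallel" setup, here is a minimal sketch using only the Python standard library. It is not zulint's actual implementation; `run_linter`, `run_all`, and the example commands are hypothetical.

```python
import subprocess
import sys
from concurrent.futures import ThreadPoolExecutor

def run_linter(name, argv):
    """Run one linter command and return (name, exit code)."""
    proc = subprocess.run(argv, capture_output=True, text=True)
    return name, proc.returncode

def run_all(linters):
    """Run every linter concurrently; `linters` maps a name to an argv list."""
    # Threads are enough here: each worker just waits on a child process.
    with ThreadPoolExecutor(max_workers=max(1, len(linters))) as pool:
        return dict(pool.map(lambda item: run_linter(*item), linters.items()))

if __name__ == "__main__":
    # Stand-in commands; a real setup would list eslint, pyflakes, semgrep, etc.
    statuses = run_all({
        "always-ok": [sys.executable, "-c", "pass"],
        "always-fails": [sys.executable, "-c", "import sys; sys.exit(3)"],
    })
    print(statuses)
```

With this shape, total wall time is bounded by the slowest linter rather than the sum of all of them, which is why one slow tool stands out.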

kevincox · on July 23, 2020

I wonder if this could be improved by extracting fixed strings from the pattern and only actually parsing the files that could possibly match. I think the major issue would be alias support but even that should be possible for most languages as your fixed-string extraction would notice the alias itself.
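A minimal sketch of that prefilter idea, assuming Semgrep-style patterns where metavariables start with `$` and `...` is a wildcard. The helpers `literal_tokens` and `may_match` are hypothetical and not part of Semgrep. Any semantic equivalence the real matcher applies beyond plain aliasing would need extra care, so treat this as an approximation.

```python
import re
from typing import Set

def literal_tokens(pattern: str) -> Set[str]:
    """Identifier-like tokens in a pattern, skipping metavariables such as $CMD."""
    tokens = re.findall(r"\$?[A-Za-z_][A-Za-z0-9_]*", pattern)
    return {tok for tok in tokens if not tok.startswith("$")}

def may_match(pattern: str, source: str) -> bool:
    """Cheap check: only parse a file if it contains every literal token."""
    # Substring search can over-approximate (e.g. "call" inside "callable"),
    # which only costs an unnecessary parse, not a missed match.
    return all(tok in source for tok in literal_tokens(pattern))

PATTERN = "subprocess.call($CMD, shell=True)"

# An aliased import still contains the literal tokens, so the file is kept.
aliased = "import subprocess as sp\nsp.call(cmd, shell=True)\n"
unrelated = "print('hello')\n"

print(may_match(PATTERN, aliased))    # True: worth parsing
print(may_match(PATTERN, unrelated))  # False: can be skipped
```

Because the alias declaration itself mentions the original name, the fixed-string check still keeps the aliased file, which matches the point about alias support above.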

aryx · on July 23, 2020

Great idea! Will do that.

anitil · on July 23, 2020

I had a good chuckle at:

> message: "Do not write a SQL injection vulnerability please"

stephen-bunn · on July 22, 2020

Just went through the examples. Seems really intuitive and looks like it would be a good approach for homegrown linters. Would also love to see some plugin support for editors.

dlukeomalley · on July 22, 2020

Agreed. What editors do you have in mind?

I filed a ticket for VS Code support because I’ve seen it mentioned in a few of the other comments: https://github.com/returntocorp/semgrep/issues/1329

stephen-bunn · on July 23, 2020

VS Code and vim would be the ones I would be most concerned about as I typically jump between the two. Although a pre-commit hook is great and something I will definitely use, having this hook reporting issues in a more live manner would be a huge bonus.

staticassertion · on July 22, 2020

Semgrep's pretty slick. I tried out a demo and I was pretty blown away by how I could essentially just guess my way to a signature.

dorian-graph · on July 22, 2020

I only recently came across Semgrep and then after that, Comby (https://comby.dev/).

Has anyone compared the two? They seem similar (structured find/replace, with registries of rules).

rusbus · on July 23, 2020

Comby seems more like "parenthesis matching + search" (they don't implement a full parser for the language, just some basic required constructs to make a basic AST). I imagine this limits the resolution of the search?

Semgrep uses an AST that's equivalent to the parser of the language itself so it's much higher resolution in terms of what you can match.

dorian-graph · on July 23, 2020

Ah yeah, that is a strong distinction. Comby seems to have a slightly nicer UX, but then, as you've said, it would have a lower matching resolution.

That also explains why Comby supports so many languages so easily, and how easy it is to add your own DSL.

carlmr · on July 23, 2020

Thank you. That's great, it seems like it can't parse a full AST, but works with other languages, like C++.

vmchale · on July 22, 2020

Cool stuff! Seems to hook into tree-sitter?

Love seeing OCaml (or any functional language) :)

ccktlmazeltov · on July 22, 2020

Regexes are such a horrible thing to deal with when you're just trying to parse code quickly and don't want to deal with AST. I've always wished for a library of regexes that just work.

glouwbug · on July 23, 2020

I've always wondered if we could leverage the vast amount of GitHub code - which presumably all compiles without error or undefined behaviour on its master branches - to train some sort of neural net to better catch syntax errors.

Has anyone done something like this, or am I riding the 2016 neural net hype train still?

karlding · on July 23, 2020

This isn't specifically for syntax errors, but Jacob Jackson released TabNine [0] last year, which is an autocompleter trained on files from GitHub [1].

TabNine was acquired by Codota earlier this year [2].

[0] https://www.tabnine.com/

[1] https://www.tabnine.com/blog/deep/

[2] https://techcrunch.com/2020/04/27/codota-picks-up-12m-for-an...

glouwbug · on July 23, 2020

Pretty amazing, and congrats to Jacob Jackson. (I may be a little envious) ;)

estebarb · on July 22, 2020

Nice to see more work in this direction. I used coccinelle a lot for automating changes/bug detection and I immediately missed it when working on anything that is not C.

lsorber · on July 22, 2020

Looks neat. Are you considering a flake8 extension like bandit for easy adoption (in CI and in VS Code)?

skanga · on July 22, 2020

`pip3 install semgrep` fails on Windows 10 with Python 3.7.8 and pip 20.1.1, and the error seems to be an invalid path separator character.

error: can't copy 'XXXXXXXXXXXXXX\Local\Temp\pip-install-cq40rzma\semgrep-files/semgrep-core': doesn't exist or not a regular file

Anyone here know how to fix that?

dlukeomalley · on July 22, 2020

Semgrep should work on Windows Subsystem for Linux (WSL). Mind filing a ticket for myself and the other maintainers to help debug?

https://github.com/returntocorp/semgrep/issues/new?assignees...

skanga · on July 23, 2020