Semgrep: Lightweight static analysis for many languages (github.com/returntocorp)
202 points by kiyanwang on July 22, 2020 | 28 comments


I work on Semgrep; there are a bunch of examples at https://semgrep.live if you're curious about what the syntax looks like.

For context, Semgrep started as a Facebook open-source project inspired by an Inria project named Coccinelle, which has made a couple thousand or so automatic patches to the Linux kernel over the years using a semantic patch language (http://coccinelle.lip6.fr/sp.php).


> Semgrep started as a Facebook open-source project

Which project was this? I haven't heard of it before.


https://github.com/facebookarchive/pfff where it was named “sgrep”. pfff is maintained by @aryx who was the original author and is a Facebook alum, see https://github.com/returntocorp/pfff for the official fork


Impressive work!

Are there any plans to include C# or F#?


C# is high on the list; F# isn't a priority at the moment, though. Behind the scenes, we've recently changed to use tree-sitter as the parser library; if there is a good F# tree-sitter library, integration becomes quite easy. I don't see one at https://tree-sitter.github.io/tree-sitter/ but perhaps there's one maintained elsewhere.


Also C++ would be very nice.


We've been using semgrep for Zulip's python codebase for the last few months; here's our configuration:

https://github.com/zulip/zulip/blob/master/tools/semgrep.yml

I really appreciate the semantic checks. They're especially nice for security-sensitive lint rules, but really it removes the hacky regular expressions feel of adding lint rules to a codebase. It's also been useful for some codebase migrations (semgrep is more precise than e.g. `git grep -w` for finding "All the places we use code pattern X that we want to stop doing").

My main complaint about it is performance -- it's too slow per rule for us to replace the regular expression based system that we run on our whole codebase, so we can't happily convert our other ~100 regular expression-based lint rules to semgrep (https://github.com/zulip/zulip/blob/master/tools/linter_lib/...).

But performance has been improving a lot over time, and I think there's potential for it to be faster (e.g. mypy, the Python type-checker, has gotten way, way faster in the last year or two). Because semgrep is getting active investment from a venture-funded company that I imagine will improve the performance, I expect semgrep to be a tool that most projects serious about code quality are using in a few years.

I should add that performance may also be less important to others than it is to us; we run all of our linters (currently 20 distinct linters, including eslint, prettier, pyflakes, isort, shellcheck, etc.) in parallel using https://github.com/zulip/zulint, with the goal of being able to lint the entire codebase in <30s or changed files in under 1s (obviously time depends on number of files changed).
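As a rough illustration of the "run every linter in parallel" setup described above, here is a minimal Python sketch. It is not how zulint is implemented; the `run_linters` helper and the command table passed to it are hypothetical, and it assumes each linter is just an external command whose exit code signals success or failure.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor


def run_linters(commands: dict[str, list[str]]) -> dict[str, int]:
    """Run each linter command concurrently; map linter name to exit code."""

    def run(cmd: list[str]) -> int:
        # Output is captured so concurrent linters do not interleave on the terminal.
        return subprocess.run(cmd, capture_output=True).returncode

    # Threads are enough here: each worker only waits on a child process.
    with ThreadPoolExecutor(max_workers=len(commands) or 1) as pool:
        futures = {name: pool.submit(run, cmd) for name, cmd in commands.items()}
        return {name: fut.result() for name, fut in futures.items()}
```

With this shape, total wall-clock time is roughly that of the slowest linter rather than the sum of all of them, which is why a single slow tool stands out.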


I wonder if this could be improved by extracting fixed strings from the pattern and only actually parsing the files that could possibly match. I think the major issue would be alias support but even that should be possible for most languages as your fixed-string extraction would notice the alias itself.
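A minimal sketch of that prefilter idea in Python, to make it concrete. This is not Semgrep's implementation: `required_literals` and `might_match` are hypothetical helpers, and the sketch assumes that every identifier written literally in a pattern must also appear verbatim somewhere in a matching file (which is what lets an aliased import still pass, since the original name shows up on the import line). Any matching feature that breaks that assumption would need the prefilter to be skipped.

```python
import re
from typing import Iterable

# Pattern syntax that is not literal source text: metavariables such as $CMD.
_METAVAR = re.compile(r"\$[A-Z_][A-Z0-9_]*")
_IDENT = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def required_literals(pattern: str) -> set[str]:
    """Return identifiers that are written literally in the pattern."""
    # Drop metavariables and the "..." ellipsis operator before tokenizing.
    stripped = _METAVAR.sub(" ", pattern).replace("...", " ")
    return set(_IDENT.findall(stripped))


def might_match(source: str, literals: Iterable[str]) -> bool:
    """Cheap prefilter: every required literal occurs somewhere in the text."""
    return all(lit in source for lit in literals)
```

Only files for which `might_match` returns True would be handed to the real parser; the substring check can produce false positives but, under the stated assumption, no false negatives.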


Great idea! Will do that.


I had a good chuckle at:

> message: "Do not write a SQL injection vulnerability please"


Just went through the examples. Seems really intuitive and looks like it would be a good approach for homegrown linters. Would also love to see some plugin support for editors.


Agreed. What editors do you have in mind?

I filed a ticket for VS Code support because I’ve seen it mentioned in a few of the other comments: https://github.com/returntocorp/semgrep/issues/1329


VS Code and vim would be the ones I would be most concerned about as I typically jump between the two. Although a pre-commit hook is great and something I will definitely use, having this hook reporting issues in a more live manner would be a huge bonus.


Semgrep's pretty slick. I tried out a demo and I was pretty blown away by how I could essentially just guess my way to a signature.


I only recently came across Semgrep and then after that, Comby (https://comby.dev/).

Has anyone compared the two? They seem similar (structured find/replace, with registries of rules).


Comby seems more like "parenthesis matching + search" (they don't implement a full parser for the language, just some basic required constructs to make a basic AST). I imagine this limits the resolution of the search?

Semgrep uses an AST that's equivalent to the parser of the language itself so it's much higher resolution in terms of what you can match.
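To illustrate what "higher resolution" buys you, here is a small sketch using Python's standard `ast` module rather than Semgrep or Comby themselves. The `find_calls` helper and the `SAMPLE` snippet are made up for the example; the point is only that a search over the parsed tree ignores comments and string literals that a plain text search also hits.

```python
import ast


def find_calls(source: str, func_name: str) -> list[int]:
    """Return line numbers of calls to a bare function named func_name."""
    tree = ast.parse(source)  # parses only; nothing in the source is executed
    return sorted(
        node.lineno
        for node in ast.walk(tree)
        if isinstance(node, ast.Call)
        and isinstance(node.func, ast.Name)
        and node.func.id == func_name
    )


SAMPLE = '''\
# eval(x) is mentioned in this comment
msg = "do not eval(user_input)"
result = eval(
    expr
)
'''

# Text search: flags the comment, the string literal, and the real call.
text_hits = [i for i, line in enumerate(SAMPLE.splitlines(), 1) if "eval(" in line]
# AST search: flags only the real call, even though it spans several lines.
ast_hits = find_calls(SAMPLE, "eval")
```

Here `text_hits` covers lines 1 through 3, while `ast_hits` contains only line 3, where the actual call starts.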


Ah yeah, that is a strong distinction. Comby seems to have a slightly nicer UX, but then as you've said, it would have a lower matching resolution.

That also explains why Comby supports so many languages so easily, and how easy it is to add your own DSL.


Thank you. That's great, it seems like it can't parse a full AST, but works with other languages, like C++.


Cool stuff! Seems to hook into tree-sitter?

Love seeing OCaml (or any functional language) :)


Regexes are such a horrible thing to deal with when you're just trying to parse code quickly and don't want to deal with AST. I've always wished for a library of regexes that just work.


I've always wondered if we could leverage the vast amount of GitHub code - which presumably all compiles without error or undefined behaviour on its master branches - to train some sort of neural net to better catch syntax errors.

Has anyone done something like this, or am I riding the 2016 neural net hype train still?


This isn't specifically for syntax errors, but Jacob Jackson released TabNine [0] last year, which is an autocompleter trained on files from GitHub [1].

TabNine was acquired by Codota earlier this year [2].

[0] https://www.tabnine.com/

[1] https://www.tabnine.com/blog/deep/

[2] https://techcrunch.com/2020/04/27/codota-picks-up-12m-for-an...


Pretty amazing, and congrats to Jacob Jackson. (I may be a little envious) ;)


Nice to see more work in this direction. I used coccinelle a lot for automating changes/bug detection and I immediately missed it when working on anything that is not C.


Looks neat. Are you considering a flake8 extension like bandit for easy adoption (in CI and in VS Code)?


pip3 install semgrep fails on windows 10 with Python 3.7.8 and pip 20.1.1 and the error seems to be an invalid path separator char.

error: can't copy 'XXXXXXXXXXXXXX\Local\Temp\pip-install-cq40rzma\semgrep-files/semgrep-core': doesn't exist or not a regular file

Anyone here know how to fix that?


Semgrep should work on Windows Subsystem for Linux (WSL). Mind filing a ticket for myself and the other maintainers to help debug?

https://github.com/returntocorp/semgrep/issues/new?assignees...


Done



