the combination of the Import and Call Graphs (https://github.com/github/semantic#language-support) should allow for the construction of a view of what symbols are imported, and where in a given project they are called. resolving those edges across repos is also possible, in many cases.
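to make that concrete, here's a minimal sketch of pulling import and call edges out of a single file, using Python's built-in ast module purely as a stand-in for semantic's own parsers (so the details differ); resolving those edges across files and repos is the harder part that the Import and Call Graphs aim at.

    import ast

    def import_and_call_edges(source: str, module: str):
        """Collect (module, imported name) and (module, called name) edges from
        one file; a real tool would resolve these against other files/repos."""
        tree = ast.parse(source)
        imports, calls = [], []
        for node in ast.walk(tree):
            if isinstance(node, ast.Import):
                imports += [(module, alias.name) for alias in node.names]
            elif isinstance(node, ast.ImportFrom):
                imports += [(module, f"{node.module}.{alias.name}")
                            for alias in node.names]
            elif isinstance(node, ast.Call):
                # only record calls we can name statically: f() or obj.f()
                if isinstance(node.func, ast.Name):
                    calls.append((module, node.func.id))
                elif isinstance(node.func, ast.Attribute):
                    calls.append((module, node.func.attr))
        return imports, calls

    src = "import os\nfrom json import dumps\n\ndef f():\n    return dumps(os.getcwd())\n"
    print(import_and_call_edges(src, "example"))
    # ([('example', 'os'), ('example', 'json.dumps')],
    #  [('example', 'dumps'), ('example', 'getcwd')])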
We've spent years building semantic code intel and search that's available to open-source projects and software teams that aren't inside Google, which has an amazing internal code explorer powered by Kythe (though, unsurprisingly, a lot of our users and customers are ex-Googlers!).
Many thanks to GitHub for open-sourcing this. Our own open-source code analysis stack uses tree-sitter (as this does) in addition to other language analyzers via LSP, and this seems like a great contribution to the open-source code analysis canon!
Can these tools reason about events, async, promises, threads, queues, and all these kinds of things? I think a plain graph representation captures only a minuscule fraction of a program's semantics. O(n) complexity analysis, where the "O" stands for Order, wouldn't work here either. You'd have to somehow map those asynchronous pieces of code onto separate sequential spaces, and only then recognize similarities or do whatever analysis is needed.
I've been thinking that static analysis alone is not enough; gathering profiling info and data types from the runtime would help a lot, and some IDEs have been doing this already.
Could you shed some light on what sort of internal representation those analysis tools use? Is it something like a lambda-calculus abstraction?
The internal representation depends on the tool and application. Here are a few examples:
* For deeper analysis, we often use the same internal data structures that the compiler itself uses. This includes data structures that capture AST and type information. For some languages, the compiler doesn't capture enough information (e.g., dynamically typed languages with no static types). In those cases, the data structures are often enhanced versions of, or similar-but-not-quite-the-same as, what the interpreter/compiler uses.
* In "shallower" analysis, you can use a language-agnostic parser to generate a generic AST. There's various techniques for doing so in a language-independent fashion, depending on what tradeoffs you want to make with speed, accuracy, and generality. The tree-sitter library is one such example. You can also use tools like CTAGS + textual heuristics in the extremely shallow case.
* Then, there's the Kythe approach, which constructs a taxonomy of source code graph nodes and edges that aims to capture every possible relationship in every language. Each language requires an integration to convert from the source code in that language to the Kythe schema; that integration can use either the "shallow" or "deep" approaches described above, and the right answer depends again on your tradeoff of speed vs. accuracy.
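To make that last bullet slightly more concrete, here's a toy sketch of emitting (source, edge kind, target) facts in a Kythe-flavored shape. The edge kinds loosely mirror a few of Kythe's, but the node naming is a simplified stand-in for Kythe's VNames, and none of this is the real schema or fact format.

    from dataclasses import dataclass

    # Simplified stand-in for a Kythe VName (the real one carries
    # signature/corpus/root/path/language); names here are illustrative only.
    @dataclass(frozen=True)
    class VName:
        path: str
        signature: str

    def facts_for_function(path: str, name: str, callee: str):
        """Emit (source, edge_kind, target) triples for one function definition
        and one call it makes."""
        anchor = VName(path, f"anchor:{name}")
        func = VName(path, f"func:{name}")
        target = VName(path, f"func:{callee}")
        return [
            (anchor, "defines/binding", func),       # the text range defining the function
            (func, "childof", VName(path, "file")),  # the function belongs to its file
            (VName(path, f"anchor:{callee}@call"), "ref/call", target),  # a call site
        ]

    for src, kind, dst in facts_for_function("pkg/util.py", "get_name", "format"):
        print(f"{src.signature} --{kind}--> {dst.signature}")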
As for combining dynamic information like profiling and runtime info, this is something we've been thinking about for a while. We think there's plenty of low-hanging fruit in combining static information (types, AST, source code) with runtime info (stack traces, profiling info). We have a couple of efforts in motion toward this end. One is the Sourcegraph extension API (https://sourcegraph.com/extensions), which allows injection of third-party info, often from runtime tools, into wherever you view code, whether that's in your editor, on GitHub, or on Sourcegraph. Another is allowing a user to plug in stack-trace data to heuristically select which jump-to-def/find-references options are most likely to be useful given what you're currently debugging.
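As a toy illustration of that last idea (the data shapes and scoring here are hypothetical, not the actual Sourcegraph feature): given candidate references and the frames of a stack trace you're currently looking at, you can boost the candidates that show up in the trace.

    # Hypothetical sketch: rank find-references candidates by whether their
    # file or symbol appears in a stack trace the user is currently debugging.
    def rank_candidates(candidates, stack_frames):
        """candidates: dicts like {"file": ..., "symbol": ...}
        stack_frames: dicts like {"file": ..., "function": ...}"""
        hot_files = {f["file"] for f in stack_frames}
        hot_funcs = {f["function"] for f in stack_frames}

        def score(c):
            s = 0
            if c["file"] in hot_files:
                s += 1
            if c["symbol"] in hot_funcs:
                s += 2  # an exact symbol match in the trace is a stronger signal
            return s

        return sorted(candidates, key=score, reverse=True)

    candidates = [
        {"file": "billing/invoice.py", "symbol": "get_name"},
        {"file": "users/profile.py", "symbol": "get_name"},
    ]
    frames = [{"file": "users/profile.py", "function": "get_name"}]
    print(rank_candidates(candidates, frames)[0])  # the users/profile.py hit ranks first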
I'd love to hear what you're thinking of—care to share your ideas in this space?
I've thought about making a tool for language-agnostic verification and/or refactoring advice. It would be a modular system, with a design looking something like this:
    AST
      -[basis]-> Normalized AST
      -[to sequence]-> Common Representation [graph; embeddings]
      -> [Code Structure DB; Temporal DB (reflects what the code is actually doing; runtime info involved)]
      <-[query for anomalies]
The idea is simple: we train some database by feeding it lots of source code. We may also want to define some rules, like what a deadlock is, what an event is, etc. I have no clear idea how to implement the rules, but it should be possible to build embeddings for the temporal data and map them onto a space where checking for correctness can be done in linear time, fingers crossed. But here comes the question: what should the temporal DB be? Should it be a separate data definition language, or is the code itself the data?
As for the basis, we would either use some generic calculus and a subset of it for each particular programming language, or somehow make it possible to aggregate data with different bases into the same database.
Obviously these approaches call for machine learning instead of traversing the graphs. I think there's a good reason for that: for example, if I've written many chunks of code that actually do the same thing, it would take enormous time to prove that with a combinatorial approach (graph self-similarity boils down to subgraph isomorphism, which is NP-complete).
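A much cheaper (and much weaker) stand-in for that idea, just to show the linear-time flavor: normalize identifiers and literals out of the AST and hash the result, so differently-named but structurally identical chunks collide. This is ordinary Type-2 clone detection, not a proof of semantic equivalence, and the sketch uses Python's stdlib ast purely for illustration.

    import ast, hashlib

    def structural_fingerprint(source: str) -> str:
        """Hash a chunk of code with identifiers and literal values normalized
        away, so renamed-but-structurally-identical code gets the same digest."""
        tree = ast.parse(source)
        for node in ast.walk(tree):
            if isinstance(node, ast.Name):
                node.id = "_"
            elif isinstance(node, ast.arg):
                node.arg = "_"
            elif isinstance(node, ast.FunctionDef):
                node.name = "_"
            elif isinstance(node, ast.Constant):
                node.value = 0
        return hashlib.sha256(ast.dump(tree).encode()).hexdigest()[:12]

    a = "def total(xs):\n    s = 0\n    for x in xs:\n        s += x\n    return s\n"
    b = "def add_up(vals):\n    acc = 0\n    for v in vals:\n        acc += v\n    return acc\n"
    print(structural_fingerprint(a) == structural_fingerprint(b))  # True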
This could be the basis for better indexing and search.
If you're working with a monorepo with hundreds of millions of lines of Java and such, and you're looking for "all the places where getName() gets called for <some interface>", today you might as well give up.
Yes... and no. The catch is those hundreds of millions of lines of code: you're not going to load the entire monorepo into the IDE -- it's not practical. But having an OpenGrok (or whatever) index of the monorepo is reasonable.
Something related I've been thinking about lately is combining something like this with plagiarism detection tools to weed out code duplication. I have no idea yet quite how, but it would be interesting to explore.
My goal is slightly different, though -- I'm more interested in semantic similarity than source similarity. I saw one article that suggested running the detection algorithm on the decompiled output of the compiler to get better results, which seems like a good idea as well.
very doable, at least for certain types of clones, and a topic of active research.
while leveraging the ASTs and scope graphs produced by semantic can allow you to attack the more complicated clone types (eg, code that has nearly the same meaning with a significantly different implementation), various parsing + hashing methods have proven useful for the simpler cases.
useful for far more than detecting plagiarism, too. it can boost signal for search, allow for more nuanced semantic navigation, assist in refactoring, and help understand the propagation/provenance of code (which can be important for understanding licenses, etc).
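for those simpler cases, here's a minimal sketch of the hashing end of the spectrum: winnowing-style fingerprints over normalized tokens, roughly the technique behind MOSS-style plagiarism detectors (the tokenizer here is deliberately naive):

    import hashlib, re

    def fingerprints(source: str, k: int = 5, window: int = 4):
        """Winnowing-style fingerprints: normalize identifiers/numbers, hash every
        k-gram of tokens, then keep the minimum hash in each sliding window.
        Fingerprints shared between two files suggest (near-)duplicated code."""
        tokens = re.findall(r"[A-Za-z_]\w*|\d+|[^\s\w]", source)
        # crude normalization so renamed identifiers and changed constants still match
        tokens = ["ID" if re.match(r"[A-Za-z_]", t) else
                  "NUM" if t.isdigit() else t for t in tokens]
        grams = [" ".join(tokens[i:i + k]) for i in range(len(tokens) - k + 1)]
        hashes = [int(hashlib.md5(g.encode()).hexdigest(), 16) for g in grams]
        return {min(hashes[i:i + window]) for i in range(len(hashes) - window + 1)}

    a = "def area(w, h):\n    return w * h"
    b = "def size(x, y):\n    return x * y"
    print(len(fingerprints(a) & fingerprints(b)), "shared fingerprints")  # > 0 for these clones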
Trivia: UN and EU proceedings are the perfect translation dataset as they are already translated paragraph-by-paragraph into all their official languages. However, they are all highly formal and have a legalistic vocabulary that won't be useful for most translation users.
http://www.program-transformation.org/Transform/CodeCrawler
(And MOOSE)