the combination of the Import and Call Graphs (https://github.com/github/semantic#language-support) should allow for the construction of a view of what symbols are imported, and where in a given project they are called. resolving those edges across repos is also possible, in many cases.
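to make that concrete, here's a minimal sketch of pulling import and call edges out of a single file, using Python's built-in ast module purely as a stand-in for semantic's own parsers (so the details differ); resolving those edges across files and repos is the harder part that the Import and Call Graphs aim at.

    import ast

    def import_and_call_edges(source: str, module: str):
        """Collect (module, imported name) and (module, called name) edges from
        one file; a real tool would resolve these against other files/repos."""
        tree = ast.parse(source)
        imports, calls = [], []
        for node in ast.walk(tree):
            if isinstance(node, ast.Import):
                imports += [(module, alias.name) for alias in node.names]
            elif isinstance(node, ast.ImportFrom):
                imports += [(module, f"{node.module}.{alias.name}")
                            for alias in node.names]
            elif isinstance(node, ast.Call):
                # only record calls we can name statically: f() or obj.f()
                if isinstance(node.func, ast.Name):
                    calls.append((module, node.func.id))
                elif isinstance(node.func, ast.Attribute):
                    calls.append((module, node.func.attr))
        return imports, calls

    src = "import os\nfrom json import dumps\n\ndef f():\n    return dumps(os.getcwd())\n"
    print(import_and_call_edges(src, "example"))
    # ([('example', 'os'), ('example', 'json.dumps')],
    #  [('example', 'dumps'), ('example', 'getcwd')])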
We've spent years building semantic code intel and search that's available to open-source projects and software teams that aren't inside Google, which has an amazing internal code explorer powered by Kythe (though, unsurprisingly, a lot of our users and customers are ex-Googlers!).
Many thanks to GitHub for open-sourcing this. Our own open-source code analysis stack uses tree-sitter (as this does) in addition to other language analyzers via LSP, and this seems like a great contribution to the open-source code analysis canon!
Can these tools reason about events, async, promises, threads, queues, and all these kinds of things? I think a plain graph representation captures only a minuscule fraction of a program's semantics. O(n) complexity analysis, where the "O" stands for Order, wouldn't work here either. You'd have to somehow map those asynchronous pieces of code onto separate sequential spaces, and only then recognize similarities or do whatever analysis is needed.
I've been thinking that static analysis alone is not enough; gathering profiling info and data types from the runtime would help a lot, and some IDEs have been doing this already.
Could you shed some light on what sort of internal representation those analysis tools use? Is it something like a lambda-calculus abstraction?
The internal representation depends on the tool and application. Here are a few examples:
* For deeper analysis, we often use the same internal data structures that the compiler itself uses. This includes data structures that capture AST and type information. For some languages, the compiler doesn't capture enough information (e.g., dynamically typed languages with no static types). In those cases, the data structures are often enhanced versions of, or similar-but-not-quite-the-same as, what the interpreter/compiler uses.
* In "shallower" analysis, you can use a language-agnostic parser to generate a generic AST. There's various techniques for doing so in a language-independent fashion, depending on what tradeoffs you want to make with speed, accuracy, and generality. The tree-sitter library is one such example. You can also use tools like CTAGS + textual heuristics in the extremely shallow case.
* Then, there's the Kythe approach, which constructs a taxonomy of source code graph nodes and edges that aims to capture every possible relationship in every language. Each language requires an integration to convert from the source code in that language to the Kythe schema; that integration can use either the "shallow" or "deep" approaches described above, and the right answer depends again on your tradeoff of speed vs. accuracy.
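To make that last bullet slightly more concrete, here's a toy sketch of emitting (source, edge kind, target) facts in a Kythe-flavored shape. The edge kinds loosely mirror a few of Kythe's, but the node naming is a simplified stand-in for Kythe's VNames, and none of this is the real schema or fact format.

    from dataclasses import dataclass

    # Simplified stand-in for a Kythe VName (the real one carries
    # signature/corpus/root/path/language); names here are illustrative only.
    @dataclass(frozen=True)
    class VName:
        path: str
        signature: str

    def facts_for_function(path: str, name: str, callee: str):
        """Emit (source, edge_kind, target) triples for one function definition
        and one call it makes."""
        anchor = VName(path, f"anchor:{name}")
        func = VName(path, f"func:{name}")
        target = VName(path, f"func:{callee}")
        return [
            (anchor, "defines/binding", func),       # the text range defining the function
            (func, "childof", VName(path, "file")),  # the function belongs to its file
            (VName(path, f"anchor:{callee}@call"), "ref/call", target),  # a call site
        ]

    for src, kind, dst in facts_for_function("pkg/util.py", "get_name", "format"):
        print(f"{src.signature} --{kind}--> {dst.signature}")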
As for combining dynamic information like profiling and runtime info, this is something we've been thinking about for a while. We think there's plenty of low-hanging fruit in combining static information (types, AST, source code) with runtime info (stack traces, profiling info). We have a couple of efforts in motion toward this end. One is the Sourcegraph extension API (https://sourcegraph.com/extensions), which allows injection of third-party info, often from runtime tools, into wherever you view code, whether that's in your editor, on GitHub, or on Sourcegraph. Another is allowing a user to plug in stack-trace data to heuristically select which jump-to-def/find-references options are most likely to be useful given what you're currently debugging.
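As a toy illustration of that last idea (the data shapes and scoring here are hypothetical, not the actual Sourcegraph feature): given candidate references and the frames of a stack trace you're currently looking at, you can boost the candidates that show up in the trace.

    # Hypothetical sketch: rank find-references candidates by whether their
    # file or symbol appears in a stack trace the user is currently debugging.
    def rank_candidates(candidates, stack_frames):
        """candidates: dicts like {"file": ..., "symbol": ...}
        stack_frames: dicts like {"file": ..., "function": ...}"""
        hot_files = {f["file"] for f in stack_frames}
        hot_funcs = {f["function"] for f in stack_frames}

        def score(c):
            s = 0
            if c["file"] in hot_files:
                s += 1
            if c["symbol"] in hot_funcs:
                s += 2  # an exact symbol match in the trace is a stronger signal
            return s

        return sorted(candidates, key=score, reverse=True)

    candidates = [
        {"file": "billing/invoice.py", "symbol": "get_name"},
        {"file": "users/profile.py", "symbol": "get_name"},
    ]
    frames = [{"file": "users/profile.py", "function": "get_name"}]
    print(rank_candidates(candidates, frames)[0])  # the users/profile.py hit ranks first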
I'd love to hear what you're thinking of—care to share your ideas in this space?
I've thought about making a tool for language-agnostic verification and/or refactoring advice. It would be a modular system, with a design looking something like this:
    AST
      -[basis]-> Normalized AST
      -[to sequence]-> Common Representation [graph; embeddings]
      -> [Code Structure DB; Temporal DB (reflects what the code is actually doing; runtime info involved)]
      <-[query for anomalies]
The idea is simple: we train some database by feeding it lots of source code. We may also want to define some rules, like what a deadlock is, what an event is, etc. I have no clear idea how to implement the rules, but it should be possible to build embeddings for the temporal data and map them onto a space where checking for correctness can be done in linear time, fingers crossed. But here comes the question: what should the temporal DB be? Should it be a separate data definition language, or is the code itself the data?
As for the basis, we would either use some generic calculus and a subset of it for each particular programming language, or somehow make it possible to aggregate data with different bases into the same database.
Obviously these approaches call for machine learning instead of traversing the graphs. I think there's a good reason for that: for example, if I've written many chunks of code that actually do the same thing, it would take enormous time to prove that with a combinatorial approach (graph self-similarity boils down to subgraph isomorphism, which is NP-complete).
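A much cheaper (and much weaker) stand-in for that idea, just to show the linear-time flavor: normalize identifiers and literals out of the AST and hash the result, so differently-named but structurally identical chunks collide. This is ordinary Type-2 clone detection, not a proof of semantic equivalence, and the sketch uses Python's stdlib ast purely for illustration.

    import ast, hashlib

    def structural_fingerprint(source: str) -> str:
        """Hash a chunk of code with identifiers and literal values normalized
        away, so renamed-but-structurally-identical code gets the same digest."""
        tree = ast.parse(source)
        for node in ast.walk(tree):
            if isinstance(node, ast.Name):
                node.id = "_"
            elif isinstance(node, ast.arg):
                node.arg = "_"
            elif isinstance(node, ast.FunctionDef):
                node.name = "_"
            elif isinstance(node, ast.Constant):
                node.value = 0
        return hashlib.sha256(ast.dump(tree).encode()).hexdigest()[:12]

    a = "def total(xs):\n    s = 0\n    for x in xs:\n        s += x\n    return s\n"
    b = "def add_up(vals):\n    acc = 0\n    for v in vals:\n        acc += v\n    return acc\n"
    print(structural_fingerprint(a) == structural_fingerprint(b))  # True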
This could be the basis for better indexing and search.
If you're working with a monorepo with hundreds of millions of lines of Java and such, and you're looking for "all the places where getName() gets called for <some interface>", today you might as well give up.
Yes... and no. The catch is those hundreds of millions of lines of code: you're not going to load the entire monorepo into the IDE -- it's not practical. But having an OpenGrok (or whatever) index of the monorepo is reasonable.
Something related I've been thinking about lately is combining something like this with plagiarism detection tools to weed out code duplication. I have no idea yet quite how, but it would be interesting to explore.
My goal is slightly different, though -- I'm more interested in semantic similarity than source similarity. I saw one article that suggested running the detection algorithm on the decompiled output of the compiler to get better results, which seems like a good idea as well.
very doable, at least for certain types of clones, and a topic of active research.
while leveraging the ASTs and scope graphs produced by semantic can allow you to attack the more complicated clone types (eg, code that has nearly the same meaning with a significantly different implementation), various parsing + hashing methods have proven useful for the simpler cases.
useful for far more than detecting plagiarism, too. it can boost signal for search, allow for more nuanced semantic navigation, assist in refactoring, and help understand the propagation/provenance of code (which can be important for understanding licenses, etc).
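for those simpler cases, here's a minimal sketch of the hashing end of the spectrum: winnowing-style fingerprints over normalized tokens, roughly the technique behind MOSS-style plagiarism detectors (the tokenizer here is deliberately naive):

    import hashlib, re

    def fingerprints(source: str, k: int = 5, window: int = 4):
        """Winnowing-style fingerprints: normalize identifiers/numbers, hash every
        k-gram of tokens, then keep the minimum hash in each sliding window.
        Fingerprints shared between two files suggest (near-)duplicated code."""
        tokens = re.findall(r"[A-Za-z_]\w*|\d+|[^\s\w]", source)
        # crude normalization so renamed identifiers and changed constants still match
        tokens = ["ID" if re.match(r"[A-Za-z_]", t) else
                  "NUM" if t.isdigit() else t for t in tokens]
        grams = [" ".join(tokens[i:i + k]) for i in range(len(tokens) - k + 1)]
        hashes = [int(hashlib.md5(g.encode()).hexdigest(), 16) for g in grams]
        return {min(hashes[i:i + window]) for i in range(len(hashes) - window + 1)}

    a = "def area(w, h):\n    return w * h"
    b = "def size(x, y):\n    return x * y"
    print(len(fingerprints(a) & fingerprints(b)), "shared fingerprints")  # > 0 for these clones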
Trivia: UN and EU proceedings are the perfect translation dataset as they are already translated paragraph-by-paragraph into all their official languages. However, they are all highly formal and have a legalistic vocabulary that won't be useful for most translation users.
http://www.program-transformation.org/Transform/CodeCrawler
(And MOOSE)