It is indeed Information Retrieval 101 level stuff, which raises the question of why this is the best GitHub can do with all the resources of Microsoft behind it. It's almost useless, at least for C++. It can't tell the difference between foo(int) and foo(double), or this::foo vs. that::foo.
If I wanted the kind of search engine I can get a teenager to write in 16 weeks, why would I expect my org to be paying $$$ for the service?
What a shit take. The article itself is perhaps a nice light overview of 101-ish level concepts, although knowing how and when to apply them in a real engineering context is not something I would consider 101 level. And certainly, building something that is actually at the scale of GitHub Search is nowhere near 101 level.
Have you tried the new search? Thanks to the variable-length ngram indexing mentioned in the post, it can handle all of those cases. Sign up here to try it: https://github.com/features/code-search
Symbol extraction for C and C++ is currently disabled because we were having problems with the performance of the tree-sitter queries we were using, but we are planning to bring that back.
Sorry, it cannot handle any of those cases. You're talking about the ability to find the literal `this::foo`, but that's not how it would normally appear. The definition will normally appear somewhere inside a `namespace this` scope, which cs.github does not grok. And cs.github cannot find the definition that corresponds to a given call site. It doesn't even try.
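To make the distinction concrete, here is a minimal Python sketch of a purely literal ngram search. It is not how cs.github is implemented: it uses fixed trigrams rather than the variable-length ngrams described in the post, and the file names, the helper functions, and the namespaces `alpha` and `beta` (standing in for `this` and `that`, since `this` is a reserved word in C++) are all made up for illustration.

```python
from collections import defaultdict


def trigrams(text):
    """Return the set of all 3-character substrings of text."""
    return {text[i:i + 3] for i in range(len(text) - 2)}


def build_index(docs):
    """Map each trigram to the set of document names that contain it."""
    index = defaultdict(set)
    for name, text in docs.items():
        for gram in trigrams(text):
            index[gram].add(name)
    return index


def search(index, docs, query):
    """Literal substring search: intersect trigram postings, then verify."""
    grams = trigrams(query)
    if grams:
        candidates = set.intersection(*(index.get(g, set()) for g in grams))
    else:
        # Queries shorter than 3 characters cannot use the index at all.
        candidates = set(docs)
    return sorted(name for name in candidates if query in docs[name])


# Hypothetical C++ sources, stored as plain strings.
DOCS = {
    "alpha.cc": "namespace alpha {\nvoid foo(int x) {}\n}\n",
    "beta.cc": "namespace beta {\nvoid foo(double x) {}\n}\n",
    "main.cc": "int main() { alpha::foo(1); beta::foo(2.0); }\n",
}
INDEX = build_index(DOCS)

# The literal text "alpha::foo" only occurs at the call site.
print(search(INDEX, DOCS, "alpha::foo"))
# The definition is only reachable through its literal spelling.
print(search(INDEX, DOCS, "foo(int"))
```

In this toy setup the query `alpha::foo` matches `main.cc` but not `alpha.cc`, because the definition never spells out the qualified name. Closing that gap takes symbol or scope information on top of the text index, which is the part being argued about here.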
Grimoire (I see that name in the HTML and HTTP requests for cs.chromium.org and cs.android.com, so I presume that's what it's called) is really cool, although sadly and obviously not OSS. It completely falls apart when faced with JS, which is increasingly being checked in as part of frontend and Mojo glue code, so from the perspective of an outsider trying to get their feet wet it's a bit creaky. But being able to click around in C++, which I don't really understand at this point, and learn something new almost every time is really cool, and IMO representative of at least one concrete beneficial outcome whenever you do get to this.
I wonder if it would be possible to leverage LSP as a kind of tokenization generalization framework, or even piggyback off of the existing effort by incorporating search-friendly/-helpful metadata into future versions of the protocol spec.
"all the resources of Microsoft" doesn't really say anything about the size of the team involved here. Frankly, it sounds like a pretty small one: clearly GitHub is a successful business with many customers and a significant user base even with very basic code search.
It seems to me that you know what you want from such a service, but focusing on making C++ exceptionally great in this service would come at the cost of, say, general quality across all languages, or frontend usability. A very reasonable tradeoff for a beta-quality product.