Hacker News

The cynic in me says it's two main things:

1. It effectively serves as a walled garden for Copilot, making sure that competitors can't use their data (your code) in coding assistants, knowledge discovery, etc.

2. It ensures that competitors who use GitHub are maximally mineable, so GitHub can easily discover and implement their solutions and eat their lunch (much like OpenAI's strategy here)

This is covered up by the excuse of performance, but it's fundamentally no different from the login walls we saw go up around Twitter and Reddit

... except this time it's done by a big industry player that LARPs as part of the open-source community but is positioned to cannibalize it from within.



Shouldn't competitors just clone the repository locally and train their models on it, instead of relying on the search API, which probably costs more compute?


You could, yes, and many probably are, but you then have to git pull all of those repos if you want to know, say, which LLM libraries are currently trending, or how quickly PopularLib v0.2 is being adopted in codebases related to Y, etc.
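To make the freshness point concrete, here's a minimal sketch of what keeping such a dataset current looks like, assuming a hypothetical `mirrors/` directory of already-cloned repos (the layout and names are invented for illustration):

```python
import subprocess
from pathlib import Path

# Hypothetical layout: one directory containing local clones, e.g. mirrors/<org>__<repo>
MIRROR_ROOT = Path("mirrors")


def pull_command(repo_dir: Path) -> list[str]:
    """Build the git invocation that refreshes one mirror (--ff-only keeps it clean)."""
    return ["git", "-C", str(repo_dir), "pull", "--ff-only"]


def refresh_all(root: Path = MIRROR_ROOT) -> None:
    """Run `git pull` across every cloned repo under root to keep the dataset current."""
    for repo in sorted(root.iterdir()):
        if (repo / ".git").exists():
            subprocess.run(pull_command(repo), check=False)
```

Even this toy version shows the cost asymmetry: the platform sees every push instantly, while an outside miner has to re-pull thousands of clones on a schedule just to stay approximately fresh.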

IMO it's much less about the legacy code (terabyte-scale datasets covering much of GitHub already exist) and MUCH more about how up to date your LLM/AI is with new repos, "best practices" (or most common practices), etc.

Plus you often get "LLM code poisoning" from older training data, as the model tries to use functionality that has undergone breaking changes since the current stable release. Current is king.
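A toy illustration of that failure mode, using an entirely hypothetical library whose v0.2 renamed its entry point (all names here are invented for the example):

```python
# Hypothetical stand-in for a library whose v0.2 renamed `connect` to `open_session`.
class PopularLibV02:
    """Current stable release: `open_session` replaced the old `connect`."""

    def open_session(self, url: str) -> str:
        return f"session:{url}"


client = PopularLibV02()
client.open_session("https://example.test")  # works against the current release

try:
    client.connect("https://example.test")   # what stale training data would suggest
except AttributeError:
    print("stale API call: `connect` no longer exists")
```

A model trained only on pre-v0.2 code confidently emits the old call, and it fails at runtime; a model with fresh data emits the current one.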

Also, there's the whole goldmine of GitHub discussions, issues, etc. that a bare cloned repo just... doesn't have.

Right now you can still index those fairly easily (though IIRC they sometimes ban datacenter IPs), but they may also fall victim to the login wall.
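For now that data is still reachable through the public REST API; a minimal sketch of listing a repo's issues (unauthenticated, so the rate limit is tight; the endpoint is GitHub's documented `GET /repos/{owner}/{repo}/issues`):

```python
import json
import urllib.request


def issues_url(owner: str, repo: str, page: int = 1) -> str:
    """GitHub REST endpoint listing a repo's issues (note: also includes PRs)."""
    return (
        f"https://api.github.com/repos/{owner}/{repo}/issues"
        f"?state=all&per_page=100&page={page}"
    )


def fetch_issues(owner: str, repo: str, page: int = 1) -> list:
    """Fetch one page of issues; unauthenticated requests are heavily rate-limited."""
    req = urllib.request.Request(
        issues_url(owner, repo, page),
        headers={"Accept": "application/vnd.github+json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)
```

If that surface ever moves behind a login wall, this kind of scraper is exactly what stops working, which is the commenter's point.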




