Curious what the consensus is on how GH should have approached this to avoid such blowback.
Best case scenario: they explain in advance on the GH blog that they're going to do some work on ML and coding, and ask people to opt in to having their code read, via a profile flag or a file in the repo that grants permission, like robots.txt. Second best: the same, but opt-out instead of opt-in. Least ideal: do neither of the first two, but when announcing it, explain in detail how the model was trained, what data was used, why, and when. That kind of thing?
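For the robots.txt-style idea, a hypothetical permission file might look something like this (the filename and directives are entirely invented for illustration; no such standard exists):

```text
# .ml-training-policy (hypothetical filename, analogous to robots.txt)
# Placed at the repo root to signal what ML crawlers may use.
User-agent: copilot-trainer
Allow: /src/        # fine to train on
Disallow: /         # everything else is off limits
```

As with robots.txt, this would rely entirely on the crawler choosing to honor it.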
Code (co)created with Copilot has to follow all the licenses of the source (heh) code. At the very least, that generally means automatically including, in any project that gets help from Copilot, a copy of all the licenses involved, plus attribution for all the people whose code Copilot was trained on.
(Not sure about the cases where there is no license and therefore normal copyright applies, but AFAIK this isn't the case for any code on GitHub, which automatically gets an open source license?
EDIT: Code in public repositories seems to be "forkable" on GitHub itself but not copyable (to elsewhere). That's some nasty walled-garden stuff right there; I wonder how legal that ToS is? I could see how this could lead them to incentivize people to stop using other licenses on GitHub, so they don't have to deal with this license mess... EEE yet again?)
So I guess then, the first thing they should have done, is trained it to understand licenses, and used that as a first principle for how they built the system?
Seems like too much effort (is it even possible to link the source code to the end result?), and it might not be admissible, so just include a database with all of the relevant licenses and authors?
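The "database of licenses and authors" idea could be sketched like this. Everything here (the data, the function name, the license-to-author mapping) is invented for illustration; it just shows the shape of bundling owed attributions with a suggestion, not anything Copilot actually does.

```python
# Hypothetical: a database mapping trained-on licenses to their authors,
# so a generated snippet could ship the attributions it owes.
LICENSE_DB = {
    "MIT": ["alice", "bob"],
    "Apache-2.0": ["carol"],
}

def attribution_bundle(licenses_used):
    """Return, per license, the sorted list of authors a snippet would owe."""
    return {name: sorted(LICENSE_DB.get(name, [])) for name in licenses_used}

print(attribution_bundle(["MIT", "Apache-2.0"]))
```

The hard part the thread points at remains untouched by this sketch: populating `licenses_used` for a given suggestion, i.e. linking model output back to its training sources.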
Not really, consider for example repositories mirrored to Github.
It seems unclear who even has the right to grant this permission anyway (with free software licenses). Probably the copyright holder? Who that is might also be complicated.
In that hypothetical I wouldn’t think GitHub is responsible for determining if a repository is mirrored and what the implications of that are. They just need to look at what license is on the repo in GitHub.
Good point. I would have thought GH requires you to agree in some ToS that you have permission to put the code on GH (but I don't know)? If so, could that point be put aside? (I'm not a software engineer, so sorry if that made no sense. Super curious about the whole Copilot thing from a business and community perspective.)
This is the complicated bit: All open-source licenses grant you permission to redistribute the code (usually with stipulations like having to include the license), so you are almost always allowed to upload the code to Github.
What it doesn't mean however is that you're the copyright holder of that code, you're merely redistributing work that somebody else has ownership of.
So who gets to decide what Github is allowed to do with it?
I expect this will end up in courts and we won't get a definite answer before that.
If you'll entertain a hypothetical for a moment: suppose the many intelligent folks over at GH knew this would eventually end up in the courts, and expected that from the start. Would you suggest they messaged/rolled it out any differently? Did they do exactly what they needed to do so that it would end up in the courts? Should they have done anything differently to not piss folks off so much? Sorry for the million questions; you seem to know/have thought a bit about this. Thanks! :)
They should have only used code from projects with a license that allows commercial use, or made their model openly available and/or free to use.
> Best case scenario: they explain in advance on the GH blog that they're going to do some work on ML and coding, and ask people to opt in to having their code read, via a profile flag or a file in the repo that grants permission, like robots.txt. Second best: the same, but opt-out instead of opt-in. Least ideal: do neither of the first two, but when announcing it, explain in detail how the model was trained, what data was used, why, and when. That kind of thing?
Is that generally about right, or..?