I actually used GH Archive to mine GitHub data! Two notes:
- The easiest way to access the data is through Google Cloud Platform -> BigQuery -> githubarchive. Google lets you run SQL queries over up to 1 TB of data per month for free, so you can filter or aggregate down to just the data you want, then download it (see the sketch after these notes).
- This is the sad part. GitHub data is notoriously noisy and not all that valuable for data mining. [1] My work was on predicting GitHub collaborator skill from open-source collaboration data, and filtering out bots and people who use GitHub like a personal Google Drive was very difficult.
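For concreteness, here's a minimal sketch of that BigQuery workflow using the google-cloud-bigquery Python client. The daily-table naming (githubarchive.day.YYYYMMDD) is the real convention, but the specific date, the event-type aggregation, and the crude "[bot]" login filter are illustrative assumptions, and nowhere near enough to solve the noise problem above:

    from google.cloud import bigquery  # pip install google-cloud-bigquery

    # Assumes GCP credentials and a billing project are already set up,
    # e.g. via `gcloud auth application-default login`.
    client = bigquery.Client()

    # Daily GH Archive tables follow githubarchive.day.YYYYMMDD.
    # The NOT LIKE clause is a crude first pass at excluding bots,
    # whose logins often end in "[bot]"; real filtering is much harder.
    sql = """
        SELECT type, COUNT(*) AS n
        FROM `githubarchive.day.20200101`
        WHERE actor.login NOT LIKE '%[bot]'
        GROUP BY type
        ORDER BY n DESC
    """

    for row in client.query(sql).result():
        print(row.type, row.n)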
This is from the folks behind Changelog Nightly/Weekly, a newsletter you can subscribe to that shows which GitHub repositories were the most starred in the last day/week. Nice work!
Here's the Changelog podcast episode that featured Ilya Grigorik talking about GH Archive; it also touched on how Changelog uses the data for their Changelog Nightly feature: https://changelog.com/podcast/144
If you haven't already considered it, you might want to add -n (--no-checkout) to skip the working-tree checkout, since the .git directory is the real value of that operation. Depending on your objective, you might also want --recurse-submodules to pull down submodules, in order to get the whole story about what's "in the repo".
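If it helps, here's a small Python sketch of that clone recipe (the repo URL and target directories are placeholders). Note that git ignores --recurse-submodules when --no-checkout is given, since submodules can't be cloned without a checkout, so the two flags serve different objectives:

    import subprocess

    # Placeholder URL, purely for illustration.
    repo = "https://github.com/example/project.git"

    # -n / --no-checkout: skip populating the working tree; the .git
    # directory (the full history) is the real value of the clone.
    subprocess.run(
        ["git", "clone", "--no-checkout", repo, "project-history"],
        check=True,
    )

    # If instead you want the whole story about what's "in the repo",
    # submodules included, recurse into them (git skips submodules
    # when there is no checkout, so pick one flag or the other).
    subprocess.run(
        ["git", "clone", "--recurse-submodules", repo, "project-full"],
        check=True,
    )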
GitHub users generate a lot of events, but I don't think the archive includes the object blobs, so the timeline is actually pretty small (I think the dataset was in the terabytes).
[1]: https://kblincoe.github.io/publications/2014_MSR_Promises_Pe...