GH Archive

ReaLNero · on Aug 29, 2020

I actually used GH Archive to mine GitHub data! Two notes:

- The easiest way to access the data is through Google Cloud Platform -> BigQuery -> githubarchive. Google lets you run SQL queries over 1TB of the data for free, so you can filter or aggregate the data you want, then download the result.

- This is the sad part. GitHub data is notoriously noisy, and not really valuable for data mining. [1] My work was on predicting GitHub collaborator skill using open-source collaboration data. Filtering out bots and people who use GitHub like a version of Google Drive was very difficult.

[1]: https://kblincoe.github.io/publications/2014_MSR_Promises_Pe...
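To make the two notes above concrete, here is a minimal sketch of the kind of query you could run, written as a Python helper that only builds the SQL string. Several things here are assumptions rather than anything stated in the comment: the daily table naming (`githubarchive.day.YYYYMMDD`), the `type` and `actor.login` columns, and the helper names `build_daily_query` and `run_query`. The `[bot]` suffix check only catches accounts whose login ends that way, so it is a crude first pass and not a solution to the noise problem described above.

```python
from datetime import date


def build_daily_query(day: date, event_type: str = "PushEvent", limit: int = 100) -> str:
    """Build a BigQuery standard-SQL query counting events per user for one day.

    Assumes the public dataset exposes one table per day named
    githubarchive.day.YYYYMMDD with `type` and `actor.login` columns.
    """
    # Reject anything that is not a plain event name, since the value is
    # interpolated into the SQL text.
    if not event_type.isalnum():
        raise ValueError("event_type must be alphanumeric, e.g. 'PushEvent'")
    table = f"githubarchive.day.{day:%Y%m%d}"
    return (
        "SELECT actor.login AS login, COUNT(*) AS events\n"
        f"FROM `{table}`\n"
        f"WHERE type = '{event_type}'\n"
        # Crude bot filter: drops only logins that end in "[bot]".
        "  AND NOT ENDS_WITH(actor.login, '[bot]')\n"
        "GROUP BY login\n"
        "ORDER BY events DESC\n"
        f"LIMIT {int(limit)}"
    )


def run_query(sql: str) -> list:
    """Run the query; needs the third-party google-cloud-bigquery package
    and Google Cloud credentials configured in the environment."""
    from google.cloud import bigquery  # imported lazily so the builder works without it

    client = bigquery.Client()
    return [dict(row.items()) for row in client.query(sql).result()]
```

Something like `print(build_daily_query(date(2020, 8, 28)))` shows the SQL, which can also be pasted into the BigQuery console directly. Querying a single day's table instead of a whole year keeps the scanned bytes well under the free allowance.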

lsinger · on Aug 29, 2020

Oh, hello there, co-author! :D

WayToDoor · on Aug 29, 2020

This is from the folks that make Changelog Nightly/Weekly, a newsletter you can subscribe to if you want to see which GitHub repositories were the most starred in the last day/week. Nice work!

brianzelip · on Aug 29, 2020

It's actually from Ilya Grigorik.

Here's the Changelog podcast that featured Ilya Grigorik talking about GH Archive, and that also touched on Changelog's use of the work for their Changelog Nightly feature: https://changelog.com/podcast/144

rcshubhadeep · on Aug 29, 2020

Can you please give the link for subscription?

WayToDoor · on Aug 29, 2020

https://changelog.com/nightly for nightly and https://changelog.com/weekly for weekly.

jlgaddis · on Aug 29, 2020

If you don't want to subscribe, you can access any "nightly version" by going directly to

  http://nightly.changelog.com/YYYY/MM/DD/

For example: http://nightly.changelog.com/2020/08/28/

There's a browsable archive of the weekly version: https://changelog.com/weekly/archive

rsync · on Aug 29, 2020

I archive GitHub repos, for my own purposes, into my rsync.net account:

  ssh user@rsync.net "git clone git://github.com/freebsd/freebsd.git freebsd"

mdaniel · on Aug 30, 2020

If you haven't already considered it, you might want to add -n to avoid the checkout (since the .git is the real value of that operation), and depending on your objective you might also want --recurse-submodules to pull down submodules, in order to get the whole story about what's "in the repo".

brian_herman__ · on Aug 29, 2020

nice feature!

bethecloud · on Aug 30, 2020

There is a cool, similar project that stores a mirror of GitHub on the decentralized cloud (STORJ): https://gitbackup.org/#/

blindm · on Aug 29, 2020

What I don't understand about this:

> GH Archive is a project to record the public GitHub timeline

What does that even mean? So it's basically a massive mirror of GitHub. Isn't that an enormous undertaking?

ReaLNero · on Aug 29, 2020

There are a lot of events on GitHub by users. I don't think they include the object blobs, so the timeline is actually pretty small (I think the dataset was in terabytes).
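To illustrate what "timeline of events" means in practice, here is a small sketch that tallies event types in one hourly archive file. It assumes, beyond what this thread states, that the archives are gzipped files with one JSON event per line, that each event has a `type` field, and that they are served at URLs of the form built by `hourly_archive_url`. Both function names are made up for the example.

```python
import gzip
import io
import json
from collections import Counter


def hourly_archive_url(year: int, month: int, day: int, hour: int) -> str:
    """Build the assumed download URL for one hour of events (hour is 0-23, not zero-padded)."""
    return f"https://data.gharchive.org/{year:04d}-{month:02d}-{day:02d}-{hour}.json.gz"


def count_event_types(gz_bytes: bytes) -> Counter:
    """Count events by their "type" field in a gzipped, newline-delimited JSON payload."""
    counts = Counter()
    with gzip.open(io.BytesIO(gz_bytes), mode="rt", encoding="utf-8") as lines:
        for line in lines:
            if line.strip():  # skip blank lines
                counts[json.loads(line)["type"]] += 1
    return counts
```

The bytes could come from `urllib.request.urlopen(hourly_archive_url(2020, 8, 28, 12)).read()`. Each record is event metadata (who did what to which repository, and when), not repository contents, which fits the point above about the dataset staying comparatively small.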

blindm · on Aug 29, 2020

Oh thanks for clarifying. And here I was thinking this was a capture of all the binary blobs, which would be massive!