GH Archive (gharchive.org)
90 points by bluu00 on Aug 29, 2020 | 14 comments


I actually used GH Archive to mine GitHub data! Two notes:

- The easiest way to access the data is through Google Cloud Platform -> BigQuery -> githubarchive. Google lets you run SQL queries over 1 TB of data each month for free, so you can filter or aggregate the data you want, then download it.

- This is the sad part. GitHub data is notoriously noisy, and not really valuable for data mining. [1] My work was on predicting GitHub collaborator skill using open-source collaboration data. Filtering out bots and people who use GitHub like a version of Google Drive was very difficult.

[1]: https://kblincoe.github.io/publications/2014_MSR_Promises_Pe...
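To make the "filter or aggregate, then download" step concrete, here is a rough sketch that only builds the SQL text; actually running it would need a BigQuery client and credentials, which are not shown. It assumes the public dataset exposes per-day tables named githubarchive.day.YYYYMMDD with `type` and `actor.login` columns, and it treats logins ending in "[bot]" as bots. That suffix check is just a crude first pass, nowhere near a real noise filter.

```python
from datetime import date


def build_event_count_query(day: date) -> str:
    """Return BigQuery standard SQL counting one day's events by type.

    Assumptions (not stated in the comment above): per-day tables are named
    githubarchive.day.YYYYMMDD, rows carry `type` and `actor.login`, and
    bot accounts can be spotted by a "[bot]" suffix on the login.
    """
    table = f"githubarchive.day.{day:%Y%m%d}"
    return (
        "SELECT type, COUNT(*) AS events\n"
        f"FROM `{table}`\n"
        "WHERE actor.login NOT LIKE '%[bot]'\n"  # crude bot heuristic
        "GROUP BY type\n"
        "ORDER BY events DESC"
    )


# Print the query so it can be pasted into the BigQuery console.
print(build_event_count_query(date(2020, 8, 28)))
```

Aggregating inside the query keeps the downloaded result small, which matters when the free allowance is measured in bytes scanned.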


Oh, hello there, co-author! :D


This is from the folks that make Changelog Nightly/Weekly, a newsletter you can subscribe to in order to see which GitHub repositories were the most starred in the last day/week. Nice work!


It's actually from Ilya Grigorik.

Here's the Changelog podcast that featured Ilya Grigorik talking about GH Archive, and also touched on Changelog's use of the work for their Changelog Nightly feature, https://changelog.com/podcast/144.


Can you please give the link for subscription?



If you don't want to subscribe, you can access any "nightly version" by going directly to

  http://nightly.changelog.com/YYYY/MM/DD/
For example: http://nightly.changelog.com/2020/08/28/
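If you want to generate those links for a range of dates, a minimal sketch of the URL pattern above (the function name `nightly_url` is just made up for illustration):

```python
from datetime import date


def nightly_url(day: date) -> str:
    """Build the Changelog Nightly URL for a given date (zero-padded Y/M/D)."""
    return f"http://nightly.changelog.com/{day:%Y/%m/%d}/"


print(nightly_url(date(2020, 8, 28)))
```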

There's a browsable archive of the weekly version: https://changelog.com/weekly/archive


I archive github repos, for my own purposes, into my rsync.net account:

  ssh user@rsync.net "git clone git://github.com/freebsd/freebsd.git freebsd"


If you haven't already considered it, you might want to add -n to avoid the checkout (since the .git is the real value of that operation). Depending on your objective, you might also want --recurse-submodules to pull down submodules in order to get the whole story about what's "in the repo".
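For scripting this over many repos, a small sketch that only assembles the git command line. The helper name `clone_command` and its parameters are hypothetical; `--no-checkout` is the long form of -n.

```python
def clone_command(url, dest, checkout=False, submodules=False):
    """Return the argv list for a git clone; pass it to subprocess.run to execute."""
    cmd = ["git", "clone"]
    if not checkout:
        cmd.append("--no-checkout")  # same as -n: keep .git, skip the working tree
    if submodules:
        cmd.append("--recurse-submodules")
    cmd += [url, dest]
    return cmd


print(clone_command("git://github.com/freebsd/freebsd.git", "freebsd"))
```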


nice feature!


There is a cool, similar project that stores a mirror of GitHub on the decentralized cloud (STORJ) – here: https://gitbackup.org/#/


What I don't understand about this:

> GH Archive is a project to record the public GitHub timeline

What does that even mean? So it's basically a massive mirror of GitHub. Isn't that an enormous undertaking?


There are a lot of events on GitHub by users. I don't think they include the object blobs, so the timeline is actually pretty small (I think the dataset was in terabytes).
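To give a feel for what "events" means here, a rough sketch that tallies event types from one archive file. It assumes the hourly archives are gzip-compressed files with one JSON event per line, each carrying a `type` field, as I understand the format from gharchive.org. The sample records below are made up so the snippet runs without downloading anything.

```python
import gzip
import json
from collections import Counter


def count_event_types(gz_bytes: bytes) -> Counter:
    """Count events by their `type` field in a gzip'd JSON-lines archive."""
    counts = Counter()
    for line in gzip.decompress(gz_bytes).splitlines():
        if line.strip():  # skip blank lines
            counts[json.loads(line)["type"]] += 1
    return counts


# Made-up sample events standing in for a real hourly archive.
sample_events = [
    {"type": "PushEvent", "actor": {"login": "alice"}},
    {"type": "WatchEvent", "actor": {"login": "bob"}},
    {"type": "PushEvent", "actor": {"login": "carol"}},
]
sample = gzip.compress("\n".join(json.dumps(e) for e in sample_events).encode())

print(count_event_types(sample))
```

Each record is metadata about an action (who pushed, who starred), not the repository contents themselves.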


Oh thanks for clarifying. And here I was thinking this was a capture of all the binary blobs, which would be massive!



