Hacker News

Did you consider using a mirror network, with servers run by external organizations, instead of going with AWS bandwidth for rubygems? Seems like that would be a good approach for the static/bulk part of your dataset, and there are lots of companies and universities who are set up to serve software. (The mirror I manage serves about 50 TB/month for several Linux distros, and many sites are larger.) Do the work and infrastructure required to manage these networks make them not worthwhile?

Edit: Found a post [0] calling for a rubygems mirror network. Otherwise there is lots of information about setting up local mirrors of the repository.

[0] http://binarymentalist.com/post/1314642927/proposal-we-have-...



It's been discussed many times before, yes. The way our users consume RubyGems makes any kind of mirror delay unacceptable. We currently run a number of mirrors, configured as caching proxies. I want to get us going on a CDN like Fastly soon, because they provide effectively the same functionality but distributed to many, many more POPs than I could ever set up myself.


I suspect mirror delay is less of an issue than you might perceive it to be. Many CPAN mirrors manage to stay within tens of seconds/no more than a minute from the main CPAN mirror that PAUSE publishes to.


If it's just the sync delay, you could track each mirror's last-updated time and only direct users to a mirror that had synchronized with the master since the package in question was released. Otherwise, serve the content from AWS. Though I'm sure this couldn't beat the service that Fastly's donating.
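The selection scheme described above could be sketched roughly like this (a toy illustration, not anything RubyGems actually does; the mirror hostnames, sync timestamps, and origin name are all invented):

```ruby
require "time"

# Pick a mirror only if it has synced since the gem was published;
# otherwise fall back to the origin. All hosts/times here are made up.
MIRRORS = {
  "mirror-a.example.org" => Time.parse("2012-11-01 12:00:05 UTC"),
  "mirror-b.example.org" => Time.parse("2012-11-01 11:30:00 UTC"),
}
ORIGIN = "rubygems-origin.s3.amazonaws.com"

def host_for(gem_released_at)
  # Only mirrors whose last sync postdates the gem's release are safe.
  fresh = MIRRORS.select { |_, synced_at| synced_at > gem_released_at }
  fresh.empty? ? ORIGIN : fresh.keys.sample
end

released = Time.parse("2012-11-01 11:45:00 UTC")
puts host_for(released)   # only mirror-a has synced since this release
```

The bookkeeping cost is just one timestamp per mirror, reported on each sync.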


The caching mirror configuration achieves nearly the same thing. In the past, people have wanted to run their own mirrors that we directed people to, but that's got reliability and security issues.


Mirrors shouldn't be a security concern: the signatures of packages should come from "headquarters". The same goes for reliability: clients should be able to (and SHOULD) pull from multiple sites simultaneously.
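The trust model being suggested here can be sketched in a few lines: the mirror serves the bytes, but the client verifies them against a signature from headquarters' key, which it has pinned locally. This is a generic detached-signature illustration, not RubyGems' actual signing scheme; the key generation stands in for a key that would ship with the client.

```ruby
require "openssl"

hq_key  = OpenSSL::PKey::RSA.new(2048)                   # headquarters' keypair
pub_key = OpenSSL::PKey::RSA.new(hq_key.public_key.to_pem)  # pinned by clients

gem_data  = "contents of foo-1.0.0.gem"                  # bytes from any mirror
signature = hq_key.sign(OpenSSL::Digest::SHA256.new, gem_data)

# Client side: accept the download only if the signature checks out.
ok = pub_key.verify(OpenSSL::Digest::SHA256.new, signature, gem_data)
puts ok   # => true

# A tampered download from a malicious mirror fails verification.
puts pub_key.verify(OpenSSL::Digest::SHA256.new, signature, gem_data + "!")  # => false
```

With this in place, a compromised mirror can at worst withhold or delay packages, not alter them.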


Even if package signing works perfectly, when I connect to a mirror and request a patch for foo, the mirror learns my IP address and the fact I have an as-yet-unpatched version of foo.


Very true on the signatures. Using multiple sites isn't necessary though, imho.


I could be wrong, but it seems like a nice hack to pull from, say, 3 mirrors at the same time, each starting at a different offset into the resource, using range GETs of, say, 16k each. The first one to complete issues a pipelined request for the next 16k slot, and the process continues until the entire asset is downloaded. Fast mirrors would dominate, slow mirrors would contribute a small percentage of the bandwidth, and truly slow mirrors would be ignored.
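The chunk-claiming scheme described above is essentially work stealing over a queue of byte offsets. Here's a toy simulation of that idea (the "mirrors" are just lambdas with artificial latencies standing in for HTTP range GETs; everything here is invented for illustration):

```ruby
require "thread"

CHUNK = 16 * 1024
asset = "x" * (CHUNK * 8)                       # the full remote asset

mirrors = { fast: 0.001, medium: 0.005, slow: 0.02 }  # simulated latency (s)

# Shared queue of unclaimed offsets; fast mirrors drain it fastest.
offsets = Queue.new
(0...asset.size).step(CHUNK) { |o| offsets << o }
mirrors.size.times { offsets << nil }           # one poison pill per worker

parts = {}
lock  = Mutex.new

threads = mirrors.map do |_name, latency|
  Thread.new do
    while (off = offsets.pop)                   # nil pill ends the loop
      sleep latency                             # stand-in for a range GET
      data = asset[off, CHUNK]
      lock.synchronize { parts[off] = data }
    end
  end
end
threads.each(&:join)

result = parts.keys.sort.map { |o| parts[o] }.join
puts result == asset   # => true
```

Real range GETs would need mirrors that send `Accept-Ranges: bytes`, plus a checksum over the reassembled file, since interleaving chunks from multiple sources widens the tampering surface.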



