
Two is One, One is None. There are absolutely ways around this: it's called redundancy. The marginal cost of laying an extra pair during physical plant installation is basically $0, which is why you'd never say "well, we need a backup for the backup, so there's no point in having two pairs." Similarly, the marginal cost of having a second UPS and PDU in a rack is effectively $0 at scale, so nobody would argue that this is unnecessary for dealing with a possible UPS failure or an accidentally unplugged cable.

In this case, there are likely several things that can be changed systemically to mitigate or prevent similar failures in the future, and I have every faith that Facebook's SRE team is capable of identifying and implementing those changes. There is no such thing as "no way around it", unless you're dealing with a law of physics.
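To put rough numbers on "Two is One, One is None", here is a minimal sketch of the arithmetic. It assumes each redundant unit fails independently with the same probability, which is an idealization, and the 1% figure and the function name are made up for illustration.

```python
def all_fail_probability(p_single: float, copies: int) -> float:
    """Probability that every one of `copies` units fails at once.

    Assumes failures are independent and equally likely, which real
    hardware sharing a rack, a feed, or a conduit may not satisfy.
    """
    if not 0.0 <= p_single <= 1.0:
        raise ValueError("p_single must be a probability in [0, 1]")
    if copies < 1:
        raise ValueError("copies must be at least 1")
    return p_single ** copies


if __name__ == "__main__":
    # Hypothetical 1% chance that a single UPS is down at a given moment.
    for n in (1, 2, 3):
        print(f"{n} unit(s): P(all down) = {all_fail_probability(0.01, n):.6f}")
```

Under that independence assumption, each extra copy multiplies the chance of total loss by the single-unit failure probability, which is why a cheap second unit is such an easy call.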



By "no way around it" I mean you're going to need to create a circular dependency at some point, whether it's a maintenance network that's used to manage itself, or the prod network for managing the maintenance network.

I absolutely agree that installing a maintenance network is a good idea. One of the big challenges, though, is making sure that all your tooling can and will run exclusively on the maintenance network if needed.

(Also, while the marginal cost of laying an extra pair of fiber during physical installation may be low, the cost of making sure that you have fully independent failure domains is much higher, whether that's leased fiber, power, etc.)
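For what it's worth, the circular dependency I'm describing is easy to make concrete. Below is a rough sketch, not anyone's actual tooling: it models "X is managed via Y" as a directed graph and uses a depth-first search to find a loop. The node names and the `find_cycle` helper are hypothetical.

```python
from typing import Dict, List, Optional


def find_cycle(deps: Dict[str, List[str]]) -> Optional[List[str]]:
    """Return one dependency cycle as a list of nodes, or None if acyclic.

    `deps` maps each system to the systems it needs in order to be managed.
    """
    in_progress = set()   # nodes on the current DFS path
    done = set()          # nodes fully explored with no cycle found
    path: List[str] = []

    def visit(node: str) -> Optional[List[str]]:
        in_progress.add(node)
        path.append(node)
        for nxt in deps.get(node, []):
            if nxt in in_progress:
                # Back edge: the loop runs from nxt's position to here.
                return path[path.index(nxt):] + [nxt]
            if nxt not in done:
                found = visit(nxt)
                if found:
                    return found
        path.pop()
        in_progress.discard(node)
        done.add(node)
        return None

    for node in deps:
        if node not in done:
            found = visit(node)
            if found:
                return found
    return None


if __name__ == "__main__":
    # Hypothetical layout: each network is managed over the other one.
    deps = {
        "prod_network": ["maintenance_network"],
        "maintenance_network": ["prod_network"],
    }
    print(find_cycle(deps))
```

A maintenance network that manages itself shows up the same way, as a node that lists itself as a dependency. The check doesn't remove the loop; it just tells you where it is so you can decide which edge needs a manual or out-of-band fallback.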



