
If BGP is busted, those errors were probably stuff like connection timeouts on services running inside borg/k8s.

But like, it's not impossible to measure temp on devices. It's entirely possible to design a temp monitor that tracks the physical grouping of equipment: datacenter -> aisle -> rack -> server
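Something like this, say (a rough Python sketch with made-up names, data, and thresholds, not a claim about how any particular datacenter actually does it):

    # Roll temperature readings up through the physical hierarchy so an
    # alert can point at the hottest rack, not just one server.
    from collections import defaultdict
    from typing import NamedTuple

    class Location(NamedTuple):
        datacenter: str
        aisle: str
        rack: str
        server: str

    readings = {  # degrees C, invented data
        Location("dc1", "a3", "r12", "srv-041"): 71.0,
        Location("dc1", "a3", "r12", "srv-042"): 88.5,
        Location("dc1", "a3", "r13", "srv-100"): 65.0,
    }

    def hottest_per_rack(readings):
        """Max temperature observed in each (datacenter, aisle, rack)."""
        per_rack = defaultdict(float)
        for loc, temp_c in readings.items():
            key = (loc.datacenter, loc.aisle, loc.rack)
            per_rack[key] = max(per_rack[key], temp_c)
        return per_rack

    for rack, temp_c in hottest_per_rack(readings).items():
        if temp_c > 85.0:  # arbitrary threshold for the example
            print(f"rack {rack} is running hot: {temp_c} C")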

The real question is one of prioritization: if a rack's temp has risen but there's no customer error, is it an immediate problem?



> if a rack's temp has risen but there's no customer error, is it an immediate problem?

Here, the problem was on a machine running a google application, so they noticed. But this is a post on the google cloud blog. This just makes me think that Google isn’t monitoring the health of the hardware they provide to customers in the cloud. It is a change you have to make when you change the layer at which you are providing services to customers. If I’m using the google maps web site, I don’t care if they are monitoring cpu temperature if layers above insulate me from impact. If I’m spinning up a virtual machine, I will be directly impacted.


Netflix keynotes described how entire AWS AZs can and will go offline, and how to induce failures to exercise recovery paths, so why is the evaluation criterion for GCP 'my single-point-of-failure VM cannot ever go down'?
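The fault-injection part is conceptually tiny, something like this (a hedged sketch; the terminate call is a stub, since the real one depends on your provider's API, and the instance names and opt-in pool are invented):

    # Chaos-Monkey-style loop: periodically kill one random instance
    # from an opt-in pool so recovery paths get exercised for real.
    import random
    import time

    OPT_IN_INSTANCES = ["web-1", "web-2", "web-3"]  # hypothetical names

    def terminate_instance(instance_id: str) -> None:
        # Placeholder for the provider's real terminate/delete API call.
        print(f"terminating {instance_id} to exercise failover")

    def chaos_loop(interval_s: float = 3600.0) -> None:
        while True:
            terminate_instance(random.choice(OPT_IN_INSTANCES))
            time.sleep(interval_s)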


Most companies are not Netflix and I’m not sure I understand why we are discussing AWS?

That post is more a statement on how errors which can be handled at the app layer can still have catastrophic effects on lower-level components. You cannot assume end customers are running things at the scale or with the fault tolerance of Google or Netflix.


Most companies are not Netflix, but all cloud customers can learn from their public design discussions. The only reason I mentioned AWS is that Netflix is a high-profile AWS customer, and their lessons in cloud architecture apply pretty cleanly to GCP. You cannot assume an SLA of 100 percent, even if it works out that way on shorter time scales. It's really no different than running your own datacenter, so I don't know where this 'monitoring will prevent catastrophe' fight is coming from.
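Back-of-the-envelope, here is the downtime budget an SLA target actually implies:

    # Minutes of downtime per year allowed by a few common SLA targets.
    MINUTES_PER_YEAR = 365.25 * 24 * 60

    for sla in (0.999, 0.9995, 0.9999):
        allowed = (1 - sla) * MINUTES_PER_YEAR
        print(f"{sla:.2%} SLA -> about {allowed:.0f} minutes down per year")
    # 99.90% -> ~526, 99.95% -> ~263, 99.99% -> ~53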

> You cannot assume end customers are running things at the scale or with the fault tolerance of Google or Netflix.

Correct, but there's a gradient between 'we have 10 copies of the service in 10 different countries and use Akamai GTM in case of outage' and Dave's one-off VM. One-off VMs are fine if you know what you're getting into, and I use that setup for my personal, low-stakes, zero-revenue website. But if you are a paying cloud customer, it makes sense to pay attention to availability zones regardless of scale.

And sure, there might be a market somewhere for a more durable VM setup. At a past non-profit job we provided customers the illusion of a single HA VM using Ganeti (http://www.ganeti.org/). But it's not clear to me that the segment is viable -- customers at the low and top end don't need the HA.


Certainly, but you're assuming that Google actually makes that assumption.

It's completely possible to have different alerting setups for different binaries, and to consider CPU throttling not a page-worthy issue for GFEs, but page-worthy for GCP jobs.
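Roughly this kind of policy (a toy Python sketch with invented names; a real deployment would express this in alerting config rather than code):

    # The same signal -- CPU thermal throttling -- can page for one class
    # of workload and only file a ticket for another.
    PAGE, TICKET = "page", "ticket"

    def throttle_severity(workload_class: str) -> str:
        # A frontend behind a load balancer can be drained quietly;
        # a customer VM pinned to the throttling host cannot.
        if workload_class == "customer_vm":
            return PAGE
        return TICKET

    assert throttle_severity("customer_vm") == PAGE
    assert throttle_severity("frontend") == TICKET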


> The real question is one of prioritization: if a rack's temp has risen but there's no customer error, is it an immediate problem?

A server CPU that's thermal throttling is about one step less serious than shutting off. While it's not urgent to deal with dead machines, a dying machine should be in the top priority bracket you see on a day-to-day basis.
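Catching the dying-but-not-dead case is cheap, too. On Linux hosts with Intel CPUs the kernel usually exposes throttle counters under /sys (exact paths vary by platform and kernel), so a fleet health check can be as simple as this rough sketch:

    # Count thermal throttle events reported by the kernel (Intel/Linux;
    # path is an assumption and may differ on other platforms).
    import glob

    def total_throttle_events() -> int:
        total = 0
        paths = glob.glob(
            "/sys/devices/system/cpu/cpu[0-9]*/thermal_throttle/core_throttle_count"
        )
        for path in paths:
            with open(path) as f:
                total += int(f.read().strip())
        return total

    if total_throttle_events() > 0:
        print("this machine has been thermal throttling; flag it for repair")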



