Let me see if I get this straight... Developer has a Python script that takes 1 ...

tgsovlerkhgsel · on May 31, 2019

Developer receives an e-mail "we noticed something weird. btw here's our TOS. please explain".

Developer explains as requested.

Developer receives "k, we've restored your account", which sounds an awful lot like "what you're doing is fine".

Developer gets shut down again despite having explained the behavior as requested (i.e. the explanation is on file) and despite the explanation being considered sufficient by DO.

The "with the details you provided, we've removed the hold" e-mail is hard to read as anything other than "sorry for the misunderstanding, this is fine", especially since the initial e-mail asked them to explain to "ensure your account is not subjected to additional scrutiny or placed on billing hold". If they meant "OK, but don't do that again", they should have said so.

cptskippy · on June 1, 2019

My guess is that the account suspensions are triggered by an automated monitoring process, not by someone sitting in a NOC watching activity.

Think of that automated monitoring like a circuit breaker in an electrical panel. If you plug an air compressor into a 15 amp circuit and it pops the breaker, would you run the compressor again once you flip the breaker back on? No? Then why would you think it's OK to perform the same activity that tripped the automated system?

The OP seemed to be aware that spooling up 10 VMs to do whatever he did was what got him booted, so... why didn't he reach out to DO to find out exactly what alarms he tripped and then take action to either get his account whitelisted or modify his process to ensure it didn't trip the automated alarms again?

He just got his account unbanned (flipped the circuit breaker) and fired his job back up (turned the compressor back on).

aftbit · on May 31, 2019

Why would DO be upset about spinning up 10 VMs then spinning them down again? Isn't this exactly the point of cloud providers? This is what they bill me for, right?

SteveNuts · on May 31, 2019

Smaller VPS providers like Linode or DO oversubscribe like crazy. Last time I used Linode, they would email us telling us we were using too much CPU or memory and needed to move to a larger-tier VM.

Aaronn · on May 31, 2019

I think you misunderstood those emails. They are just there to help you if you didn't realize some process was stuck or something; they specifically say "This is not meant as a warning or a representation that you are misusing your resources." You can also change the value that triggers those emails or disable them completely.

SteveNuts · on May 31, 2019

No it was definitely a ticket from their support, telling us we were noisy neighbors. They told us we needed to increase the size of our VMs.

cptskippy · on May 31, 2019

I guess it depends on what each of those VMs is doing. I don't trust the Dev's explanation tbh.

> to make it faster we execute it in parallel on multiple droplets ~10 that we set up only for this pipeline and shut down once it’s done.

A tilde preceding a number means "approximately". Why doesn't he know exactly how many VMs he spun up?

Was he actually spinning up 1 VM per record and only allowing 10 VMs to be running concurrently?

I'm not a Python dev, but why can't you execute 10 instances of Python on a single VM?

If you need to dedicate an entire VM to processing 1 row, what the hell is it doing?
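For what it's worth, running several Python processes on one VM is straightforward with the standard library's `multiprocessing` module. A minimal sketch, assuming the per-row work is CPU-bound and independent for each row; `process_row` and `run_pipeline` are hypothetical names, not the developer's actual code, and whether 10 workers would fit on one droplet depends on its core count and memory, which the post doesn't state.

```python
import os
from multiprocessing import Pool
from typing import List, Optional


def process_row(row: str) -> int:
    # HYPOTHETICAL per-row work; stands in for whatever the real script does.
    # Must be a top-level function so worker processes can unpickle it.
    return len(row)


def run_pipeline(rows: List[str], workers: Optional[int] = None) -> List[int]:
    # Default to one worker process per CPU core on this single machine.
    workers = workers or os.cpu_count() or 1
    with Pool(processes=workers) as pool:
        # map() splits rows across processes and returns results in input order;
        # chunksize batches rows to reduce inter-process overhead.
        return pool.map(process_row, rows, chunksize=1000)


if __name__ == "__main__":
    # The __main__ guard is required on platforms that spawn worker processes.
    rows = [f"row-{i}" for i in range(500_000)]
    results = run_pipeline(rows, workers=10)  # 10 Python processes, 1 VM
    print(len(results))
```

If the VM has fewer cores than `workers`, the extra processes just time-share the CPU, so the worker count is usually set near `os.cpu_count()`.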

developer2 · on June 1, 2019

My gut feeling agrees with "I don't trust the Dev's explanation tbh." Companies and developers often try to get away with doing Bad Things, then cry wolf publicly without offering specifics about what they were trying to get away with. "Spin up 10 VMs for 500k rows of data" offers no explanation of just what those 10 VMs are doing. There is a big difference between "using memory and CPU" and "saturating the network in abusive ways".

Random speculation about one possibility: each of those 10 instances was suddenly doing something unexpected and spammy with the network. Maybe sending 500k+ emails (one per row of data claimed by the developer) over SMTP in a very short period, or producing massive spikes in torrent traffic, or crawling sites to scrape data (maybe each of the 500k rows is just a domain name, and they crawl every URL on those domains, possibly turning 500k rows into hundreds of millions of HTTP requests).

The postmortem will be interesting. If DO is truly at fault here, that email after the second lockout saying the account is locked after review, no further details required... bad.

SirensOfTitan · on May 31, 2019

The developer is a 2 person team. Why would they use multiple clouds at that stage?

Additionally, if 10 spun-up VMs are considered an "unreasonable" load on DigitalOcean infrastructure, I shudder at the concept of building anything on the service. Does DigitalOcean even define "unreasonable" in their terms, or is it kept vague?

bdcravens · on May 31, 2019

My understanding is that it isn't the 10 VMs as much as the resource usage (my suspicion is that DO is running a lot closer to the margin than larger providers, so they police this more). So they probably pegged all the CPUs at 100%. (Perhaps a message queue approach would have been easier on the resources.)

donarb · on June 1, 2019

He did mention Redis, so I assume some sort of message queueing was in play.

bdcravens · on June 1, 2019

Using Redis doesn't guarantee it's being used as a message queue, but that's definitely a possibility.
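The message-queue idea can be sketched with the standard library: a producer enqueues rows and a fixed number of workers pull from the queue, so the worker count, not the number of machines spun up, sets the load. Across several droplets the queue would live in a shared broker; a Redis list with `LPUSH` on the producer side and `BRPOP` on the workers is one common way to do that, though the thread doesn't say how the developer actually used Redis. Here `queue.Queue` and threads stand in for the broker and workers so the example runs on its own; `handle`, `worker`, `run`, and the queue size are hypothetical.

```python
import queue
import threading
from typing import List

SENTINEL = None  # stop signal; assumes real rows are never None


def handle(row: str) -> str:
    # HYPOTHETICAL per-row work.
    return row.upper()


def worker(jobs: queue.Queue, results: List[str], lock: threading.Lock) -> None:
    # Pull rows until the stop signal arrives; a worker only takes
    # the next row after finishing the previous one.
    while True:
        row = jobs.get()  # blocks until an item is available
        if row is SENTINEL:
            break
        out = handle(row)
        with lock:
            results.append(out)


def run(rows: List[str], num_workers: int = 4) -> List[str]:
    # Bounded queue: the producer blocks while 1000 rows are waiting.
    jobs: queue.Queue = queue.Queue(maxsize=1000)
    results: List[str] = []
    lock = threading.Lock()
    threads = [
        threading.Thread(target=worker, args=(jobs, results, lock))
        for _ in range(num_workers)
    ]
    for t in threads:
        t.start()
    for row in rows:
        jobs.put(row)  # enqueue the work
    for _ in threads:
        jobs.put(SENTINEL)  # one stop signal per worker
    for t in threads:
        t.join()
    return results  # completion order, not input order
```

Threads share the GIL, so for CPU-heavy rows the workers would be separate processes (or separate machines) reading the same broker; the queue-and-fixed-consumers structure stays the same.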

jwilk · on June 1, 2019

What's wrong with pegging all the CPUs at 100%?

truncate · on May 31, 2019

`unreasonable load` sounds pretty vague. What counts as unreasonable? 10 VMs doesn't sound like much, and I believe that if I'm renting a VM with XYZ specs, I should be allowed to use up to the maximum capacity stated in those specs. What am I missing here?

unilynx · on June 1, 2019

Also, DigitalOcean has a hard droplet (VM) limit that requires a ticket to raise. You would expect them to be okay with you staying within that limit.

(And you can't just start 10 hugely expensive VMs either; the larger sizes are initially locked too.)

mikeash · on June 1, 2019

Except it wasn’t “whatever you did, don’t do that.” It was, “this looks weird, can you explain?” Followed by “OK, sounds good, carry on.” Then they got shut down.

hinkley · on June 1, 2019

I get hives when I hear people porting all of their stuff to one cloud provider so they can get rid of their server rooms.

Are we getting to a point yet where that’s considered to be suicidal? I hope so.