We had a similar incident with DO 8 days ago. It didn't kill our company, but we got hit hard.
Our business is Dynalist, an online outliner app. Many of our users store all their notes on Dynalist, so uptime is really important.
Starting at 7 PM last Tuesday, we saw a slowdown in request handling. We filed a ticket with DO 2 hours after that (we also posted our initial tweet to keep our users informed: https://twitter.com/DynalistHQ/status/1131087411797270529).
A few hours later, we started to experience full downtime. Still no reply from DO. We filed another ticket with the prefix "[URGENT]". Still no reply.
We waited 24 hours for a reply. We took turns taking naps because we're only a 2-person team.
After 24 hours, we tweeted @ DO (https://twitter.com/DynalistHQ/status/1131397013306847232). 2 hours later we finally got a support person working on our ticket. We didn't want to take it to social media, but there didn't seem to be any other way at that point. DO doesn't have phone support, and us "bumping" our support ticket didn't work either.
After 2 hours of going back and forth on the support ticket and providing logs, DO's support person identified the issue and offered to move us to a less crowded server. They asked us what would be a good time to do a manual migration if a live migration failed, and we replied immediately saying whenever was fine (we were experiencing downtime anyway).
We thought it was over, but we were so wrong.
They didn't reply for another 4 hours. That was 4 more hours of downtime. Sometimes CPU steal dropped a bit and our server could catch up on some requests, although it would still take 10 seconds for our users to open Dynalist. But most of the time, our web app was totally inaccessible. Watching the charts on our dashboard go up and down felt like some of the hardest hours of my life... mainly because there was nothing we could do.
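(For context, "CPU steal" is the share of time the hypervisor takes our virtual CPU away to serve other tenants on the same host, which is why moving to a less crowded server fixes it. If you're curious, here's a rough sketch of how you could measure it yourself on a Linux droplet by sampling /proc/stat; this is just illustrative, not our actual monitoring setup:)

    import time

    def read_cpu_counters():
        # First line of /proc/stat holds aggregate CPU tick counters:
        # user nice system idle iowait irq softirq steal guest guest_nice
        with open("/proc/stat") as f:
            values = list(map(int, f.readline().split()[1:]))
        steal = values[7] if len(values) > 7 else 0  # 8th field is "steal"
        return sum(values), steal

    total1, steal1 = read_cpu_counters()
    time.sleep(5)
    total2, steal2 = read_cpu_counters()
    print("CPU steal over the last 5s: %.1f%%"
          % (100.0 * (steal2 - steal1) / (total2 - total1)))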
4 hours in, I realized we had to post another angry tweet to get a solution. There was nothing else to do other than try to stay awake anyway. So I posted another tweet: https://twitter.com/DynalistHQ/status/1131497962184564737
This tweet didn't seem to work. Nothing happened in the next 3.5 hours and things started to feel surreal. I didn't know how much longer the downtime was going to last, and I didn't know what we were going to do about it.
By then it was 9:30 AM EDT and people were starting their day. We were getting more and more emails and tweets asking what was going on and where their notes were. A few customers were angry, but most were understanding and supportive.
At 9:55 AM EDT, DO finally did the live migration a few minutes before the time limit we gave them, which was 10 AM. That was the end of the incident; CPU steal was down to < 1% and Dynalist was finally up again.
However, we couldn't trust DO anymore. This weekend we're migrating to a dedicated server provider that has phone and live chat support. DO is pretty good for spinning up a $5 box quickly to test something, but we learned the hard way that we shouldn't rely on it.
Our postmortem post: https://talk.dynalist.io/t/2019-05-22-dynalist-outage-post-m...