Why we don't schedule deployments during off-hours (scoutapp.com)
85 points by joshuacc on Oct 21, 2010 | 54 comments


Our deployment process was always at night (10,000+ user org).

Even if your process is great and there is no downtime during upgrades, any problem might cause the whole organization to lose thousands of work hours (instead of just your team working during the night).

If you're deploying a "nice-to-have" app, it's all fine.

Otherwise, get to work late, order a pizza and prepare for a night of adventure.


The better approach is to figure out where the risk in pushing updates is coming from and develop tools to mitigate that. More testing, pseudo-production deployment infrastructure, better change management, more frequent deployments, better understanding of the changes going out with big deployments, etc.

The biggest companies in the world (Amazon, Google, etc.), who have literally millions of dollars an hour/minute/second at risk if they screw up deployments, put all of these tools in place so they can push out deployments at peak times and be confident that they aren't going to destroy their business by accident.


When I was at Google working on web search, we didn't push search infrastructure during peak hours. I think this was mostly due to wanting to make sure we had spare serving capacity to fail over to, in case the push was broken/had performance problems/crashed/etc. Note that "peak hours" isn't the same as "working hours". Just a few hours when we saw our highest traffic.

However, we also had a strict policy of no pushing in the middle of the night, Fridays, or weekends, for all the reasons the article listed. Most problems on a relatively mature system occur when you're changing stuff. When stuff breaks, you want to have the people involved be fresh and thinking clearly. You also want everyone who might be needed to fix something available, in the office and in front of their computers if something goes wrong. Lastly, it's not sustainable to have your engineers/operations folks regularly giving up their nights and weekends for routine operations. You should save that karma for when everything goes to hell.

This is all predicated on pushes not negatively affecting production traffic when everything goes according to plan, and on minimizing the impact as much as possible when something does break. One of the best ways I know to do that is to increase the amount of traffic in proportion to your confidence that the change is good. The idea is that any push starts off by only affecting a small part of your infrastructure/traffic, and the longer you run it without problems, the more traffic/infrastructure you push it to.

At Shopkick, we do this by pushing first to a single machine, watching it for any errors for a few minutes, then gradually pushing it to the rest of our machines. If anything goes wrong, only a portion of traffic was affected, and the portion is well correlated to the probability of having a problem.
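As a rough illustration of the "one machine first, then gradually the rest" idea, here is a minimal Python sketch. It is not Shopkick's tooling: the function name `staged_rollout`, the `deploy` and `healthy` callbacks, and the stage fractions are all assumptions made up for the example. In practice `healthy` would watch error rates for a few minutes before returning.

```python
from typing import Callable, Sequence


def staged_rollout(
    hosts: Sequence[str],
    deploy: Callable[[str], None],
    healthy: Callable[[str], bool],
    stages: Sequence[float] = (0.05, 0.25, 1.0),  # hypothetical fractions
) -> tuple[list[str], bool]:
    """Deploy to growing fractions of `hosts`, stopping at the first failure.

    Returns (hosts that received the new version, whether all checks passed).
    """
    done: list[str] = []
    for fraction in stages:
        # Always cover at least one host, even for very small fleets.
        target = max(1, int(len(hosts) * fraction))
        for host in hosts[len(done):target]:
            deploy(host)
            done.append(host)
            if not healthy(host):
                # Stop here; the caller rolls back only the hosts in `done`.
                return done, False
    return done, True
```

If the first host fails its check, only that host ever ran the new code, which is the "portion of traffic affected tracks the probability of a problem" property described above.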


Conditional deployment with the ability to control the level of deployment incrementally is a hugely powerful tool. Twitter and Amazon both use them, at least for higher risk changes. The ability to "roll back" a deployment completely at the press of a button, with a configuration change and no actual running bits changing, is very potent.
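One common way to get this kind of dial is a percentage-based feature flag. The sketch below is an assumption about how such a check could look, not a description of what Twitter or Amazon run; `in_rollout`, the feature name, and the 100-bucket scheme are hypothetical.

```python
import hashlib


def in_rollout(user_id: str, feature: str, percent: int) -> bool:
    """Return True if this user falls in the first `percent` of 100 buckets.

    The bucket is derived from a hash, so a given user always gets the same
    answer for the same feature, and raising `percent` only adds users.
    """
    digest = hashlib.sha256(f"{feature}:{user_id}".encode()).digest()
    bucket = int.from_bytes(digest[:4], "big") % 100
    return bucket < percent
```

With `percent` read from configuration, setting it to 0 turns the new code path off for everyone without shipping new bits, and raising it in steps widens exposure gradually.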


This is a cost-value calculation and neither approach is "best" in all cases. Does the cost of developing the tools and procedures exceed the cost of planned downtimes during the next X years?

Often you build some tools that mitigate some of the risk and plan deployments at a less active hour to mitigate the rest of it.


> prepare to [have] a night of adventure.

This really put a smile on my face.

That's the best approach to take with anything, but it's especially applicable to technical stuff that's complex and hard to predict.

I've wasted a fair amount of time being pissy and figured out this: with computers, as in life, the only thing over which you have complete control is your attitude. Plan to have fun and learn things and even the sub-optimal can be a good time.


> prepare to [have] a night of adventure.

This is exactly why we don't deploy at night. In what other line of work do you schedule complex work when you're going to be tired?


I really hope they don't start fixing roads during the day.. you know.. so the workers are not tired.

Or major electrical upgrades.. or phones.. or local internet backbone.. or water.. or whatever other utility that is more important to my life than GitHub (which was down today)

How hard is it to work at night one time ??


The big point seems to be "Many web applications, including Scout, have customers around the world. There isn’t a perfect time for a deployment."

It is like a road stuck in a permanent rush hour -- if the time of repair has little effect on the users' annoyance, then it's a good idea to pick the time to maximize your performance.

Working at night isn't hard in the sense of "I don't know if I can pull that off," but in the sense of "I don't know how well I will pull that off." Anyone can pull an all-nighter or two, but most people will make more errors at night than during the day.


For roads it's almost impossible to avoid impacting use during maintenance; the same isn't, or shouldn't be, true of software.


1) It takes a tough man to make a tender chicken.

2) You should know what you're going to do ahead of time, and you should have done tests beforehand. The after-hours work isn't that complex if you're doing it right.


> 1) It takes a tough man to make a tender chicken. 2) You should know what you're going to do ahead of time, and you should have done tests beforehand. The after-hours work isn't that complex if you're doing it right.

1. You have no idea how many systems have been utterly destroyed by "tough men" working beyond their limits. I have seen this many places I've worked as a SysAdmin... it's like a mind disease, and it's not any good for anyone.

2. You need to prepare ahead of time, and you need an easy way to roll back. But you /always/ need to be prepared just in case something goes wrong.

This is the other classic mistake: Make the new guy stay up all night handling the upgrade. I mean, the senior guys have tested it, right? It ends in tears for all involved. The new guy is the least able to deal with the stresses of doing the upgrade (and the rollback, and even if you test the upgrade, nobody tests the rollback).

Do your upgrades when your senior people are around, preferably towards the beginning of their shift. If everything goes as planned, it won't matter when you do it, because downtime will be minimal either way. If, however, it throws up on your shoes, I don't know about you, but I am significantly more effective at dealing with problems when I'm fresh. And yeah, that means I'm a wimp. But when you get down to it, most of us are.


Medicine? (Although your hand is usually forced into doing night procedures--it's not so much a choice, I agree.)


I missed some context there. I thought "yes, I need medicine after night work". Why did I completely miss the mark? Swing shifts all week.


Heh, I meant the field of medicine. But I can understand your perspective :)


The trouble with that is the deployment is expected to be simple and under your control.

The real fun bit comes when users get at it. By doing the deployment at night you find out if it worked at 9:00am the next morning when 1000s of employees connect and try to do some work. If there are any issues you are then in deep xxxx

By doing deployment in the afternoon, people are offline for the hour or so it takes to deploy but then you have all night to fix any issues.


Deploying new code should never be an "adventure". It should be a simple, replicable process that allows you to safely stop or rollback at any given time. It should have enough checks in place that the impact from any given mistake is minimized. There should never be any "oh shit" moments.


There will always be "oh shit" moments, no matter how much you think there won't. Yeah, you can mitigate a lot of them, but things get through. Even using deploys which can be rolled back, frozen dependencies or bundler for us Rails folk, shit can and will happen.


You're right - I should be more clear what I mean. "Oh shit" moments caused by a change you made due to pushing a new piece of code should never last more than a minute or two, between the time you realize something went wrong and the time you run the command that reverts things back to the known good state. At Shopkick, this usually means running a command which flips a symlink that points at the current code/config to use and restarts/reloads the server in question.
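For readers who haven't seen the symlink trick, a minimal sketch follows. It assumes a POSIX filesystem and a hypothetical layout with one directory per release and a `current` symlink that the server reads from; the function name `point_current_at` is made up here, and the server restart/reload step is left out.

```python
import os


def point_current_at(release_dir: str, current_link: str) -> None:
    """Repoint the `current_link` symlink at `release_dir`.

    A temporary link is created first and then renamed over the old one,
    so readers see either the old target or the new one, never a missing link.
    """
    tmp_link = current_link + ".tmp"
    if os.path.lexists(tmp_link):
        os.remove(tmp_link)  # clear debris from an interrupted earlier run
    os.symlink(release_dir, tmp_link)
    os.replace(tmp_link, current_link)  # rename is atomic on POSIX
```

Rolling back is then the same call with the previous release directory, followed by whatever restart or reload the server needs.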


I work for a company that provides a mission critical service to a lot of people. Maybe we're just spoiled, but we have multiple failover environments. That means that we can deploy in the afternoon by failing the service over to a hot backup during the deployment. When the deploy is finished, we fail the backup system to the production environment and we deploy to the backups while the primary is running. Since there's 0 downtime, we like to work through the process in the morning/afternoons. No all-nighters here, and it's definitely not a nice-to-have service. It's all about architecture and design.


> Our deployment process was always at night (10,000+ user org).

You didn't even get to the second paragraph of the article? Really?

> Many web applications, including Scout, have customers around the world. There isn’t a perfect time for a deployment.


FWIW, many 10k+ user organizations are "around the world" as well.

Web apps for server cluster admins have customer bases evenly distributed across every time zone on the planet? Really?

There may not be a perfect time, but there are almost certainly better and worse times.


Maybe the differences are not big enough to justify increasing the probability of mistakes.


Granted. This is a much more reasonable response to the top of this thread than sanctimonious quoting.

See also scottyallen's comment here (http://news.ycombinator.com/item?id=1817378). If Google web search has (or recently had) such a thing as "peak hours," I'll hazard a guess that <your web app> does too.


It's been many years since I worked in a shop that didn't deploy during the day (large financial institution aside), and for all the reasons the post mentions I completely agree.

What enables a mid-day deploy, however, is a well thought-out deploy process that aims for a zero downtime deployment. This can be especially complicated when you have relational database changes baked into a deploy.

It's interesting to see some folks suggesting that they may not be able to do this during business hours. Here's the thing: when bad things happen, you need to deploy ASAP. Developing a great deploy process that allows your system to remain standing during that deployment has numerous advantages, including the ability to deploy at will, and have no fear in doing so.


Relational DB changes don't have to be particular show-stoppers, provided you're willing to accept a few constraints:

1. No select * in SQL; name your columns in selects and inserts (prevents the select result set from changing or the insert from breaking when you add a column later)

2. All tables need table aliases in join queries. (Otherwise, adding a column can cause a naming collision and introduce ambiguity into a presently non-ambiguous select.)

3. Transactional columns can be added, provided they have defaults or are nullable

4. Columns can only be dropped in two releases, one to remove all references to them and the second to actually remove. (You may want to rename the column first during the second release and drop it a few minutes/hours later once you're sure no one is relying on it.)

5. Modifying a column requires some care (to ensure that you don't lock the table too long during the modification). Here again, you can probably add a column of a different name, gradually populate it, then rename into place later, add a new table and left outer join to that, or wait for a future downtime event (rarely required in practice).

It's not necessarily "trivial" to accomplish, but it's also not deep, black magic.
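To make constraints 1 through 3 concrete, here is a small sketch using Python's built-in sqlite3 module. The `orders` table, its columns, and the helper names are invented for the example; the point is only that queries which name their columns and alias their tables keep working, unchanged, after a column with a default is added.

```python
import sqlite3


def make_db() -> sqlite3.Connection:
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE orders (id INTEGER PRIMARY KEY, total REAL NOT NULL)"
    )
    return conn


def old_insert(conn: sqlite3.Connection, total: float) -> None:
    # "Old code": columns are named explicitly, so a later ADD COLUMN is harmless.
    conn.execute("INSERT INTO orders (total) VALUES (?)", (total,))


def old_select(conn: sqlite3.Connection) -> list:
    # "Old code": no SELECT *, and the table carries an alias.
    return conn.execute(
        "SELECT o.id, o.total FROM orders AS o ORDER BY o.id"
    ).fetchall()


def migrate(conn: sqlite3.Connection) -> None:
    # Additive change only: the new column has a default, per constraint 3.
    conn.execute("ALTER TABLE orders ADD COLUMN currency TEXT DEFAULT 'USD'")


conn = make_db()
old_insert(conn, 9.99)
migrate(conn)            # schema changes while the old code is still live
old_insert(conn, 20.0)   # old insert still works
rows = old_select(conn)  # old select returns the same two-column shape
```

Lock duration and behaviour for large tables differ between database engines, so this shows only the compatibility argument, not the operational cost of the ALTER itself.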


This is a good guide, but note that he said zero downtime deployment.

That imposes an additional constraint - the new code needs to be backwards compatible (while any data migration scripts are running), and - in a clustered environment - it needs to be able to deal with the case where old and new code bases and databases are available and running simultaneously.


If you release the new code only after the DB changes are complete, the new code doesn't need to be backwards compatible.

The above guidelines are designed to provide no-change to the queries the old code is running, so that should be possible to accomplish for 95+% of your migrations.

For truly long-running migrations (longer than you're willing to wait to push code), you're correct that the new code must handle both cases. In our case, that's extremely rarely a factor, but we also don't push on every commit (nor even every day) like some shops do.


As others have mentioned, this really only works if you have a simple (and solid) deploy and rollback practice. Most people don't.

Given that, all it takes is one terrible deploy: shit goes way south and you're down for most of the day. Now you have the CEO standing around your desk nervously looking at you trying to re-federate the databases when you realize a 7PM deploy is not all that bad.

>Deploying a major update when I’m not in work-mode is awkward as well.

I'm sorry, no. I love my family and time, too... But logging in from home for an hour or two in the evening to do a deploy and validation isn't that much to ask, especially considering most of our salaries. If you're that sensitive about it, come in late the next morning.


If you've got a CEO that

1. Thinks he's helping by standing around your desk

and

2. Will happily lurk over you while you're fixing things during working hours but evidently isn't concerned, not feeling as "helpful", nor around at 7PM...

Then, you've got other problems, I'd say.


barclay: You obviously missed the point that "with customers around the world. There isn’t a perfect time for a deployment".

Come on everyone, we've got to get past this US-centric world-view, especially when it comes to web apps. If you're down, you're down. It's bad no matter what time of day. Maybe it feels worse and is more stressful when it's during the day because the CEO is in the office standing over your shoulder, but that doesn't make deploying at night a substitute for bad deployment environments or planning.


Not many companies have a customer-base that is spread evenly through the world. Not having a "perfect time" doesn't mean there isn't a "better time" than 9 AM EST.


Sounds like the issue here actually lies in the suggestion that most people don't have a sufficient deployment mechanism.


Yup, that is usually the problem: when deploys are the exception rather than the rule, you are looking for problems.


Got it in one...


I do the same, with a minor exception.

If the busiest time is during working hours, you may want to avoid it if you are taking in money during that time.

Another good reason for doing maintenance during working hours is that's when your network admins are also working - or anyone else outside your group whose help you might need.


What are off-hours? This is the Internet.

We used to have off-peak times, but even those days are mostly gone now.

Anyways, I agree with the spirit of this article. Best to do pushes when everyone is around or paying attention and you are refreshed, in case you just unleashed a disaster. :-)


While our web applications get around-the-clock use during the business week, there is plenty of downtime on the weekends. Really disruptive stuff (generally hardware or OS-level changes) is done on the weekends, but normal deployments and updates are usually scheduled in the morning, for the same reasons outlined in the original post.


This article is a little disappointing. The writer brings up the obvious advantages to it, but doesn't say how he manages the technical risk of the activity (unplanned downtime).


The best approach is to design and build a system so that you can make updates to the live system with minimal impact: no need to stop and start servers, risk unaffected components being down, etc.

Of course a lot of this depends on your framework, tools, etc., but a little forethought can go a long way.


My addendum to this rule is I never deploy new features a week before I go on vacation.


I like to have a web app launch the upgrade script. That way I can log in from my phone and start the upgrade right before I get on the plane. What's the point in having minions if they can't finish the details for you?


I think launching on weekends (if your usage is lower then) is a viable alternative. You can be fresh compared to doing stuff in the middle of the night and you can give everyone a day off during the week to make up for it.


My body and mind have taken a beating, one I don't think I can take much longer, from ten years of useless night work, swinging shifts sometimes three times a week. That guy is too happy.

That's a great list of personal reasons, but business can come first for the same reasons. My last big deployment was to half a million set top boxes. In the end, I had to do it at night, but I had oh so many reasons for doing it during the day:

* What if a percentage of the boxes bricked halfway through the deployment? A statistical insignificance in the lab doesn't help me help 15k people with bricks on their TVs. When they start playing with the boxes, I'm going to have to roll trucks and that's expensive. Instead, man up the call center during a time where we don't have to pay shift diff.

* From an end-user standpoint, there's a different set of eyes on the products at different times of day. Think about it this way, if you sell porn, how much are you selling at 2AM vs. 9AM do you think? They're not going to call and complain about their porno not working, but it doesn't mean there's not a huge level of dissatisfaction. There's a lot of general economics and trade offs involved.

* Top notch help has the luxury of sleeping at this hour because they've probably earned the tenure. I'm not going to get the vendor of a vendor to help me with this stuff at 3am, and they're definitely not going to be fresh and chipper, and getting anyone beyond support on the phone takes hours -- by that time, they're awake anyway.

Change the above as you see fit to adapt to your type of work, it's all the same.


I think it really depends on the market too. If you're deploying something that businesses use every day, taking it down during business hours is a bit of a problem. If it's something I pay for and use personally, I would probably prefer it be down while I'm at work and not 5 minutes after I get home.


Are more businesses moving towards deployments that have zero downtime?

The company I work for does not, but it seems like a lot of others are using various strategies to bring up a new version of an application and then transition users over to that, rather than bringing down the old one and then bringing up a new one.


I work in Ireland for a company whose primary market is the US which, as far as builds go, is the best you could ask for!

I do builds around 9-10am GMT after a good night's sleep when I'm nice and alert (and have had a cup or two of coffee). All the while, (most of) our target market is fast asleep in the US.


I believe in doing deployments for enterprise apps on Saturday or Sunday AM; everyone is still fresh vs. sleep deprived, and if something goes wrong, you have a long time to fix it. Plus, minimal impact on the customers.

Another option is using the more-bogus holidays (Presidents' day?), or religious holidays you might not celebrate (Easter isn't that secular, so non-Christians can give it up).

The trade for staff is that you get comp time to use later, and usually at some multiple. I love working on Christmas for 4 comp days later.

Working a weekend and getting 2-3 days off mid-week is awesome; not only do you get more done while working uninterrupted on a weekend, taking a mini-vacation during the off days is very cost effective. $49 suites in Vegas, woo!


A good approach I've used before is:

- no Friday deployments
- deployments run primarily by a single person should be done early in their work day
- avoid Monday deployments ... people are usually fuzzier on Monday than they are on Tuesday, after having had a day to go through the weekend's email and get their heads back into work.


I'm a big advocate of gradual deployment, both in terms of features (limiting your changelog) and in terms of userbase (not deploying it for everybody at once if it's a sensitive update). Not really too concerned anymore about what time of the day; I think OP has a few valid points there.


I would prefer this, but I currently have an edu gig and doing anything while students are in class is a serious recipe for disaster.


I schedule deployments during off hours.

The key is to wake up at either 4 pm or 3 am.


Wait, did someone think I wasn't serious?


Launching at off-hours or on Friday night is a way for the boss to abuse young kids.



