Ship it on Friday! On Friday deployments as a litmus test for operational excellence

Tags
EngineeringCulture
Published
December 17, 2021

Story time

Not long ago, a colleague asked me if I was OK with them merging a major system change (that I'd also reviewed) on a Friday or waiting until the next week, since I was to be on-call that weekend.

I appreciated that question for two reasons:

  • It shows consideration for the person or people who might have to deal with the impact of their change. It's no fun to spend time triaging an incident alert only to find out that it is tied to a change that could and should have been communicated proactively, which would have reduced response time.
  • As the existence of this post might suggest, I have strong feelings about Friday deployments 😅.

Philosophy

I believe that the willingness to merge and deploy code, especially major changes, on a Friday is a litmus test for operational and tech excellence. As software engineers, we are among the parties responsible for understanding how the software that we build operates in production, including the ways in which it might fail for real users.

One way to assume that responsibility is to do at least a few on-call rotations for the systems that we build. In my experience, being on call for products, especially ones with users all around the world, has rewired my brain to think more expansively about risk. I have learned to think not only about risks that can stem from code, but also about operational risks and even changes to products that could better assist users when things are going wrong. One of the practices I’ve developed as a result is asking myself,

What risks do I need to mitigate to feel confident deploying this change on a Friday, what risks am I comfortable accepting, and who do I need to keep in the loop about those decisions?

Of course, this needs to be balanced with pragmatism. If there isn't clear value in getting something out to users a few days earlier, then it may not make sense to incur even the potentially limited risk of disrupting the weekend of the person on call, not to mention disruption to users. That said, having Friday deployments as a viable, drama-free option is still an ideal worth striving towards. Even if you opt not to enact the practice, going through the thought exercise of how to get there should help to identify areas to tighten up.

The Litmus Test

So, instead of answering my coworker's question as asked, I shared my personal litmus test for whether or not I would be comfortable merging if the roles were reversed:

  • Would you merge it if you were on call?
    • This is a big one tied back to being considerate. If the answer is no, then maybe answering the next few questions on the list can help to work towards a yes.
  • Is there anything that you'd want to change if you had a few more days that would increase confidence in the rollout?
    • For example, should we be adding more end-to-end, integration, or unit tests? Has there been a manual test session to catch reasonably low-hanging bugs?
  • What is likely to go wrong and are there clear mitigation steps?
    • This is where runbooks are handy: they provide a clear sense of where to look (dashboards, etc.) and who to talk to (comm channels for the systems upon which you depend).
    • In the case of a system redesign or other change with no external dependencies, the best mitigation is just to roll back the change.

If you'd be willing to merge if your own time were at stake, have taken reasonable precautions to mitigate risk, and have a good sense of how to handle things if they don't pan out, then congratulations! Your systems are likely operating well whether or not you choose to deploy. You're well positioned to smash that merge button on Friday and still retain your peace of mind all weekend long 🎉. If, on the other hand, you don't feel good about it, you now have a sense of what work there is to be done.
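
For anyone who likes their checklists in code, here is a minimal Python sketch of the litmus test above. It is only an illustration: the class name, field names, and method names are all made up for this post, and the three yes/no answers are a simplification of questions that usually deserve a conversation rather than a boolean.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class FridayLitmusTest:
    """Answers to the litmus test for one change (hypothetical names throughout)."""

    would_merge_if_on_call: bool        # Would you merge it if you were on call?
    reasonable_precautions_taken: bool  # More tests or a manual test session, as needed
    clear_mitigation_steps: bool        # Runbook, dashboards, contacts, or a rollback plan

    def open_questions(self) -> List[str]:
        """Return the litmus test questions that still have a 'no' answer."""
        gaps = []
        if not self.would_merge_if_on_call:
            gaps.append("Would you merge it if you were on call?")
        if not self.reasonable_precautions_taken:
            gaps.append("Is there anything you'd change to increase confidence in the rollout?")
        if not self.clear_mitigation_steps:
            gaps.append("What is likely to go wrong, and are there clear mitigation steps?")
        return gaps

    def passes(self) -> bool:
        """The change passes only when no question is left open."""
        return not self.open_questions()


# Example: confident in the change itself, but no mitigation plan written down yet.
change = FridayLitmusTest(
    would_merge_if_on_call=True,
    reasonable_precautions_taken=True,
    clear_mitigation_steps=False,
)
for question in change.open_questions():
    print("Still open:", question)
```

The useful part is less the pass/fail result than the list of open questions, which doubles as the list of work to do before the next Friday.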

Wrap Up

For software engineers, there is a sweet spot of accountability somewhere after “it works on my machine 🤷🏾‍♂️” and before taking on a full-on DevOps role (unless that is the role you want). By thinking about ways to make Friday deploys not a big deal, we are really thinking about ways to elevate our engineering practices and shift towards being proactive rather than reactive about managing risks.

Combining an attitude of proactivity about managing risks with thoughtful responses during and after incidents when risks do inevitably actualize leads to an environment of compounding improvements, which is the kind of environment you want to be in.