Years ago, I worked for an organization that prided itself on its internal controls for software. Using a work order, developers had to get every box checked and then outline the steps to move the code to production. If there was a problem, systems would be down, and any "fix" would require a similarly rigorous -- and documented -- process. That single-minded focus on reliability meant we had to batch changes together into projects and roll out less often. It became a vicious cycle.
But then came DevOps. I don't mean DevOps in the fancy CI/CD way, though. In this case, it's about dev and ops focused together on what matters, which, as is clear from the above example, cannot be reliability. It has to be software resilience.
A few years ago, my friend Noah Sussman suggested that instead of reliability, software systems should focus on resilience. Where reliability is focused on preventing failure, software resilience is concerned with making sure a single failure does not take down the whole system.
Resilience in an e-commerce application
To understand this, look at Amazon's front page. Have you noticed it seems a bit like a group of boxes put together like Legos? There is a title bar with links to your profile and your orders. Underneath that, there are lists: "Fun gift ideas under $10," "Things you browsed recently," "Your recommendations" and so on.
Each of those containers is a combination of display code and a web service call. If the web service is down -- perhaps because a new change is rolling out this very second -- then the box does not display. The Amazon homepage continues to function perfectly well, and a customer likely will not even notice the absence of the box. That is a classic example of software resilience.
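To make the idea concrete, here is a minimal sketch in Python of a page built from independent boxes. It is not how Amazon actually builds its homepage; the function names (render_box, render_page, recommendations, gift_ideas) and the simulated outage are assumptions made up for illustration. The point is only that each service call is wrapped so that its failure removes one box rather than the whole page.

```python
from typing import Callable, List, Optional, Tuple

# Hypothetical type: a function that calls one web service and returns items.
Fetcher = Callable[[], List[str]]


def render_box(title: str, fetch_items: Fetcher) -> Optional[str]:
    """Render one box, or return None if its backing service fails."""
    try:
        items = fetch_items()
    except Exception:
        # The failure is contained here: this box is skipped,
        # and the rest of the page is unaffected.
        return None
    return f"[{title}: {', '.join(items)}]"


def render_page(boxes: List[Tuple[str, Fetcher]]) -> str:
    """Render every box that is available and silently drop the rest."""
    rendered = (render_box(title, fetch) for title, fetch in boxes)
    return "\n".join(box for box in rendered if box is not None)


def recommendations() -> List[str]:
    # Stand-in for a healthy web service.
    return ["book", "lamp"]


def gift_ideas() -> List[str]:
    # Stand-in for a service that is down, for example mid-deploy.
    raise ConnectionError("service unavailable")


page = render_page([
    ("Your recommendations", recommendations),
    ("Fun gift ideas under $10", gift_ideas),
])
print(page)
```

In this sketch the page still renders the recommendations box even though the gift ideas service raised an error. A real system would also log or count the failure inside the except block so that monitoring can pick it up.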
Combine that with an effort to reduce mean time to recovery, and it becomes clear why the industry now invests more energy in monitoring, so problems are found quickly, and puts a new emphasis on smaller changes.
Let's put that together to define software resilience:
- We need to stop fixating on reliability.
- We must build systems that are resilient.
- To do that, our focus should be on smaller changes, which are more easily rolled back.
- When we deploy, we must roll out only a tiny part of a system at any given time.
- Our system must be designed so the failure of one component does not bring everything to a halt.
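The last three points can be sketched together in a few lines of Python. This is an illustration under stated assumptions, not a real deployment tool: the deploy_one function, the versions dictionary, the component names and the health-check callbacks are all hypothetical. It shows one small component being changed at a time, with an automatic rollback to the previous version if a health check fails.

```python
from typing import Callable, Dict

# Hypothetical health check: given a component and a version, is it healthy?
HealthCheck = Callable[[str, str], bool]


def deploy_one(
    versions: Dict[str, str],
    component: str,
    new_version: str,
    healthy: HealthCheck,
) -> bool:
    """Deploy a single component; roll it back if the health check fails."""
    previous = versions[component]
    versions[component] = new_version
    if healthy(component, new_version):
        return True
    # Small change, easy rollback: restore only this one component.
    versions[component] = previous
    return False


# Made-up components and version numbers, for illustration only.
versions = {"recommendations": "1.4", "gift-ideas": "2.0"}

# A change that passes its health check stays in place.
ok = deploy_one(versions, "recommendations", "1.5", lambda c, v: True)

# A change that fails its health check is rolled back on its own,
# without touching the component that was deployed successfully.
bad = deploy_one(versions, "gift-ideas", "2.1", lambda c, v: False)

print(ok, bad, versions)
```

Because each call touches exactly one entry, a failed rollout leaves every other component at the version it already had.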
If we can accomplish that, we've found the prescription for software resilience. And now we can get to work -- without a work order.