No one wanted to be a developer working for Southwest Airlines during its 12-hour website outage. The airline was forced to cancel 2,300 flights, delaying 4,500 more and losing an estimated $54 million in revenue -- not to mention gaining a good bit of ill will and bad publicity.
Michael Butt, director of product marketing for Big Panda, based in Palo Alto, Calif., pulled together those statistics -- as well as statistics from other major website outages -- to make the point that organizations and their development teams are simply not thinking through a design strategy that would minimize outages.
His major point: Stop expecting to build something that won't fail; instead, build something that will fail in the least disruptive way.
"There continue to be more significant organizations caught off guard by having outages as severe as they are," he said. "But it's not surprising. All digital transformations are more complex today, and with more change, you should expect the potential for failure. But the disappointing thing is that organizations are not doing more to get ahead of these really major failures."
Big Panda, which provides automatic alerts for events like website outages, has obviously had a lot of experience with things going wrong, Butt said. From his perspective, it is an issue with the high degree of interdependency in most applications combined with the fact that companies don't "build in" resiliency. "What is the organization doing to create more resiliency?" he asked. "When we look at the most progressive web scale companies that build systems for a living, they don't build the entire thing 100% bulletproof. There has to be an expectation that there's going to be some amount of failure."
For developers and testers, that means working with the business side to decide exactly which parts of the application are completely mission-critical and must never fail, and then putting the time, energy and tech effort into them. At the same time, though, it's just as important to decide which areas could be allowed to fail, Butt said. "A dev or test person can't just be handed something with the idea that 100% of it can't fail," he said. "That's just not realistic."
To build with this whole idea of resilient failure, Butt suggested beginning at the architecture level and trying to identify the "tier one" parts of an application that must function no matter what. (To have that conversation, it's useful to remind everyone of what website outages really mean to a business. Butt's data showed that just a 20-minute crash on the popular car service app Uber gave their direct competitor, Lyft, an enormous boost in traffic.)
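The tiering idea above can be sketched in code. This is a minimal, hypothetical illustration -- none of these function or service names come from Butt or BigPanda -- of how a "tier two" feature can degrade to a static fallback when its dependency fails, so the "tier one" flow keeps working:

```python
# Hypothetical sketch of "failing in the least disruptive way":
# a tier-two feature (recommendations) falls back to static content
# instead of taking the whole page down. All names are illustrative.

POPULAR_ITEMS = ["flight-deals", "hotel-bundles"]  # static fallback content

class ServiceUnavailable(Exception):
    """Raised when a downstream dependency is unreachable."""

def fetch_recommendations(user_id):
    # Stand-in for a flaky downstream call; here it always fails
    # to demonstrate the degraded path.
    raise ServiceUnavailable("recommendation service timed out")

def get_recommendations(user_id):
    """Tier-two feature: degrade gracefully rather than propagate the error."""
    try:
        return fetch_recommendations(user_id)
    except ServiceUnavailable:
        return POPULAR_ITEMS  # the page still renders, just less personalized

def render_booking_page(user_id):
    """Tier-one flow (booking) must never depend on tier-two extras."""
    return {
        "booking_form": "ok",  # mission-critical part, untouched by the failure
        "recommendations": get_recommendations(user_id),
    }

page = render_booking_page("u123")
```

The design choice is the point: the exception is caught at the boundary of the non-critical feature, so a failure there is invisible to the booking flow rather than cascading into a full outage.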
Developers and testers need to keep asking questions to home in on what matters the most. "What's the business side complaining about?" he said. "That's how you're going to find out what to spend your time on."