
A complete beginner's guide to DevOps best practices

You can't just 'do' DevOps and hope to get it right. Expert Matthew Heusser takes us through all the steps required to make DevOps work for your company -- and make your life easier.

The term DevOps implies programmers working with operators and testers to automate things. What exactly to automate, and where to start, can be overwhelming.

Today we take a look at how to get started with DevOps as an activity, along with actual technologies and processes to adopt. The DevOps best practices path we suggest is based on improving feedback at every step: faster builds, faster server builds, faster deploys, faster notification of errors and faster recovery.

Most of the examples below assume software designed for a web browser. People building Windows and native mobile applications are probably not going to pursue continuous delivery, but they might still get value by building, testing and deploying more often than the team does now.

Here are some DevOps best practices to keep in mind.

It all starts with mean time to recovery

The traditional measure for improvement was mean time between failures, or MTBF. Trying to fail less often is a fine approach in general. Yet something happens when we try to increase uptime from, say, 99.9% to 99.99%. That extra "nine" adds less than one-tenth of a percentage point to the uptime, while achieving it can easily double the project cost. At some point, the price increase for just one extra nine is not worth the investment.
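To make the cost of each "nine" concrete, here is a minimal sketch that converts an uptime target into the downtime it allows per year. The function name and the 365-day year are assumptions made for illustration.

```python
# Illustrative only: downtime allowed per year at a given uptime target.
HOURS_PER_YEAR = 365 * 24  # 8,760 hours, ignoring leap years


def downtime_hours_per_year(uptime_percent: float) -> float:
    """Hours of downtime per year permitted by an uptime target."""
    return HOURS_PER_YEAR * (1 - uptime_percent / 100)


for target in (99.0, 99.9, 99.99):
    hours = downtime_hours_per_year(target)
    print(f"{target}% uptime allows {hours:.2f} hours of downtime per year")
```

Under these assumptions, the three targets allow roughly 87.6, 8.76 and 0.88 hours per year. Each nine buys a smaller absolute gain than the one before it.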

DevOps teams tend to look at the price of uptime differently. Instead of trying to fail less often, they try to recover more quickly. The algebra runs something like this:

Risk Exposure equals (number of users exposed to problem) multiplied by (how terrible the problem is).

Number of users is correlated with time, so if the team can identify, find and fix the problem in one-tenth the time, they can have five times as many defects escape to production and still have less risk exposure. Better yet, the team could apply this idea to every step of the development process -- finding bugs in requirements and code quickly -- so they can be fixed cheaply.
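The arithmetic can be sketched in a few lines. This is a rough model, not a formal risk calculation: it assumes users are exposed at a steady rate per hour, and every number below is made up for illustration.

```python
def risk_exposure(defects, users_per_hour, hours_to_fix, severity):
    """Risk exposure = (users exposed to the problem) x (how terrible it is).

    Users exposed is approximated as users_per_hour * hours_to_fix,
    summed over every defect that escapes to production.
    """
    return defects * users_per_hour * hours_to_fix * severity


# Hypothetical baseline: 10 escaped defects, each taking 40 hours to fix.
baseline = risk_exposure(defects=10, users_per_hour=100, hours_to_fix=40, severity=1.0)

# Five times as many defects, each fixed in one-tenth the time.
faster = risk_exposure(defects=50, users_per_hour=100, hours_to_fix=4, severity=1.0)

print(f"Exposure relative to baseline: {faster / baseline}")
```

With these made-up inputs the faster team carries half the exposure of the baseline, even though five times as many defects escaped.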

The pieces of a better mean time to recovery (MTTR) are typically the build, deploy, notice and notify, and fix processes. Teams can be more or less advanced at each of these. One e-commerce project I worked on recently could perform a new build in about 15 minutes, but took about four hours to roll a change to production. The entire loop, from build to running in production, took about five hours.

Build and verify a build

How long does it take to create a build, deploy it to a staging environment, check it for problems and mark it ready for production? The build server here is the easy part; getting it to move to staging automatically can be a problem. Many teams have legacy systems where a change in one place could have unforeseen consequences, so they have a "regression test process" to find problems. An automated regression check might mean it takes an hour to bless a build; some human processes take weeks or months. In many cases, the team can see a massive improvement by writing better code -- so there are fewer errors -- while switching to a more effective method of human testing, such as sampling adjusted for risk.

Deploy to production

Once the deploy decision is made, how long does it take to actually get the build into production? Some legacy systems may have hard requirements for this -- systems need to be turned off, files need to be copied by FTP and coordination needs to happen on multiple machines. Yet even those steps as they exist can be scripted, automated and done by the technical staff at the push of a button, instead of requiring a ticket and a hand-off. Note that scripting the existing steps might not be the best approach. Instead, the team might start by deploying to some percentage of users or servers, or, perhaps, develop a new architecture to make deploys more seamless.
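One common way to expose a deploy to only a percentage of users is to hash each user into one of 100 buckets and send the lowest buckets to the new version. The sketch below assumes users are identified by a string ID; the function name `in_rollout` is hypothetical and not part of any particular tool.

```python
import hashlib


def in_rollout(user_id: str, percent: int) -> bool:
    """Return True if this user falls within the first `percent` of 100 buckets.

    Hashing makes the decision stable: the same user always lands in the
    same bucket, so raising `percent` only ever adds users to the rollout.
    """
    digest = hashlib.sha256(user_id.encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % 100
    return bucket < percent


# Hypothetical usage: route a user to the new build at a 10% rollout.
version = "new" if in_rollout("customer-1234", 10) else "current"
print(version)
```

Because the buckets are stable, the team can raise the percentage step by step and pull it back to zero if the monitoring described below shows trouble.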

Notice and notify

Once a bug escapes to production, how long does it take to be found? Again, this is something to measure by looking at the last handful of serious bugs, when the builds were deployed and when the bugs were reported in a way that could be fixed. "There is a problem with payments for some customers," for example, is not actionable feedback.

Using DevOps best practices, take a hard look at those bugs, and you might find some common elements. For example, the bugs might involve long delays of page loads or 500 or 404 errors on the server. Most teams pursuing DevOps try to add real-time monitoring -- through dashboards -- of elements that are leading indicators of problems. Email alerts of problems, monitoring server health and "report problems" links on the website are all ways of getting notice of problems as soon as possible.
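As a minimal sketch of such a leading indicator, the check below looks at a window of recent responses and flags trouble when too many are 404s or 5xx errors, or when any page load is slow. The thresholds (5% errors, three seconds) and function names are assumptions; a real team would tune them and feed the result into its own dashboard or alerting tool.

```python
def error_rate(status_codes):
    """Fraction of responses that are a 404 or any 5xx server error."""
    if not status_codes:
        return 0.0
    errors = sum(1 for code in status_codes if code == 404 or 500 <= code <= 599)
    return errors / len(status_codes)


def should_alert(status_codes, load_times_seconds,
                 max_error_rate=0.05, max_load_seconds=3.0):
    """Decide whether a window of recent traffic deserves an alert.

    Both thresholds are illustrative defaults, not recommendations.
    """
    too_many_errors = error_rate(status_codes) > max_error_rate
    too_slow = any(t > max_load_seconds for t in load_times_seconds)
    return too_many_errors or too_slow


# Hypothetical window: one server error in four requests, all pages fast.
print(should_alert([200, 200, 500, 200], [0.4, 0.6, 0.5, 0.3]))
```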

The fix process

Once the bug is found, how long does it take to get a fix ready to build? This is often a human process. Some group needs to meet to promote the bug to serious, then another person can take that bug on to fix -- assuming they are allowed to add the bug to this sprint. In classic Scrum, the bug would be added to the next sprint's backlog, adding two weeks or more of accidental delay to the fix process. That delay might be needed to make sure the team is not overwhelmed. In that case, "Why are so many bugs coming through that we need to triage them and delay them?" sounds like a reasonable question to ask.

To improve MTTR, take a look at the elements of the loop and find the element that can improve the most with the least effort. Then go after it, and all the other DevOps best practices.
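One rough way to pick that element is to estimate, for each stage of the loop, how much time could be saved and how much effort it would take, then rank by time saved per unit of effort. In the sketch below, the 15-minute build and four-hour deploy echo the e-commerce project mentioned earlier; every other number is a hypothetical estimate.

```python
# stage: (current_minutes, achievable_minutes, effort_days)
# Only the current build (15) and deploy (240) times come from the example
# above; the remaining figures are made up for illustration.
stages = {
    "build": (15, 10, 5),
    "deploy": (240, 30, 10),
    "notice and notify": (120, 15, 8),
    "fix": (480, 240, 20),
}


def payoff(estimate):
    """Minutes saved per day of effort invested."""
    current, achievable, effort_days = estimate
    return (current - achievable) / effort_days


best = max(stages, key=lambda name: payoff(stages[name]))
print(f"Start with: {best}")
```

With these estimates the deploy step comes out on top, at 21 minutes saved per day of effort. The ranking is only as good as the estimates behind it.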

Before we leave you, a word of caution.

Don't give up on mean time between failures

It's tempting to "buy and install" DevOps; just plug in automated builds, use virtualized servers that can deploy on command, add a dash of monitoring and call it done. This is an approach that can reduce recovery time and will encourage teams to do a great many deploys quickly.

Except those deploys will require a lot of fixes. The fixes will require fixes. At some point, using the mantra "move fast and break things," it is possible to actually break all the things. The stability of the system decays, and, eventually, each new change seems to only make things worse.

In order to deploy often, we have to have high first-time quality -- that is, we want our software to be in good shape not only before it goes out, but at each step along the way. In order to keep up with the pace of accelerated delivery, applications need to have fewer bugs. That would mean the testers get to spend less time in bug tracking systems and retesting. Programmers need to get examples that are detailed enough that they will build the right thing and make fewer mistakes. All of this means a reasonably high MTBF, with failures reasonably far apart, is a prerequisite to DevOps best practices, or, at least, something to develop in parallel. The time to shift is when you've hit a wall, when adding another level of reliability will cause an explosion of cost while making delivery slower.

Sometimes, the opportunity to increase reliability is there, but the team does not know how, because they lack the skills. Many modern development techniques like test-driven development or exploratory testing require skills development. Others, like a component architecture or continuous integration, require an investment of time and probably money before they will show benefits.

Getting started

To really utilize DevOps best practices, take a hard look at a handful of recent failures in production. Figure out both how far apart they are and how long they took to get fixed. Ask your team what the next step is in improvement (longer time between failures or shorter time to recovery), what the bottleneck in the process is, and where the team could have the most improvement for the least effort. The solutions that involve developers, testers and operations working together to automate the flow of the work and enable self-service -- getting rid of steps where you have to ask someone to move a file or check something by hand -- are the DevOps steps.
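Both numbers can be pulled from a simple incident log. The sketch below assumes each incident is recorded as a pair of timestamps (when the failure started, when the fix was live); the dates are made up. It measures time between failures from one failure start to the next, which is a simplification.

```python
from datetime import datetime, timedelta

# Hypothetical incident log: (failure started, fix live in production)
incidents = [
    (datetime(2016, 3, 1, 9, 0), datetime(2016, 3, 1, 14, 0)),
    (datetime(2016, 3, 15, 9, 0), datetime(2016, 3, 15, 12, 0)),
    (datetime(2016, 3, 29, 9, 0), datetime(2016, 3, 29, 13, 0)),
]


def mean_time_to_recovery(incidents):
    """Average time from failure start to fix, as a timedelta."""
    total = sum((fixed - started for started, fixed in incidents), timedelta())
    return total / len(incidents)


def mean_time_between_failures(incidents):
    """Average gap between consecutive failure starts (needs 2+ incidents)."""
    starts = sorted(started for started, _ in incidents)
    gaps = [later - earlier for earlier, later in zip(starts, starts[1:])]
    return sum(gaps, timedelta()) / len(gaps)


print("MTTR:", mean_time_to_recovery(incidents))
print("MTBF:", mean_time_between_failures(incidents))
```

For this made-up log, recovery averages four hours and failures arrive 14 days apart, which gives the team a baseline to improve against.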

Next, figure out what it will take to get there. Come up with epics for each release -- automating the build and deploy process, including creating virtual servers on demand. Each epic should have clear results, such as "lower the time to deploy a build to staging from three hours to under 15 minutes." Break the epics into stories and present the epics to management.

If you are management, the DevOps best practices process is a little easier. Discuss what is possible with the technical team in broad strokes, communicate the vision, come up with the epics collaboratively and then ask the team how they are going to accomplish the objective of each epic. Along the way, you might search for tools or training or support, and that's okay. Just start with the "what" and then ask "how" in order to roll out DevOps step by step. The tricky part will be deciding how to fund the project when the busy business of production is calling your name. One way to do that is to dedicate some percentage of effort; one team I worked with dedicated 20% of the work effort to small projects that were important, yet were being drowned out by major company initiatives. Pick your percentage, be prepared to defend it and get going.
