Poorly designed environments and QA automation scripts impede digital transformation. In this podcast, we discuss how to stabilize a quality signal and implement continuous improvement.
It's easy to think of test automation simply as a means to remove time- and labor-intensive QA tasks. However, organizations often fail to realize the full benefits of QA automation and miss the chance to fundamentally transform their development processes to build in quality from the start.
Gary Gruver made that argument in his book Engineering the Digital Transformation. Through his consulting work, Gruver found, time and time again, that enterprises that embrace test automation often wait to do all of their QA work later in the development cycle, just as they did when those tests were manual. In this episode of the Test & Release podcast, Gruver argues that enterprises must ask themselves, "How do we use that new capability to completely transform how the organization works?"
In addition to his advice on QA automation, Gruver explains the importance of stable quality signals and how organizations can take steps to eliminate failing and flaky tests.
He also gives listeners advice on how to instill more "systematic rigor" into continuous improvement efforts, as well as why an organization should make its business goals more apparent to developers and testers. When presented with clear business metrics that support the intent behind an app, organizations can reduce wasted time and ensure they're creating software that will actually see the light of day.
"What we found in the industry is, a lot of times, 50% of what we've developed is either never used or doesn't meet the business intent," Gruver says. "So, if we're going to do continuous improvement of the product, we need to be much clearer about what the business intent of the product is, and we need to do a much better job of measuring which features are being used."
Editor's note: Gruver spoke with site editor David Carty and assistant site editor Ryan Black. The transcript has been lightly edited for clarity and brevity.
First, your book is all about digital transformation, which is a big, nebulous effort. Can you tell me maybe three key areas where you see organizations struggle with digital transformation?
Gary Gruver: You know, when I think of digital transformation, a lot of it is, yeah, they're trying to transform digitally. But you're also trying to transform how you do things digitally. And this isn't so much about what to transform. It's more about: How do you get more effective in terms of transformation? Or how do you get more effective in terms of software development delivery? Because there's no shortage of good ideas. There's just really a shortage of people being able to get things done fast enough, quick enough, efficiently enough for the dollars they have, and a lot of organizations struggle with that. There's a lot of different ideas out there from the Agile community, the DevOps community. They say, 'Do these practices, and you'll get better.' What I found is that I'm in a somewhat unique industry in that I'm a thought leader, but also, when I work with organizations, I don't work with them unless I can stay with them through the journey and help coach them along the way. That gives me a couple of advantages.
One is I don't know exactly what to tell them up until they start running into problems. Two is that I get to learn from them along that journey. So, I get to see what's working and what's not working. What I found is, if people just go off and do what works for somebody else, without understanding their unique challenges and issues, they're not really going to see the benefits that are possible, and they're going to lose momentum in transforming how they do software development [and] delivery. This is trying to provide a systematic approach for going in and analyzing your unique challenges: Being able to prioritize it, being able to get everybody in your organization on the same page so that you know the types of things that you want to go fix, and then being able to measure the impact that a change has so you can have momentum and continue to get people to commit resources to improving over time.
You mentioned your work with clients. I had a question to that effect. I was wondering: What sort of common software quality mistakes were you seeing among your clients, and how did that inform the realization that you came to, that a much more systematic approach to software quality was necessary?
Gruver: I saw a lot of people struggling. If you look at just the software quality vector of this, which is a pretty thin slice of the book, what I see is: People did test automation, but they really had to transform how they did work. And I think the really classic example of this is [Eliyahu] Goldratt's [book], Beyond the Goal. ...He pulls it out of one of the first things computers did, [which] was MRP systems, which is manufacturing resource planning. And it used to be, in a factory, that you'd have 300 people [and] would have 40 people just to do the planning, which is what parts should we order, when should we build stuff, when should we ship stuff, based on demand and inventory and everything else. When computers came out in the '80s, that was one of the first applications. Black & Decker automated that whole process and enabled them to run it on a more frequent basis. They ran with huge amounts of lower inventory, they had better availability and they were just dominating the market. So, everybody else went out and started doing MRP systems, and they didn't see any benefits.
What Goldratt found when he did the research is [there was] an inherent rule in the system that hadn't been fundamentally changed. That was: Everybody ran, in the old system, the MRP planning once a month, because it was expensive and hard and took a lot of time. When they automated, it was no longer expensive or hard and didn't take a lot of time, but they kept that inherent rule that they just didn't run it very often. So, as organizations started automating their testing, what you found is they still wait until they get to some phase of the development process -- call it 'development complete' or something else. Then, they put all the code together, and they start inspecting quality. So, what they've done is basically what the MRP adopters did: Those companies automated the planning process; these organizations have automated the testing process. But what Black & Decker did differently is they ran it several times a week instead of once a month. So, they used that new capability to fundamentally transform how the organization works.
What I found is a lot of organizations really don't take that step of fundamentally changing how the organization works with test automation. They're still running test automation the same way they did with the same rules -- that [they] wait and do it later. The big change that we need to do is we need to start shifting to: How do we use that new capability to completely transform how the organization works, like Black & Decker did, instead of just automating the manual process to make that process more efficient? That's one of the biggest mistakes that I see. The other challenge is, when people start to do that and start to build in quality, what I find is they do that before they have a stable quality signal, and it just creates a lot of chaos in the process.
I think that's an interesting idea. You mentioned the idea of a quality signal in the book and in developing a stable quality signal. Could you explain a little bit more about what that means and some of the inherent challenges that go into that?
Gruver: When you run your test automation, a lot of times, you will find stuff that's not a code issue. You will run into an environment problem, you will run into a deployment problem, you run into a test problem, you run into all sorts of other things. If you expect people to be responding to this signal but you have a bunch of things that are outside of their control, they'll find that they're wasting time going through the whole triage debugging process, and they can't keep up. They can't do it.
The other thing is, when you really start to build in quality, you're going to be running your testing 10 times, 100 times more frequently than you do when it's manual. If you have any flakiness in that system, you're just going to get bogged down in false failures. I've seen a lot of organizations try this; they look at trying to do DevOps, they say, 'Well, we're going to focus on red builds; we're going to do continuous integration.' And, as they start to do that, if they haven't taken the time to really ensure that they have a stable quality signal, their transformation is going to bog down. People are going to give up. It's no longer worth trying. They really need to back up and change. Having watched a bunch of different organizations run into this for different reasons, one of the real motivations for writing the book is saying, 'No, no. We need to step back and take a much more systematic approach and ensure that we have a stable quality signal.'
So, the first step is: If you've got a good set of automated tests, you've written some, you put some together, pick a spot on your deployment pipeline -- ideally, as far to the right, as close to production as you can -- that you have control over and enough influence to change. For people on a small team, that may be your build system, but for leaders of organizations that really have control over the broader system, the further right you look and the better you understand the issues that are hitting there, the more impact that improvement has because [there are] more people that have code in those environments they're trying to affect.
When you're saying, 'work as far to the right as possible,' are you talking about shift-right testing, then?
Gruver: Well, I'm talking about the deployment pipeline. So, the framework that I really think about [in] the deployment pipeline is how we check in code, how we build the code, how we create an environment, how we deploy into that environment, how we run a test and how we make sure it's ready for release. Jez Humble and David Farley came out with that in their [Continuous Delivery: Reliable Software Releases through Build, Test, and Deployment Automation] book a decade or so ago. That framework, I think, is how we manufacture software, and we need to make that visible.
So, if you look at some of the earlier chapters of the book, the first step is really making that factory visible. The idea is to look at the steps -- all the different environments that you go through and all the different stuff that you go through, from code being written to being ready for production -- and, if you go further to the right on that, you tend to have more code, and you tend to have more people influenced because you've got a bunch of different subcomponents that get aggregated together.
When I say, 'to the right,' I'm saying the closer you get to production, the more people are impacted by instability in that system. Ideally, you really want to shift left with your testing because you want to find a defect as close to the source as you possibly can, that has the fewest number of commits. Does that help clarify that?
It does. Yeah, thank you. I was wondering if you could also lay out maybe how a team could go about changing its processes -- something you mentioned is of great importance in the approach you're advocating for -- but also go about ensuring a stable deployment pipeline.
Gruver: So, take an environment, and take your automated tests. Run them 20 times in a row, and stick the results in a database. What I found is that you would expect, if you ran the same test in the same environment on the same code 20 times in a row, you'd always get the same answer. The reality that I see in most organizations is you have some tests that always pass; you'll have some tests that always fail because there's a defect in the system. But, more and more, organizations have tests that will toggle between pass and fail. If you've got those in the system and you're expecting developers to respond to it, they're going to get frustrated and give up because it's a flaky signal. They can't trust it, they can't rely on it and it's not related to the code that they checked in. Once you've segmented them, take those flaky tests and set them aside until they get reworked, and set aside the tests that are always failing until you get the defect fixed.
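Editor's note: The first experiment Gruver describes, running the same tests repeatedly and sorting them by how they behave, can be sketched in Python. This is a minimal illustration under stated assumptions, not code from the book: the function names, the `results` table layout and the three sample checks are all hypothetical, and an in-memory SQLite database stands in for whatever results store a team actually uses.

```python
import sqlite3


def run_repeatedly(tests, runs=20, db=None):
    """Run each test `runs` times and record every outcome in a results table."""
    if db is None:
        # Assumption: an in-memory database stands in for a real results store.
        db = sqlite3.connect(":memory:")
    db.execute(
        "CREATE TABLE IF NOT EXISTS results (test TEXT, run INTEGER, passed INTEGER)"
    )
    for run in range(runs):
        for name, test in tests.items():
            try:
                test()
                passed = 1
            except Exception:
                passed = 0
            db.execute("INSERT INTO results VALUES (?, ?, ?)", (name, run, passed))
    db.commit()
    return db


def classify(db):
    """Sort tests into always-passing, always-failing and toggling buckets."""
    buckets = {"stable_pass": [], "always_fail": [], "flaky": []}
    query = "SELECT test, SUM(passed), COUNT(*) FROM results GROUP BY test ORDER BY test"
    for name, passes, total in db.execute(query):
        if passes == total:
            buckets["stable_pass"].append(name)
        elif passes == 0:
            buckets["always_fail"].append(name)
        else:
            buckets["flaky"].append(name)
    return buckets


# Hypothetical sample checks: one healthy, one with a real defect, one that toggles.
_calls = {"count": 0}


def check_ok():
    assert 1 + 1 == 2


def check_broken():
    assert 1 + 1 == 3


def check_toggles():
    _calls["count"] += 1
    assert _calls["count"] % 2 == 0  # fails on every other call


SAMPLE_TESTS = {
    "check_ok": check_ok,
    "check_broken": check_broken,
    "check_toggles": check_toggles,
}

print(classify(run_repeatedly(SAMPLE_TESTS)))
```

In this sketch, only the tests in the `stable_pass` bucket would move on to the next experiment; the other two buckets are set aside for rework or defect fixes.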
Now, take your tests that you know passed 20 times in a row, and now run them in random order, and see if you can get them to pass. What happens is, frequently, those tests, when you do them in random order, won't pass because there's some sort of order dependency -- a test run before it will set up the data in some way that the next test is dependent upon. If you're going to run a bunch of automated tests in parallel, you can't have that order dependency. So, run those 20 times in a row, and then take the ones that are flaky that toggle between pass and fail, and move those into the stack of flaky tests to be reworked later.
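Editor's note: The random-order experiment can be sketched the same way. This is a hedged illustration, not Gruver's tooling: `find_order_dependent`, the shared `state` dictionary and the three sample checks are hypothetical names chosen for the example. A fixed seed is an added assumption so that a failing order can be replayed.

```python
import random


def find_order_dependent(tests, make_state, runs=20, seed=0):
    """Run the tests in a fresh random order each time and count failures per test."""
    rng = random.Random(seed)  # fixed seed so a failing order can be replayed
    names = list(tests)
    failures = {}
    for _ in range(runs):
        rng.shuffle(names)
        state = make_state()  # fresh shared data for every run
        for name in names:
            try:
                tests[name](state)
            except Exception:
                failures[name] = failures.get(name, 0) + 1
    return failures


# Hypothetical sample checks that share one state dictionary.
def create_user(state):
    state["user"] = "alice"


def read_user(state):
    # Hidden dependency: only passes if create_user already ran in this run.
    assert state.get("user") == "alice"


def add_numbers(state):
    assert 2 + 3 == 5


SAMPLE_TESTS = {
    "create_user": create_user,
    "read_user": read_user,
    "add_numbers": add_numbers,
}

print(find_order_dependent(SAMPLE_TESTS, dict))
```

The three sample checks pass every time in their declared order, which is why the dependency stays hidden until the order is shuffled. Any test that shows up in the returned dictionary would join the stack of flaky tests to be reworked.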
Then, take the tests that are still passing 20 times in a row. Now, we're going to take those tests, and we're going to run them in parallel against the environment and put that environment under load. What you frequently find is that organizations that had been doing manual testing against these environments, when you load them up with a lot of test automation all at once, they can't handle the load. [It] may be that the environments [are] too small. It may be you've got a flaky F5 in the system. You may have a flaky connection to your build server, your Git repository or something like that, [which] causes that not to go well and for your test to time out. Take the time to get that to root cause, drive that into resolution, drive any of those issues out of the system because, if you expect people to respond to it and you've got flakiness in the system, people are going to give up. [They will] disengage; it's going to go bad.
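Editor's note: To show why tests that pass one at a time can fail under parallel load, here is a small Python sketch. Everything in it is an assumption made for illustration: `FakeEnvironment` simulates an undersized test environment with a semaphore, and the capacity and timing values are arbitrary. A real run would point the same kind of thread pool at an actual environment.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor


class FakeEnvironment:
    """Hypothetical environment that serves only `capacity` requests at once."""

    def __init__(self, capacity=2, work_seconds=0.2, timeout_seconds=0.05):
        self._slots = threading.Semaphore(capacity)
        self.work_seconds = work_seconds
        self.timeout_seconds = timeout_seconds

    def request(self):
        # A caller that cannot get a slot in time sees a timeout, not a code defect.
        if not self._slots.acquire(timeout=self.timeout_seconds):
            raise TimeoutError("environment overloaded")
        try:
            time.sleep(self.work_seconds)  # simulated work inside the environment
        finally:
            self._slots.release()


def run_under_load(test, copies, workers):
    """Run `copies` instances of a test on `workers` threads; return the failure count."""

    def attempt(_):
        try:
            test()
            return True
        except Exception:
            return False

    with ThreadPoolExecutor(max_workers=workers) as pool:
        outcomes = list(pool.map(attempt, range(copies)))
    return outcomes.count(False)


env = FakeEnvironment()
print("failures when run serially:", run_under_load(env.request, copies=3, workers=1))
print("failures when run in parallel:", run_under_load(env.request, copies=8, workers=8))
```

The test itself never changes between the two runs; only the load does. Failures that appear solely in the parallel run point at the environment rather than the code, which is the kind of issue Gruver says to drive to root cause.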
Now, if you've got that solid, you've got your environment to where it can handle a load, the next thing that we want to do is make sure that our deployment process is stable. Do the deployment, run all your tests, keep those results. Do another deployment. And this is in the same environment. We're just testing our deployment process to see if we can repeatedly and reliably deploy the code consistently. If you have a problem there, then you need to automate your deployment process. You need to clean it up; you need to work on that stability. Then, we're going to do the same thing for creating an environment. We're going to create an environment. We're going to deploy the code. We're going to run our automated tests.
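Editor's note: The repeated-deployment check can be sketched as a loop that deploys, runs the tests and compares each result set with the first one. This is an illustrative sketch only: `check_deployment_stability` and `FakeTarget` are hypothetical, and the fake deployment that drops a config value on every third run is an invented failure mode used to make the example deterministic.

```python
def check_deployment_stability(deploy, run_tests, cycles=5):
    """Deploy and test repeatedly; return tests whose result differs from the first cycle."""
    baseline = None
    unstable = set()
    for _ in range(cycles):
        deploy()
        results = run_tests()  # expected shape: {test name: True or False}
        if baseline is None:
            baseline = results
            continue
        for name, passed in results.items():
            if baseline.get(name) != passed:
                unstable.add(name)
    return sorted(unstable)


class FakeTarget:
    """Hypothetical environment whose deployment loses a config value every third run."""

    def __init__(self):
        self.deploys = 0
        self.config = {}

    def deploy(self):
        self.deploys += 1
        self.config = {"db_url": "db://test"}
        if self.deploys % 3 == 0:
            del self.config["db_url"]  # simulated unreliable deployment step

    def run_tests(self):
        return {
            "check_homepage": True,
            "check_config": "db_url" in self.config,
        }


target = FakeTarget()
print(check_deployment_stability(target.deploy, target.run_tests))
```

Because the code and the tests are identical in every cycle, any test in the returned list changed its result only because of the deployment. The same loop could wrap environment creation by passing a `deploy` callable that first creates the environment and then deploys into it.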
When you get that stable, then you know you've got a stable quality signal, and you can use that to really start changing how your business responds. It would be like if the MRP people wanted to fundamentally transform and started running their tests on an ongoing basis or running the planning multiple times a week. But, [if] the planning was inaccurate, it wouldn't be very effective. So, before we can fundamentally change anything about how we do software development delivery, one of the biggest changes is this ability to build in quality. But you can't expect your developers to build in quality if you've got a flaky signal.
I wanted to take that to the next step, too, and talk about continuous improvement, which is another element of quality. Later on in your book, you say there's a need for the industry to create a 'culture of continuous improvement' for both the process and the product. How can organizations carve out time and mental energy to focus on continuous improvement, especially as so many of these organizations feel the pace of development quicken?
Gruver: This is probably one of the biggest challenges out there. What happens in manufacturing is you have one organization that's responsible for designing the product and another organization that's responsible for building in quality and optimizing throughput and flow through the organization. So, the manufacturing part of the organization is really focused on continuous improvement, and the development part of the organization is focused on building it once and doing it right once. In software, the way we manufacture the product has a huge impact on the productivity of our development because, every time we make a code change, we're running through that manufacturing process. If we can keep the code close to releasable, we can become much more productive.
To do that, we really need to be able to show people that, if we spend time sharpening the saw, we're going to be able to get more done in the long run. Because what you're doing is: You're asking people to compete with features, with ideas for process improvement, and that's really hard for most organizations. I tend to start with the leaders of the organization, and we map out, and we go through all the steps defined in the book Engineering the Digital Transformation. [We] truly make sure that we're making our software development and our manufacturing process very visible. We're putting metrics on it that highlight and quantify the waste in the system. And, when we make a change to make an improvement, we have some metrics that we can quantify and show that, 'You know, David, this was really great. You gave me some money to invest in this, and let me show you what I got out of it.' In software, we tend to do things like: We'll go do Agile, or we'll go do DevOps, and we really don't have a way of going back to the people that we talked into helping us make that investment and showing them the return from that investment. If we're going to continue to get people to do it, first, we need to make the waste visible and show them how, if we eliminate this waste, they'll get more of what they want in the long term, which is business value. And, if you can't show them that, you can't quantify it, it's going to be really hard to get people to commit to it, unless you just happen to have a leader that's gone to an Agile conference or gone to a DevOps conference and came back and said, 'We're going to go do DevOps, or we're going to go do this.'
What I find most effective is you really need to get it quantified down to, 'Here's the stuff [that] is getting in our way. Here's how it's slowing us down. It's quantified with metrics; we can show exactly where it is. And we can prioritize the improvements that will have the biggest impact on our business and free up the most capacity for doing improvements.' Then, once we've done that, you've committed to let me make the change, I need to be able to go back to you and show you what I got out of it and why the organization is better. That's not something that we do very well in software because it's kind of hard to measure.
This is another idea that you get into in your book -- just how important it is to match applications to business intent. What do you think is the best way for the business side to better equip developers with information and metrics to achieve [those] goals?
Gruver: So, that's continuous improvement in the product. The interesting thing about software is, in manufacturing, you release a product ... and then you try to manufacture it efficiently. Software has got this unique capability where you can constantly learn and evolve with the product and change it. You're not worried about finished goods inventory, or channel inventory or changing the product in the field because everybody can update. That's very unique. So, we talked a little bit about how do you change the process and improve the process, but when you look at the product, there's an opportunity to do that. Jez Humble and Joanne [Molesky] and [Barry O'Reilly] came out with the Lean Enterprise book a while ago that said, really, what we need to be doing with the product is we need to be thinking about testing [a] hypothesis about what will make it more effective and what will make it more productive in the business.
What we find in a lot of organizations is the organization is just responding to a marketing person's or a high-level executive's idea of, 'These are the features we gotta do; let's go do them.' What we found in the industry is, a lot of times, 50% of what we've developed is either never used or doesn't meet the business intent. So, if we're going to do continuous improvement of the product, we need to be much clearer about what the business intent of the product is, and we need to do a much better job of measuring which features are being used. Once you're meeting the business intent, we need to get much quicker iterations and feedback on that. Frequently, what we do is: We'll do large releases with what the marketing organization asked for, and we'll never go back and measure whether it had that impact.
In manufacturing, they've come up with a process; it's called the A3 process chart for process improvement. What I was hoping to put out there for people to start using is: Let's try to bring some of that systematic rigor to how we do product improvement for software. Let's get a chart where we capture all the business intent metrics because what we're going to do is not just engage a couple of thought leaders at the top to figure out how to continuously improve the product, but we're trying to engage the entire organization in thinking about how they can continually improve the product and the process. To do that, we need to make it much more visible about what we're trying to accomplish as an organization.
If we're a website and we've got a search team that's trying to find you the best product that you want to buy, you probably want to have the ability to say, 'Of the people that come to the website, how many people make it all the way to checkout?' That's an indicator that they're finding what they need. Maybe it's the number of steps that it took them to get there. What are those types of metrics that you want to track that you [are] really trying to accomplish with your software? The more you can make that visible to everybody in the organization and you can track not just, 'Did we deliver this new capability?' but, 'Did it have an impact on the business metric or the business intent that we're trying to accomplish as an organization?' Because we're not just trying to deliver as many features as we can, we're trying to deliver business value, and we need a metric that's visible in the organization that we're tracking to see if we've had that impact. And we need to figure out whether people are even using the metrics of the new features that we're putting out there. Not a lot of people measure those things. If we're going to engage the entire organization [in] continuous improvement, we need to make that visible.
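Editor's note: The two metrics in the website example, the share of visitors who reach checkout and the number of steps it took them, reduce to simple arithmetic. The Python sketch below assumes a hypothetical data shape in which each session is an ordered list of page names; the session data and the `funnel_metrics` name are invented for illustration.

```python
def funnel_metrics(sessions, goal="checkout"):
    """Compute the share of sessions that reach `goal` and the average steps to get there."""
    total = len(sessions)
    converted = [pages for pages in sessions if goal in pages]
    rate = len(converted) / total if total else 0.0
    # Steps are counted up to and including the first visit to the goal page.
    steps = [pages.index(goal) + 1 for pages in converted]
    average_steps = sum(steps) / len(steps) if steps else None
    return {"checkout_rate": rate, "avg_steps_to_checkout": average_steps}


# Hypothetical sessions; a real system would read these from analytics data.
SAMPLE_SESSIONS = [
    ["home", "search", "product", "checkout"],
    ["home", "search"],
    ["home", "search", "search", "product", "cart", "checkout"],
    ["home"],
]

print(funnel_metrics(SAMPLE_SESSIONS))
```

Tracking these numbers before and after a release is one way to ask whether a new search feature affected the business intent, rather than only whether it shipped.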
We're just about running to the end of our time here, Gary, but I did want to circle back and ask you about test environments real quick. So, if you could recommend a few best practices for test environments -- particularly to help organizations create stable and optimal ones -- what would be some of the broad-level best practices you'd recommend?
Gruver: I would say your challenges are going to be unique to your applications and your deployment pipeline. Run the experiments that I talked about to get to a stable quality signal, and address the issues that are keeping you from moving forward. I don't like, 'Gee, everybody does this, and it's going to fix [our] problems.' You need to go through and do that. A lot of organizations that start off doing the experiment can't even run it because it's so hard to set up their data. One of the things that you'll find as you go do that process and work through it is you're going to run into unique challenges that are very specific to you and that are going to take a while to resolve. That's really why I came up with that test because it forces you to fix things. In a lot of cases, you'll find it's hard to set up the data for your tests. In a lot of cases, you'll find that, if you've got a large, tightly coupled system, there's a back-end system that some organization outside your group is responsible for that is unstable, or maybe they're deploying code while you're doing it, and you need to create service virtualization to isolate you from that instability. There's so many different things that are unique that the conclusion I've come to is, instead of saying, 'You should go do these things for your environment,' I would say, 'Go down and run that experiment, and run it over and over again until you can get it solved.'
If you look at the book, there's a case study of a large healthcare provider in the United States who went through that process and did it. It took them two to three years to get there, but when they got to the end of it, [they were] like, 'Wow, this is so cool. I fundamentally changed how my whole software development process works. Now that I have the signal, I can make sure everybody's accountable, they can check in code, we can move so much faster.' The journey is to get people to realize that and feel that because, until you've worked in an environment like this, you can't ever imagine it working. Once you've worked in environments like this, you can't imagine working a different way. It's just fun for me to see when the lightbulbs finally go off for people. It's fundamentally changed things that we did at HP so dramatically. It forced me to write my first book and got me on this path. It was such a breakthrough for me that I've spent the rest of my time and my career trying to help others avoid the mistakes I made and get there as quickly as they can. That's why I do podcasts. That's why I do webinars. That's why I write books: to help others realize the breakthroughs that are possible.