In parts one and two of our interview with author Michael J. Sydor, we learned about the challenges and potentials for failure with APM initiatives and what organizations can do to help mitigate those risks. In this final part of the three-part series, we talk about APM implementation and organizational structure. Sydor wraps up by telling us what he thinks is the most important takeaway from his new book, APM Best Practices: Realizing Application Performance Management.
SSQ: Is the “implementation” of APM the set of tasks associated with ensuring that performance monitoring is in place? Can you explain some of what takes place during this phase?
Sydor: I look at implementation from the perspective of a project manager -- what needs to be done, and when, to keep the project on track. I actually break a deployment into at least three iterations. For a new team, physical deployment is the primary risk. You need to keep the initial phase short and reliable, both to test your deployment mechanism and to allow an operational period to confirm that everybody is getting what they thought they were getting. There is nothing more frustrating than doing 100% of a deployment, only to find out that a ‘late-requirement’ was missing, which forces redeployment of a small configuration change across 100% of the environment. So I do 10% of the final goal, operate for two weeks (or more) and catch those late-requirements for the following phase. I’ll also keep alerting to a minimum to allow confirmation of thresholds -- nobody appreciates an alert storm due to a configuration error. The second phase is 30% of the final: any wrinkles in the deployment process are ironed out, and an additional increment of alerting or other integration is established, followed by a second operational period. The third deployment phase, the remaining 60%, should go like silk.
The goals in this phased deployment model are simple. Don’t over-commit the deployment footprint. Give yourself some opportunity for ‘late-requirements.’ Give yourself time to learn all the steps necessary for a reliable deployment. And an important, subtle point: make sure the first 10% are not problematic applications. Give yourself a chance to practice deployment so that if a problem pops up, you know whether it’s your process or a problematic application.
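The 10/30/60 phasing described above can be sketched as a simple rollout calculator. This is an editorial illustration, not from the book; the function name, the agent count of 200 and the two-week settling period are assumptions.

```python
# Sketch of a three-phase APM agent rollout: 10%, then 30%, then the
# remaining 60% of the final footprint, with an operational settling
# period after each phase to catch late-requirements.

PHASES = [0.10, 0.30, 0.60]  # fraction of the final footprint per phase

def rollout_plan(total_agents, settle_weeks=2):
    """Return (phase, agents_this_phase, cumulative, settle_weeks) tuples."""
    plan, cumulative = [], 0
    for phase, frac in enumerate(PHASES, start=1):
        count = round(total_agents * frac)
        cumulative += count
        plan.append((phase, count, cumulative, settle_weeks))
    return plan

for phase, count, total, weeks in rollout_plan(200):
    print(f"Phase {phase}: deploy {count} agents "
          f"({total} cumulative), then operate for {weeks}+ weeks")
```

Keeping the first phase small is what makes the deployment mechanism itself testable before the bulk of the footprint is committed.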
SSQ: What skills do you believe are important for those who will be APM “practitioners”?
Sydor: “Practitioners” is pretty broad, as I allow for administrators, project managers, application specialists, architects, evangelists and APM specialists. The details of these roles are covered in Chapter 4, so I’ll just focus on the APM specialist here.
The best candidate has three major capabilities: managing expectations, communicating remediation and handling all the other APM activities. Much of what the APM best practices do is distribute the ongoing activities to all the other stakeholders so that the APM specialist can focus on mentoring and triage (or firefighting). Triage is, of course, what everyone assumes they are getting with APM. But it is people who do triage; the tool is just that -- a tool. It gives you the data and the relationships among components; somebody needs to interpret it. The people in this role need to be skilled communicators who can balance multiple, competing agendas pulling at them. You need to be a diplomat and avoid calling someone’s baby (the application) ugly! You really can train anyone to do the various APM tasks, but you need a confident personality to stand up in front of a team in crisis and lead them to a consensus on how to proceed, based on your findings. I’ve got a chapter on “Firefighting and critical situations” to help you establish that capability.
SSQ: Organizationally, do you recommend that performance monitoring be a part of DevOps? Does the Practitioner Guide include best practices around both detecting and addressing performance issues?
Sydor: Where the APM practice sits organizationally can be quite varied -- most often it seems to end up reporting to operations. It depends on whether the initiative is starting in production, development, or QA. But detecting problems is not limited to the production experience. You can’t be proactive unless you detect problems prior to production. This means you need techniques for detecting problems with simulated (load-generated) as well as live data. In fact, the techniques are identical except for the volume of data needed to support a conclusion.
Simulated data offers the best control of the results. You can isolate specific use cases. You can load to failure. You can repeat scenarios to confirm your findings.
With live data, it can be very difficult to find consistent intervals on which to base a conclusion. That’s the nature of ‘live’ data: it is anything but predictable -- until you have a couple of days to compare. The first question in triage is always “what’s normal?” Until you know what that is, you really can’t be sure what ‘abnormal’ is going to be. Often it is not realistic to know what ‘normal’ is, so I have techniques (Triage with Single Metrics) specifically to address this. Doing a good job with single metrics takes experience and knowledge of how component-based software systems work. When you have some time, more often in a testing environment, you can follow “Triage with Baselines.” This technique requires only basic problem solving because it leads you directly to the important components, or signature, of the application -- even if you think Java is a choice between French Vanilla and Colombian. The signature, or baseline, is the definition of normal for that app. When you compare this with the production incident, after generating a comparable baseline there, it becomes pretty easy to find what is contributing to your performance problem.
And there are variations on these techniques but the goal is always a deep understanding, or characterization, of the application at hand -- no matter what environment it happens to be in.
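The core of the baseline comparison Sydor describes -- compare an incident’s per-component measurements against the application’s signature and surface the components that deviate -- can be sketched roughly as follows. The metric names, the 2x threshold and the data values are illustrative assumptions, not from the book.

```python
# Illustrative sketch of "Triage with Baselines": flag the components
# whose incident-time response exceeds a multiple of their baseline
# (the application's "signature", i.e. its definition of normal).

def deviating_components(baseline, incident, threshold=2.0):
    """Return (component, ratio) pairs where incident/baseline >= threshold,
    sorted worst-first."""
    flagged = []
    for component, normal in baseline.items():
        observed = incident.get(component)
        if observed is not None and normal > 0:
            ratio = observed / normal
            if ratio >= threshold:
                flagged.append((component, ratio))
    return sorted(flagged, key=lambda item: item[1], reverse=True)

baseline = {"servlet": 120, "jdbc": 40, "cache": 5}   # ms, the signature
incident = {"servlet": 150, "jdbc": 400, "cache": 6}  # ms, during the incident
print(deviating_components(baseline, incident))
# jdbc at 10x its baseline stands out as the likely contributor
```

The point of the technique is exactly this kind of direct pointer: once a comparable baseline exists, the deviating components identify themselves.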
SSQ: Who is the primary audience for this book and what do you think their most important takeaway will be?
Sydor: I’m really trying to grab the attention of anyone who participates in the application lifecycle -- development, testing, operations and the business sponsor/owner. This goes back to the idea of using APM to drive collaboration among the various stakeholders. Everyone needs to know the capabilities of the technology and how their role contributes to a reliable performance management strategy. Otherwise, the tool gets stuck in a silo and the full value proposition of APM is difficult to achieve.
APM itself has a lifecycle of adoption, deployment and overall utility to the organization. As the organization matures in its performance management capabilities there will be brief windows of opportunity to advance an APM initiative. These windows can pop up anywhere in the application lifecycle. The reader will know what those windows are and how to articulate a message and plan that will be able to capitalize on that opportunity. They will know what value APM can deliver, what they need to get started and how to get it done. Follow these APM Best Practices, and you will get APM done right.
Michael Sydor is an engineering services architect for CA Technologies. With more than 20 years mastering high-performance computing technology, he has significant experience identifying and documenting technology best practices, as well as designing programs for building, mentoring and operating successful client performance management teams.
APM Best Practices: Realizing Application Performance Management is available at many leading book retailers, including Apress, Amazon.com, Barnes & Noble, Borders, Powell's, Safari, and Springer, among others.