Modern applications run on multiple tiers, each with its own points of failure. As soon as one of those points goes down, the support team gets a flood of calls. Someone puts a notice on the company intranet, or perhaps sends an email. Then the debugging begins.
Application performance management, or APM, is a process that makes application performance visible, and can even push notifications to support when performance falls below set thresholds.
What problem does APM solve?
By monitoring the application, we know where the failure is when it occurs -- whether it's in the Web server, the service-oriented tier, the database, or even the network. Not only can we notify the team when something is down or slow, but we can also send out a notice when a trend shows performance degrading. That means the support team can prevent the problem, adding or shifting resources before it becomes critical.
The 'what' of APM
Traditional measures of software are about capacity, utilization and throughput: CPU, memory, disk and network. Monitoring all of them doesn't tell you anything about what the user is actually experiencing until there's a serious problem.
When the CPU goes beyond limits or you run out of disk, you'll notice two problems. First, the systems that fail are not tied to applications. A DBA who notices a problem won't know which applications may be affected. Second, the monitors aren't tied to the end user. It is entirely possible that the problem is application code running slowly -- a poor sort algorithm or a loop within a loop -- in a way that doesn't put much demand on the CPU, the network or disk.
The classic solution to this is something called Method-R Measurement. The simple point of Method-R is to measure the entire customer experience, end to end, or to get as close to that as possible.
Here's how to get started on APM, especially for Web applications.
Begin with what you have
Most Web applications produce log files. With a little work, the programmers can change the log files to record both the URL accessed and the total time to load that resource. That measurement does not include the transit time after the response leaves your corporate network, or the time to render on the client's machine; but it is an approximation of how long your software takes to load a Web page, end to end.
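As a minimal sketch of that first step, here is one way to pull the URL and load time back out of a log line. The log format, the `Hit` type and the `parse_line` function are all assumptions made for illustration; your server's format will differ.

```python
import re
from typing import NamedTuple, Optional

# Hypothetical log format (an assumption, not a standard):
#   <ISO timestamp> <METHOD> <URL> <status> <milliseconds>ms
LOG_LINE = re.compile(
    r"^(?P<ts>\S+) (?P<method>[A-Z]+) (?P<url>\S+) (?P<status>\d{3}) (?P<ms>\d+)ms$"
)


class Hit(NamedTuple):
    timestamp: str
    url: str
    status: int
    millis: int


def parse_line(line: str) -> Optional[Hit]:
    """Return the URL, status and load time for one log line, or None if it does not match."""
    m = LOG_LINE.match(line.strip())
    if m is None:
        return None
    return Hit(m["ts"], m["url"], int(m["status"]), int(m["ms"]))


sample = "2024-01-15T10:00:00Z GET /checkout 200 412ms"
hit = parse_line(sample)
```

Lines that do not match return `None` rather than raising, so one malformed entry does not stop the whole run.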
Once you have those times, you can send them to a server that aggregates and reports on performance, as Noah Sussman explains in this video.
It is also useful to monitor the number of 404 errors, which occur when people try to access pages that don't exist. In addition, you might want to report back database queries that result in "bad SQL" errors or unhandled exceptions that bubble up to the Web server. When there is a sudden spike in these sorts of errors, something is wrong. It could be an ill-considered use case, a security-penetration exploit or a release of bad code. The reports will let you target the problem very early.
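One simple way to define a "sudden spike" is to compare the current interval's error count with the recent average. The sketch below does that; the `factor` and `floor` values are arbitrary assumptions that you would tune against your own traffic.

```python
from statistics import mean
from typing import Sequence


def is_spike(history: Sequence[int], current: int,
             factor: float = 3.0, floor: int = 10) -> bool:
    """Flag `current` when it is well above the recent average.

    history: error counts for earlier intervals, e.g. 404s per minute.
    factor:  how many times the baseline counts as a spike (assumed value).
    floor:   ignore small counts, so going from 0 to 2 errors pages nobody.
    """
    if current < floor:
        return False
    baseline = mean(history) if history else 0.0
    # Treat a baseline below 1 as 1 so a quiet history cannot make
    # every non-zero count look like a spike.
    return current > factor * max(baseline, 1.0)
```

The same check works for "bad SQL" errors or unhandled exceptions; only the counts fed in change.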
Another way to summarize performance is to use a tool that aggregates logs, sort of like a log search engine. Some of these tools allow you to create reports of system performance. If your logs include timings, then you can create real-time reports that show the percentage of Web pages that load slower than some acceptable threshold. Throw the reports on monitors all around the operations area, and when something goes wrong, people will notice.
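The "percentage slower than a threshold" figure is simple arithmetic once the timings exist. A small sketch, where the two-second default threshold is only an assumed example:

```python
from typing import Sequence


def percent_slow(load_times_ms: Sequence[float], threshold_ms: float = 2000.0) -> float:
    """Percentage of page loads that took longer than threshold_ms."""
    if not load_times_ms:
        return 0.0
    slow = sum(1 for t in load_times_ms if t > threshold_ms)
    return 100.0 * slow / len(load_times_ms)
```

For example, `percent_slow([100, 2500, 300, 4000])` counts two of the four loads as slow.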
Next: Create a dashboard
Once you've got the data, you'll want to summarize it. Programmers at Etsy Inc. use Graphite, an open source real-time graphing service. Run Graphite on a computer pointed to a database with performance information, and any computer on your network can request a report, selected by URL, that updates in real time.
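If you go the Graphite route, its Carbon daemon accepts a plaintext protocol: one line per data point, in the form "metric path, value, timestamp", by default on TCP port 2003. The sketch below formats and sends such lines. The host name, the `web.` prefix and the `load_ms` suffix are assumptions made up for this example, not anything Graphite requires.

```python
import re
import socket
from typing import Iterable


def metric_for_url(url: str, prefix: str = "web") -> str:
    """Turn a URL path into a dotted metric name (naming scheme is an assumption)."""
    parts = [p for p in url.split("?")[0].split("/") if p]
    # Dots separate levels in a Graphite metric path, so replace them inside a segment.
    safe = [re.sub(r"[^A-Za-z0-9_-]", "_", p) for p in parts] or ["root"]
    return ".".join([prefix, *safe, "load_ms"])


def graphite_line(metric: str, value: float, timestamp: int) -> str:
    """Format one data point for the plaintext protocol."""
    return f"{metric} {value} {timestamp}\n"


def send_to_graphite(lines: Iterable[str],
                     host: str = "graphite.example.internal",  # hypothetical host
                     port: int = 2003) -> None:
    """Send pre-formatted lines to Carbon over TCP."""
    payload = "".join(lines).encode("ascii")
    with socket.create_connection((host, port), timeout=5) as sock:
        sock.sendall(payload)
```

A caller would typically build a batch with `graphite_line(metric_for_url(url), millis, int(time.time()))` and hand the whole batch to `send_to_graphite` once per interval.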
With a little work and a large monitor, you can create a dashboard and report key statistics on each application.
Plenty of formats and tools exist to create dashboards. The key is to make something pluggable so that creating new items is a trivial task: just add the data to the database and set the parameters of the graph. The Etsy example is a line graph, but red, yellow, green monitors are also common.
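To illustrate what "pluggable" can mean here, the sketch below describes each dashboard item as a small piece of data, so adding a widget means adding one entry to a list. The `Widget` class, the metric names and the thresholds are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Widget:
    title: str
    metric: str       # name of the series stored in the database
    yellow_at: float  # value at which the light turns yellow
    red_at: float     # value at which the light turns red

    def status(self, value: float) -> str:
        """Map the latest value to a red/yellow/green light."""
        if value >= self.red_at:
            return "red"
        if value >= self.yellow_at:
            return "yellow"
        return "green"


# Adding a new dashboard item is one more line here (example values only).
WIDGETS = [
    Widget("Checkout load time (ms)", "web.checkout.load_ms", 1500, 3000),
    Widget("404s per minute", "web.errors.404", 20, 100),
]
```

This assumes higher values are worse, which holds for load times and error counts but would need inverting for a metric such as availability.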
What about the real user experience, end-to-end?
Say your website relies heavily on images, a dozen images per page. Each image individually loads just quickly enough to not report as a problem on the dashboard, but, to the user, the interface is painfully slow.
The reports we listed above help, but none of them will detect this issue.
To report on end-user experience in true Method-R fashion, we can measure it directly:
- Continuously run a set of tests in production that you expect to pass. For example, log in, look at some static pages -- or create a report with the same from/to date -- then log out, over and over again.
When each page is loaded, the software inserts a row into the database with the current time, time to load and URL. Once that information is in the database, we can use the dashboard to create a new kind of dashboard widget that shows actual end-to-end load times, Method-R style.
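A rough sketch of that recording step, using SQLite from the Python standard library, follows. The table name, column names and function names are assumptions. Note also that the default `fetch` only downloads the page itself, not its images or scripts; capturing the full experience described above would need a tool that drives a real browser.

```python
import sqlite3
import time
from typing import Callable
from urllib.request import urlopen


def open_db(path: str = ":memory:") -> sqlite3.Connection:
    """Open the results database and create the (hypothetical) table if needed."""
    db = sqlite3.connect(path)
    db.execute(
        "CREATE TABLE IF NOT EXISTS page_loads "
        "(checked_at REAL, url TEXT, load_ms REAL)"
    )
    return db


def fetch(url: str) -> None:
    """Download the page body only; images and scripts are not fetched."""
    with urlopen(url, timeout=30) as response:
        response.read()


def record_page_load(db: sqlite3.Connection, url: str,
                     load: Callable[[str], None] = fetch) -> float:
    """Time one page load and insert a row with current time, URL and load time."""
    started = time.perf_counter()
    load(url)
    elapsed_ms = (time.perf_counter() - started) * 1000.0
    db.execute(
        "INSERT INTO page_loads (checked_at, url, load_ms) VALUES (?, ?, ?)",
        (time.time(), url, elapsed_ms),
    )
    db.commit()
    return elapsed_ms
```

Passing the loader in as a parameter lets the same function time a plain HTTP fetch or a scripted browser session that logs in, views pages and logs out.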
Putting it all together
We start by pulling log data, then putting that into a dashboard. Then we add a little bit of business logic to notify us when the software's performance is trending worse. Then we create dashboard reports, perhaps projecting the data on a wall that anyone can see. We can add in some system performance measures -- disk, CPU, network -- to that dashboard, but our monitoring stays tied to the application. As a final piece, we add in business logic to analyze trends and report on those that are troubling.
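The trend-analysis piece can start out very small. One possible approach, sketched below, fits a least-squares line through recent load times and raises a flag when the slope is steeper than a chosen limit. The limit of 5 ms per sample is an arbitrary assumption.

```python
from typing import Sequence


def slope(values: Sequence[float]) -> float:
    """Least-squares slope of values against their index (0, 1, 2, ...)."""
    n = len(values)
    if n < 2:
        return 0.0
    mean_x = (n - 1) / 2
    mean_y = sum(values) / n
    num = sum((i - mean_x) * (v - mean_y) for i, v in enumerate(values))
    den = sum((i - mean_x) ** 2 for i in range(n))
    return num / den


def is_degrading(load_times_ms: Sequence[float], max_slope: float = 5.0) -> bool:
    """True when load times are rising faster than max_slope ms per sample."""
    return slope(load_times_ms) > max_slope
```

A fitted slope is less jumpy than comparing the last two samples, so a single slow request is less likely to trigger a notice.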
My advice: Take APM a piece at a time. You might develop a long-term strategy, but work on each piece individually, so you can stop at the end of each step and have something a little easier to manage and to support.
The trek isn't easy, but never getting a customer complaint about a slow -- or down -- system you weren't aware of is something worth striving for.