Peg McMahon, Justin Martin, Sprint Nextel - “Death to Dashboards: Alarming, Performance Management Based on Variance, System Prioritization and Other Thoughts on Data Visualization”
ABSTRACT: When does the light on the executive dashboard turn from yellow to red? When do you order new hardware? Traditionally, these decisions are handled by setting thresholds — picking some number to use as an upper or lower limit. Thresholds might have worked well in the days of a handful of beloved systems. But for today’s complex environments, thresholding is not only painful to manage but conceptually bankrupt. Let’s talk about the problems with thresholds and dashboards and work to identify some practical alternatives. Vendors, put on your iron underwear and attend this session.
COMMENTS: The main part of this paper describes what SEDS already provides and what I have presented in my CMG papers since 2001:
“Alarming Based on Variance. The next step towards better performance monitoring is the use of a baseline approach to performance management. Using the power of statistics, the performance metrics can be analyzed to create upper and lower control limits based on the normal variance of the system’s performance. From this analysis, dynamic thresholds can be set based on the normal variance represented by the data. Implementing dynamic, variance based thresholds takes into account the system’s typical workload characteristics. Now, when a back up occurs in the middle of the night, as long as the same back up has occurred at the same time for the past several nights, the CPU threshold is not breached. In theory, an alarm will only occur when the system utilization is above or below a dynamic threshold which outlines the “normal” processing range of the system. This method of monitoring will provide a more refined approach to alarming as it will help to better identify actual performance issues. This is important when the analyst is responsible for monitoring many systems. However, when implemented across thousands of systems, there will likely be several that will have at least one hour which exceeds the variance threshold, and thus triggers alarms. When using a conventional dashboard, this improved level of monitoring creates the same problem as found earlier. How do you prioritize the order in which to resolve the performance issues? In today’s business environment the number of systems is increasing while there are fewer people to manage them. Each system has a unique impact on the business. Understanding a system’s business impact and addressing system performance issues in the correct priority will save a company significant dollars. Using a conventional stoplight dashboard for system performance management will often confuse and delay critical decision making. 
One way to help address the prioritization problem, using the performance variance data, could be to create a sorted list. This type of report would present the servers having the most performance variance appearing at the top. Using this list, cross-referenced with a list of systems prioritized by their business criticality would be one way to determine which problems need to be addressed first. This method is not intuitive since it requires the analyst to jog between reports. However, it is a way to use the available data in order to make the most business impacting decision. The problem in determining how to quickly prioritize system performance issues is not necessarily due to a lacking in performance data, but rather the lack of a way to properly visualize the performance data....”.
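The two ideas in the quoted passage can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the method of the paper or of SEDS. The function names, the sample data, the choice of mean plus or minus three standard deviations as control limits, and the idea of multiplying the exceedance by a business-criticality score (instead of cross-referencing two separate reports) are all hypothetical.

```python
from statistics import mean, stdev


def control_limits(history, k=3.0):
    """Return (lower, upper) limits as mean +/- k sample standard deviations."""
    m = mean(history)
    s = stdev(history)
    return m - k * s, m + k * s


def hourly_exceedance(baseline_by_hour, today_by_hour, k=3.0):
    """Sum how far today's readings fall outside each hour's own limits."""
    total = 0.0
    for hour, value in today_by_hour.items():
        lower, upper = control_limits(baseline_by_hour[hour], k)
        if value > upper:
            total += value - upper
        elif value < lower:
            total += lower - value
    return total


def prioritize(exceedance_by_server, criticality_by_server):
    """Rank servers by exceedance weighted by an assumed criticality score."""
    scored = {
        name: exc * criticality_by_server.get(name, 1)
        for name, exc in exceedance_by_server.items()
        if exc > 0  # servers inside their normal range are not listed
    }
    return sorted(scored, key=scored.get, reverse=True)


# Hypothetical CPU utilization (%) history, keyed by hour of day.
# Hour 2 is high every night because of a regular backup.
baseline = {2: [78, 82, 80, 79, 81], 14: [30, 32, 29, 31, 28]}

nightly_backup = {2: 81, 14: 30}   # busy at 02:00, but normal for that hour
afternoon_spike = {2: 80, 14: 75}  # 75% at 14:00 is far above that hour's norm

quiet = hourly_exceedance(baseline, nightly_backup)
noisy = hourly_exceedance(baseline, afternoon_spike)

ranking = prioritize(
    {"web01": noisy, "batch07": 60.0, "db02": quiet},
    {"web01": 5, "batch07": 1, "db02": 10},
)
```

With these sample numbers the nightly backup raises no exception, while the afternoon spike does. The weighted ranking puts web01 ahead of batch07 even though batch07 has the larger raw exceedance, which is the kind of prioritization the passage asks for in a single list.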
The paper also presents another example of using a tree-map! Again, how a tree-map can be applied to performance metrics was shown in the CMG papers by Lin Merritt and me in 2004.
Amit Patel - “Software Performance Lifecycle at a Large National Bank”
COMMENTS: The paper shows some use of Statistical Process Control (SPC) techniques. From the abstract: "… Learn how custom monitoring, Six Sigma techniques, performance testing, and daily production reports played an important role in identifying production issues…."
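For readers unfamiliar with SPC, here is a minimal sketch of one common control chart, the individuals (XmR) chart. It is an assumption that this chart type is relevant here; the paper's abstract does not say which chart was used. The function names and the sample response times are hypothetical. The constant 2.66 is the standard XmR factor (3 divided by d2 = 1.128 for moving ranges of two points).

```python
from statistics import mean


def individuals_chart_limits(values):
    """Return (lcl, center, ucl) for an individuals (XmR) control chart."""
    if len(values) < 2:
        raise ValueError("need at least two observations")
    # Moving range: absolute difference between consecutive observations.
    moving_ranges = [abs(b - a) for a, b in zip(values, values[1:])]
    center = mean(values)
    spread = 2.66 * mean(moving_ranges)
    return center - spread, center, center + spread


def out_of_control_points(baseline, observed):
    """Return indices of observed values outside the baseline's limits."""
    lcl, _, ucl = individuals_chart_limits(baseline)
    return [i for i, v in enumerate(observed) if v < lcl or v > ucl]


# Hypothetical response times in milliseconds.
baseline_ms = [100, 102, 98, 101, 99]
todays_ms = [101, 110, 95, 90]

flagged = out_of_control_points(baseline_ms, todays_ms)
```

Here the limits come from the point-to-point variation of the baseline instead of a fixed threshold, so only the second and fourth observations (indices 1 and 3) are flagged.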
The following control chart was built with the Minitab statistical tool (http://www.minitab.com/).