System Management by Exception: CMG'07 trip report

Sunday, April 27, 2008


1. Statistical Process Control and Capacity Management (SEDS-like approach)

Igor Trubin, Ray White, IBM – “System Management by Exception: The Final Part”

ABSTRACT: Statistical Exception Detection System (SEDS) has been successfully used for more than seven years to automatically produce web-based exception reports and smart alerts against a performance database in a large multi-platform environment. This paper gives an overview of how SEDS uses Statistical Process Control (SPC) and Multivariate Adaptive Statistical Filtering (MASF) techniques and how it could be used as part of Lean Six Sigma. It focuses on memory usage exceptions, which SEDS captures, to proactively identify server and application performance issues.

COMMENTS:

- An example of how SEDS works against network metrics was presented there.

- The weekly-profile control chart (as opposed to the daily one) was introduced for the first time as a good basis for metric reporting.

That was actually Ray White’s idea, and I now use it as the best graphical representation of performance metric behavior.

This paper is scheduled to be presented again at the SCMG meeting in Raleigh, NC, on May 2nd, 2008 (http://regions.cmg.org/regions/scmg/spring_08/raleigh/meeting_05_02_08.htm).

Presentation: http://regions.cmg.org/regions/scmg/fall_07/richmond/SEDSCMG2007_v4.pdf
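The weekly-profile control chart mentioned above can be illustrated with a small sketch. This is not the SEDS implementation; it only assumes a MASF-style baseline in which each of the 168 hours of the week gets its own mean and its own control limits at plus or minus three standard deviations. The function names, the `k` parameter and the sample data are hypothetical.

```python
from collections import defaultdict
from statistics import mean, pstdev


def weekly_profile_limits(samples, k=3.0):
    """Build per-hour-of-week control limits from historical samples.

    samples: iterable of (hour_of_week, value), hour_of_week in 0..167.
    Returns {hour_of_week: (lower_limit, mean, upper_limit)}.
    """
    buckets = defaultdict(list)
    for hour_of_week, value in samples:
        buckets[hour_of_week].append(value)

    limits = {}
    for hour_of_week, values in buckets.items():
        m = mean(values)
        s = pstdev(values)  # population standard deviation of the baseline
        limits[hour_of_week] = (m - k * s, m, m + k * s)
    return limits


def find_exceptions(limits, latest_week):
    """Return (hour_of_week, value, 'above' | 'below') for out-of-limit hours."""
    exceptions = []
    for hour_of_week, value in latest_week:
        if hour_of_week not in limits:
            continue  # no baseline for this hour, so nothing to compare against
        lower, _, upper = limits[hour_of_week]
        if value > upper:
            exceptions.append((hour_of_week, value, "above"))
        elif value < lower:
            exceptions.append((hour_of_week, value, "below"))
    return exceptions


# Hypothetical history: four weeks of CPU % for two hours of the week.
# Hour 2 is a nightly backup (always busy), hour 10 is a normal working hour.
history = [
    (2, 90.0), (2, 92.0), (2, 88.0), (2, 90.0),
    (10, 30.0), (10, 32.0), (10, 28.0), (10, 30.0),
]
limits = weekly_profile_limits(history)

# The busy backup hour is normal for that hour; 60% at hour 10 is not.
exceptions = find_exceptions(limits, [(2, 91.0), (10, 60.0)])
print(exceptions)
```

Because the limits are computed per hour of the week, a regularly scheduled heavy job raises no exception, while a much lower value at an unusual hour does.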

2. Using SAS for Capacity Management (Vendors user group sessions)

Alla Piltser, Merrill Lynch – “Controlling the Bull: Managing Capacity and Performance Using SAS”

COMMENTS: This describes Merrill Lynch’s experience of providing Capacity Management for a large IT shop: a TeamQuest-based performance monitoring and data collection infrastructure in managed UNIX, Linux, Windows and ESX environments, plus a centralized SAS/ITRM infrastructure, plus an exception-based performance management reporting structure. My work was referenced as the right way to do exception-based reporting.

Frank Lieble, SAS – “Bringing ITIL to Life: Automating IT Capacity Management”
COMMENTS: The most interesting part of the presentation is the Capacity Management Portal (ITRM-based), which includes tree-map reporting. The tool is a good fit where there is a lack of good statisticians/SAS programmers.

BTW, SEDS data has already been used for tree-map reporting to highlight the most significant exceptions. See my CMG'03 paper for more details and the slide above with a tree-map example from that paper: (http://regions.cmg.org/regions/ncacmg/downloads/june162004_session3.ppt)

Peg McMahon, Justin Martin, Sprint Nextel – “Death to Dashboards: Alarming, Performance Management Based on Variance, System Prioritization and Other Thoughts on Data Visualization”
ABSTRACT: When does the light on the executive dashboard turn from yellow to red? When do you order new hardware? Traditionally, these decisions are handled by setting thresholds — picking some number to use as an upper or lower limit. Thresholds might have worked well in the days of a handful of beloved systems. But for today’s complex environments, thresholding is not only painful to manage but conceptually bankrupt. Let’s talk about the problems with thresholds and dashboards and work to identify some practical alternatives. Vendors, put on your iron underwear and attend this session.

COMMENTS: The main part of this paper covers what SEDS has already been providing and what I have presented in my CMG papers since 2001:
“Alarming Based on Variance. The next step towards better performance monitoring is the use of a baseline approach to performance management. Using the power of statistics, the performance metrics can be analyzed to create upper and lower control limits based on the normal variance of the system’s performance. From this analysis, dynamic thresholds can be set based on the normal variance represented by the data. Implementing dynamic, variance based thresholds takes into account the system’s typical workload characteristics. Now, when a back up occurs in the middle of the night, as long as the same back up has occurred at the same time for the past several nights, the CPU threshold is not breached. In theory, an alarm will only occur when the system utilization is above or below a dynamic threshold which outlines the “normal” processing range of the system. This method of monitoring will provide a more refined approach to alarming as it will help to better identify actual performance issues. This is important when the analyst is responsible for monitoring many systems. However, when implemented across thousands of systems, there will likely be several that will have at least one hour which exceeds the variance threshold, and thus triggers alarms. When using a conventional dashboard, this improved level of monitoring creates the same problem as found earlier. How do you prioritize the order in which to resolve the performance issues? In today’s business environment the number of systems is increasing while there are fewer people to manage them. Each system has a unique impact on the business. Understanding a system’s business impact and addressing system performance issues in the correct priority will save a company significant dollars. Using a conventional stoplight dashboard for system performance management will often confuse and delay critical decision making. 
One way to help address the prioritization problem, using the performance variance data, could be to create a sorted list. This type of report would present the servers having the most performance variance appearing at the top. Using this list, cross-referenced with a list of systems prioritized by their business criticality would be one way to determine which problems need to be addressed first. This method is not intuitive since it requires the analyst to jog between reports. However, it is a way to use the available data in order to make the most business impacting decision. The problem in determining how to quickly prioritize system performance issues is not necessarily due to a lacking in performance data, but rather the lack of a way to properly visualize the performance data....”.
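The sorted-list idea in the quoted passage can be sketched in a few lines. The quote describes two separate lists that the analyst has to cross-reference by hand; the sketch below merges them into a single ranked list. The scoring rule (deviation multiplied by a criticality weight) is purely an assumption for illustration, as are the function name, server names and numbers; the paper does not specify a formula.

```python
def rank_exceptions(deviations, criticality, default_weight=1):
    """Rank servers by performance deviation weighted by business criticality.

    deviations:  {server: how far the server is outside its normal range,
                  e.g. in standard deviations; 0 means no exception}
    criticality: {server: business weight, higher means more important}
    Returns a list of (server, score), highest score first.
    """
    scored = [
        (server, deviation * criticality.get(server, default_weight))
        for server, deviation in deviations.items()
        if deviation > 0  # servers inside their normal range are not listed
    ]
    return sorted(scored, key=lambda item: item[1], reverse=True)


# Hypothetical input: batch07 deviates the most, but db01 matters more.
deviations = {"web01": 2.0, "db01": 1.5, "batch07": 4.0, "test03": 0.0}
criticality = {"web01": 3, "db01": 5, "batch07": 1}

ranked = rank_exceptions(deviations, criticality)
for server, score in ranked:
    print(server, score)
```

With these made-up weights the business-critical database moves to the top of the list even though the batch server shows the largest raw deviation, which is the kind of prioritization the quote asks for.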

The paper also presents another example of using a tree-map! Again, how a tree-map can be used against performance metrics was shown in my and Lin Merritt’s CMG papers in 2004.

Amit Patel - “Software Performance Lifecycle at a Large National Bank”
COMMENTS: The paper shows some use of Statistical Process Control (SPC) techniques. From the abstract: "… Learn how custom monitoring, Six Sigma techniques, performance testing, and daily production reports played an important role in identifying production issues…. "
The control chart was built with the “Minitab” statistical tool (http://www.minitab.com/).

Igor Trubin

He started in 1979 as an IBM/370 system engineer. In 1986 he received his PhD in Robotics at St. Petersburg Technical University (Russia) and then worked as a professor teaching CAD/CAM and Robotics for 12 years. He published 30+ papers and made several presentations at conferences in the Robotics and Artificial Intelligence fields. In 1999 he moved to the US and worked at Capital One bank as a Capacity Planner. His first CMG.org paper was written and presented in 2001. The next one, "Exception Detection System Based on MASF Technique," won a Best Paper award at CMG'02 and was presented at UKCMG'03 in Oxford, England. He made other technical presentations at IBM z/Series Expo, SPEC.org, and Southern and Central Europe CMG, and ran several workshops covering his original method of Anomaly and Change Point Detection (Perfomalist.com). He is the author of the “Performance Anomaly Detection” class (at CMG.com). He worked for 2 years as the Capacity team lead for IBM, for SunTrust Bank for 3 years, and then at IBM for 3 years as Sr. IT Architect. He now works for Capital One bank as an IT Manager in Cloud Engineering, and since 2015 he has been a member of the CMG.org Board of Directors. He runs the UT channel iTrubin.
