Tuesday, December 29, 2009

Exception Value (EV) and OPNET Panorama

I have recently looked at the following OPNET resources to get an impression of the OPNET Panorama tool:

1. Link to website: http://www.opnet.com/solutions/application_performance/panorama.html
2. White paper downloaded from that site: "Understanding OPNET Panorama’s Performance Analysis Engines"

General comment: OPNET has become the next tool vendor with “behavior learning” capabilities similar to what I have been doing for years with my SEDS and to the tools from other vendors, such as Netuitive, Integrien, and ProactiveNet (BMC), that I have recently studied (check my older postings: http://itrubin.blogspot.com/2009/02/realtime-statistical-exception.html).

Special comment: In the OPNET white paper I read: "Metrics that exhibit deviations from normal are automatically identified and assigned scores based on “how abnormal” their behavior is."
This is very close to what I introduced in my 1st CMG paper in 2001 ("Exception Detection System, Based on the Statistical Process Control Concept") and called the ExtraVolume of a metric (I now call it the Exception Value (EV) meta-metric). OPNET's white paper references that CMG'01 paper of mine, but it does not mention that they use a very similar approach to rank exceptions (“Area Out vs. Limit Range In Metric Scoring”).
Here is an example from my 1st CMG paper of using the EV metric to build a TOP list of exceptional Unix servers:


I even tried to normalize that metric against some Unix benchmarks (TPC) to compare ranges of exceptional capacity usage across different server types and configurations. An example of that report is in the paper.
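The EV idea can be sketched in a few lines. All server names, data, and limits below are made up for illustration (this is not code from the paper): EV accumulates the "area" by which a metric goes outside its control limits, and servers are ranked by that score.

```python
# Illustrative sketch of the Exception Value (EV) idea: score each server by the
# area of its metric outside the control limits, then build the TOP list.

def exception_value(actual, upper, lower):
    """Sum of areas outside the control limits (hourly samples, unit width)."""
    ev = 0.0
    for a, u, l in zip(actual, upper, lower):
        if a > u:
            ev += a - u      # area above the upper limit
        elif a < l:
            ev += l - a      # area below the lower limit (unusually low usage)
    return ev

# Hypothetical CPU-utilization profiles for three Unix servers (24 hourly points).
servers = {
    "unixA": ([50] * 20 + [95, 96, 97, 98], [80] * 24, [20] * 24),
    "unixB": ([50] * 24,                    [80] * 24, [20] * 24),
    "unixC": ([10] * 6 + [50] * 18,         [80] * 24, [20] * 24),
}

top = sorted(servers, key=lambda s: exception_value(*servers[s]), reverse=True)
print(top)  # servers ordered by EV, most exceptional first
```

Here unixA exceeds its upper limit, unixC falls below its lower limit, and unixB stays inside, so the TOP list comes out ["unixA", "unixC", "unixB"].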

All in all, it is good news that another vendor has been adopting this technology (maybe with some influence from my work!). Based on my experience with OPNET tools (very limited, just ITD Guru for network and some server behavior simulations), the tool most likely can be trusted. To say more, I would need at least to play with a demo....

Tuesday, November 10, 2009

SEDS-Lite Introduction

To share in code some ideas of my exception detection methodology, I am developing a SEDS-Lite version using "R" scripting (http://www.r-project.org/). One of the scripts (cchrt.r - see the front picture of this post) has already been published on my blog; it builds control charts against CSV data: http://itrubin.blogspot.com/2009/03/power-of-control-charts.html




How exactly that works, along with more R scripts, will be presented at my workshop at the upcoming CMG'09 conference in Dallas (http://itrubin.blogspot.com/2009/07/my-cmg09-sunday-workshop.html) - you are welcome to attend!

Some additional R scripts can be found in my SCMG presentation: http://itrubin.blogspot.com/2009/05/seds-charts-at-scmg.html

The trick is that SAS 9.2 can execute R scripts. You can also try another SAS-like product (http://www.teamwpc.co.uk/products/wps) that also understands R; plus, there are some ways to use SAS data for R graphing: http://www.hollandnumerics.co.uk/pdf/SAS2R2SAS_paper.pdf - or just use the good book "SAS and R" (I recently bought it and highly recommend it): http://sas-and-r.blogspot.com/

Wednesday, October 21, 2009

Lower Control Limit Usage Examples for IT Capacity Management

I recently posted the following question as a LinkedIn discussion subject for the "Statistical Process Control" group: "Does it make any sense to use Control Charts for capacity management?" and got one pessimistic comment, which included the following statement:

"...The only situation I can think of using a control chart for capacity is if you had a piece of equipment that if over utilized would cause damage or premature wear in which case you would only have an upper control..."

I disagree. My system (SEDS) has a special part (updated lists) called "Unusual Capacity Usage OUTSIDERS" that can help capture some serious issues with servers, such as a database going down, an LPAR migrating off a host, and other unusual capacity releases that are not necessarily good things:

The following control charts from my upcoming CMG'09 workshop presentation are good illustrations of the types of findings SEDS captures:

1. VMware host issue (VM migration):



2. Unisys server database is down:




3. Unusually low CPU usage by a mainframe application:


Sunday, September 20, 2009

Near-Real-Time IT-Control Charts

Next Thursday, September 24, 2009, at the Richmond SCMG meeting, I am going to present an updated version of my previous presentation, "Power of Control Charts". This time the focus is on Near-Real-Time IT-Control Charts. Below is a clip that shows an example of a Near-Real-Time IT-Control Chart simulated by an R program:

video

The presentation will be published on the SCMG site: http://regions.cmg.org/regions/scmg/fall_09/richmond/meeting_09_24_09b.htm

Thursday, August 27, 2009

IT-Chart: The Best Way to Visualize IT Systems Performance

How can one see the most current metric data, the most recent data, and a retrospective view, all in one picture? Is that possible? Yes, it is.
I guess a simple radar in a plane or ship cockpit refreshes current data on top of the most recent data and shows the approaching "future". The SEDS control chart is similar: it uses a border line to separate the current data from the most recent data. Plus, it gives a historical baseline to show what can be expected and for comparison.

I believe it is a powerful way to visualize IT Systems Performance, so I made up a name for this chart: "IT-CHART".
(Not only because IT is my initials....)

I plan to add the pictures from this blog as additional slides to my CMG'09 workshop, which includes the R script to build an IT-CHART from CSV input (see the abstract here).

Thursday, July 23, 2009

Real-Time Control Charts for SEDS

I am still analyzing different tools that capture computer application abnormalities based on real-time data. In addition to Integrien Alive (Integrien is now a part of VMware) and Netuitive, I have recently looked at BMC ProactiveNet Analytics. I spoke with BMC SMEs, and they showed me a live demo of the tool. I have always respected BMC (and especially BGS) as the actual inventor of this approach (MASF), and long ago I used to analyze statistical exceptions using BMC Visualizer and BMC Perceive (BTW, I have published in my papers a few examples of how I did that). Now they have another very good tool for the same purpose (http://documents.bmc.com/products/documents/49/13/84913/84913.pdf)

Watching the live presentation, I got a positive impression of how it works for complex applications and transactions, correlating different abnormal events with the possibility of reducing false-positive situations. It is interesting that a combination of dynamic and static thresholds is used there to generate alarms - just like SEDS does: static ones to capture hot issues (run-aways and leaks) and statistical ones for early warnings.

Now I have a very difficult task: choosing which of those three products (plus SEDS) to recommend to my management...

Speaking of SEDS, I decided to play with near-real-time data to see how difficult it would be to redesign SEDS to make it work more like the modern and serious tools mentioned above. Fortunately, SEDS is just a bunch of SAS macros with parameters, which helped me make the adjustments needed to include today's data. And surprisingly, that was a pretty easy task! I spent only a couple of days developing a "real-time SEDS" prototype. Currently, all it does is build, every hour, the real-time Control Charts that can be seen at the beginning of this post.

I plan to include some details about real-time Control Charts in my upcoming CMG'09 Workshop.

Monday, July 6, 2009

My CMG’09 Sunday Workshop

(2010 UPDATE: based on the workshop, a CMG'10 paper has been written and will be published at the CMG conference - http://itrubin.blogspot.com/2010/11/my-cmg10-presentation-it-control-charts.html)

My workshop entitled
"Power of Control Charts: How to Read, How to Build, How to Use"
has been accepted for the CMG’09 Sunday Workshop program, to be held at the Gaylord Texan in Dallas, Texas, December 6, 2009 (http://cmg.org/conference/cmg2009/)

The workshop proposal is as follows:

One of the most powerful ways to visualize computer system behavior is the Control Chart. Originally used in Mechanical Engineering, it has become one of the main Six Sigma tools for optimizing business processes, and after some adjustments it is used in the IT Capacity Management area, especially in “behavior learning” products.

During the workshop, the following topics will be discussed: What is the Control Chart? Where is the Control Chart used: a review of some systems performance tools that use it. Control chart types: MASF charts vs. SPC. A gallery of charts already published in CMG papers, plus some new charts, with explanations of how to read them. How to build a Control Chart: using Excel for interactive analysis and R to do it automatically. The session includes a live demonstration of using Excel to build different types of control charts against real performance data. Attendees will be provided with CDs containing the data in spreadsheets and will build Control Charts themselves, even with their own data. Finally, they will be able to run an R script that builds a Control Chart from input CSV data.

This workshop is based on a series of CMG papers published by the author. A prototype of the workshop was presented twice this year at Southern CMG meetings in VA and NC.

The presentation slides are already published here: http://regions.cmg.org/regions/scmg/spring_09/PowerOfControlCharts.pdf
The most current and updated version can be ordered from CMG.org: http://www.cmg.org/downloads/sunday_workshop.pdf 

Thursday, July 2, 2009

Capacity Management Found in Translation

I have just created my 2nd blog to share my technical ideas and thoughts in Russian. If you can read Russian please visit http://www.ukor.blogspot.com/.

The name of my new blog is "Управление Вычислительной Мощностью", which simply means “Capacity Management”. I recently found that translation of the term in a Russian article (click here to read) published in 2008 by the Enterprise Systems and Software Laboratory, HP Laboratories Palo Alto. I was so glad that I had finally figured out how to say “Capacity Management” in Russian! For the past 10 years of doing Capacity Management, I always had a problem explaining to my Russian friends and relatives what my occupation was! Now I know, and that fact inspired me to start my new blog for Russian readers.

Another reason is the 20th anniversary of the 1st program I wrote and sold. It was a graphical editor with some CAD features, written in FORTRAN for a PC with a PDP-type processor (DVK-3). The name of that program was UKOR (in Russian that means “REPROOF”). That’s why the link to my new blog is “http://www.UKOR.blogspot.com/”!

Wednesday, June 17, 2009

Management by Exception: Business vs. System

Management by Exception is actually an old idea; it is used for Business Process management and even for Accounting, as defined on the following website that I recently found: http://www.allbusiness.com/glossaries/management-by-exception/4944378-1.html

Wikipedia, referring to the same source, defines Management by Exception as a
"policy by which management devotes its time to investigating only those situations in which actual results differ significantly from planned results. The idea is that management should spend its valuable time concentrating on the more important items (such as shaping the company's future strategic course). Attention is given only to material deviations requiring investigation."

I would say that if one applies this definition to IT, it turns into my term "System Management by Exception", where the "management" is Capacity Management analysts or Capacity Planners and the "material deviations" are server or application exceptions.

Speaking of application exceptions, I am currently working on applying the “Management by Exception” approach not to server farm capacity management (I think I have already done that successfully) but to a set of applications, to automatically produce a list of only those applications that have some exceptions (unusual but not yet deadly behavior), to help provide proactive application capacity/performance management. Why? Because in some IT environments with a large number of applications, centralized capacity management does not exist; application support teams have to play that role, and SEDS (a System Management by Exception tool) should automatically deliver to them which systems need attention within each exceptional application.

Wednesday, June 10, 2009

CMG Board of Directors Nomination

Final update:

2015 UPDATE:
Thanks to all who voted for me last year! I am resubmitting my nomination again this year.

2014 UPDATE:
This year I was nominated again!
So, if you are a CMG.org member, please vote! For how to vote, check HERE.

The Chair of the 2009 CMG Nominating Committee asked me to nominate myself for the CMG Board of Directors. Apparently I am qualified, and I believe it is a great honor. I have decided to do so, and below is my nomination statement.

Willingness to Serve:
CMG has been an extremely valuable part of my professional life for the past ten years. Because of CMG, I became a known specialist in IT Capacity Management discipline! I have already worked at the local level to support the organization and would like to serve on CMG's Board of Directors to continue promoting the organization throughout the IT community. My company and family members support my involvement with and commitment to CMG.

Professional Work Experience: I have over 30 years of experience in the IT field. I started my career in 1979 as an IBM 370 system engineer. In 1986, I received my PhD in Robotics at St. Petersburg Technical University (Russia), where I then taught full-time, for about 12 years, subjects such as CAD/CAM, Robotics, and Computer Science. I have published more than 30 papers and given several presentations at international conferences in the Robotics, Artificial Intelligence, and Computing fields. In 1999, I moved to the US and worked at Capital One bank as a Capacity Planner. My first CMG paper was written and presented in 2001. The next one, "Global and Application Level Exception Detection System Based on MASF Technique," won a Best Paper award at CMG’02 and was presented again at UKCMG’03 in Oxford, England. My CMG’04 paper was republished at the IBM zSeries Expo. I also presented my papers at the Central Europe CMG conference (Austria) and at numerous US regional meetings. After working more than two years as the Capacity Management Team Lead at IBM, in 2007 I accepted a Senior Capacity Planner position at SunTrust Bank, where I am currently employed.

Other Professional Experience: I have long experience working as a programmer. I have also acquired extensive managerial experience as the head of the University’s CAD/CAM lab and as a Team Lead at IBM. Since March 2005, I have been serving as Vice Chair of Southern CMG, providing vendor connections.

Candidate Statement: I believe that I am uniquely qualified and motivated to serve CMG and its future development as the IT landscape changes. My major accomplishment is the Statistical Exception Detection System (SEDS) for IT Capacity Management. SEDS ideas and techniques are published in a series of my CMG papers over the last ten years and also in this technical blog. My position as a Capacity Management expert and my dedication to the CMG organization will allow me to contribute in substantial ways. I further believe that my teaching experience could enhance CMG’s training and educational services for the technical community. If elected, I will diligently pursue innovative ways to strengthen the organization’s membership. I will continue CMG’s dedicated tradition of volunteerism and will actively seek ways to support and improve CMG's commitment to its members.

If you are a CMG member, please vote for me!

Thursday, May 7, 2009

SEDS charts at SCMG

SCMG has just held two great meetings:


My presentation "Power of Control Charts" was well received. The slides are published here: http://regions.cmg.org/regions/scmg/spring_09/PowerOfControlCharts.pdf
I was able to demonstrate live some control chart building techniques, including R scripting. That encouraged me to submit a workshop proposal for CMG'09: "Power of Control Charts: How to Read, How to Build, How to Use".



If it's accepted, please come see my workshop on December 6 in Dallas, Texas: http://www.cmg.org/conference/!

P.S. Interesting news about R that I got from the SCMG meeting: SAS 9.2 can execute R scripts. Has anybody tried it?

Wednesday, March 25, 2009

Power of Control Charts




This spring's SCMG meetings have my new presentation "Power of Control Charts" scheduled:


I plan to present this as a workshop, which will cover the following:

- What is the Control Chart? A little bit of theory and history.
- Where the Control Chart is used: a review of some systems performance tools on the market that build and use control charts.
- How SEDS uses it: MASF charts vs. SPC ones; a long gallery of charts already published in CMG papers, plus some new ones, with explanations of how to read them.
- How to build a Control Chart: using Excel for interactive analysis and R to automate control chart generation, with a live demonstration of the technique.

Here is the last part of the presentation:
an R script (http://www.r-project.org/) that builds a monthly profile of some real Unix file system space utilization in the form of a monthly Control Chart.

The following are the input data (CSV file) and the R script.

day,Current Month Data,UpLimit,Mean,LowLimit
1,0.45,0.54,0.42,0.31
2,0.45,0.54,0.42,0.31
3,0.45,0.54,0.42,0.31
4,0.45,0.54,0.42,0.31
5,0.45,0.54,0.42,0.31
6,0.45,0.53,0.43,0.32
7,0.45,0.54,0.43,0.32
8,0.45,0.54,0.43,0.32
9,0.45,0.53,0.43,0.33
10,0.45,0.53,0.43,0.33
11,0.45,0.53,0.43,0.33
12,0.72,0.53,0.43,0.33
13,0.72,0.53,0.43,0.33
14,0.72,0.53,0.42,0.32
15,0.45,0.53,0.42,0.32
16,0.45,0.55,0.43,0.31
17,0.45,0.55,0.44,0.33
18,1.00,0.54,0.44,0.33
19,0.84,0.54,0.44,0.33
20,0.84,0.54,0.44,0.34
21,0.84,0.54,0.44,0.34
22,,0.54,0.44,0.34
23,,0.52,0.44,0.36
24,,0.52,0.44,0.36
25,,0.51,0.43,0.36
26,,0.66,0.46,0.26
27,,0.66,0.46,0.25
28,,0.62,0.45,0.28
29,,0.62,0.45,0.28
30,,0.54,0.43,0.32
31,,0.54,0.43,0.32

## R script to plot a control chart to jpeg against CSV input - I.Trubin 2009
jpeg("C://CMG/2009/cchrt.jpg")
cchrt <- read.table('C:/Users/TIgr/CMG/2009/cchrt.csv', header=T, sep=",")
plot (cchrt[,1],cchrt[,2],type="l",col="black",ylim=c(0,1),lwd=2,ann=F)
points (cchrt[,1],cchrt[,3],type="l",col="red", ylim=c(0,1),lwd=1,ann=F)
points (cchrt[,1],cchrt[,4],type="l",col="green",ylim=c(0,1),lwd=1,ann=F)
points (cchrt[,1],cchrt[,5],type="l",col="blue", ylim=c(0,1),lwd=1,ann=F)
mtext("Space Utilization",side=2, line=3.0)
mtext("days of month", side=1, line=3.0)
mtext("CONTROL CHART", side=3, line=1.0)
legend(9,0.3,c("Current Month","UpperLimit","Mean","LowerLimit"),
col=c("black","red","green","blue"),lwd=c(2,1,1,1),bty="n")
dev.off()

The result is in the picture at the beginning of the posting.

Welcome to my presentation!

(Other examples posted here: Near-Real-Time IT-Control Charts )

Saturday, February 28, 2009

Real-Time Statistical Exception Detection

Does it make sense to apply statistical filtering to real-time computer performance data? I have not tried it, as I believe that analyzing the last day's data against a historical baseline (based on dynamic statistical thresholds) would be enough to give a good alert about an upcoming issue, while at the same time a classical alerting system (based on constant thresholds, for instance Patrol or SiteScope) captures severe incidents if something is completely dying.

But I see that some companies do that, using (at least) the following three products available on the market:

1. Integrien Alive™ (http://www.integrien.com/ )
2. Netuitive (http://netuitive.com/ )
3. ProactiveNet (now BMC), (http://documents.bmc.com/products/documents/49/13/84913/84913.pdf )

Plus, FireScope (http://www.firescope.com/default.htm) and Managed Objects (http://managedobjects.com/) do something similar.

I recently had a discussion with Integrien sales people when they gave a live presentation of the Alive product for the company I now work for.
I was impressed; it looks like it works well. Most interesting to me is the difference between SEDS (my approach) and their technology.

Apparently both approaches are using dynamic statistical thresholds to issue an alert.

But I think they do that using some patented, complex statistical algorithms that should work well even if the sample data is not normally distributed. It is based on research that Dr. Mazda A. Marvasti did, and I am aware of this research, as some of his thoughts were published by CMG (in MeasureIT) a couple of years ago. It contains a very good critique of SPC (Statistical Process Control) concepts applied to IT data: SPC works perfectly if the data is normally distributed, and not so perfectly if it is not. The first attempt to improve SPC was MASF, which regroups the analyzed data; after regrouping, the data may be closer to normal. SEDS is based on MASF and, for instance, looks at history in a different dimension: instead of comparing (calculating standard deviations across) hours within the same day, it groups hours by weekday and calculates statistics across weeks, not days.
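That MASF-style grouping can be sketched roughly as follows (an illustration of the idea only, with made-up data, not SEDS code): hourly observations are bucketed by (weekday, hour), the mean and standard deviation of each bucket are computed across several weeks of history, and the control limits are mean plus/minus k standard deviations.

```python
# Illustrative MASF-style baselining: group hourly samples by (weekday, hour)
# across weeks, then derive control limits per group. Data is made up.
from collections import defaultdict
from statistics import mean, stdev

# history: (weekday, hour, value) tuples spanning 4 weeks of hourly data
history = [(wd, h, 40 + 5 * wd + (h % 3) + week)
           for week in range(4) for wd in range(7) for h in range(24)]

groups = defaultdict(list)
for wd, h, v in history:
    groups[(wd, h)].append(v)          # same weekday+hour across all weeks

def limits(wd, h, k=3.0):
    """Lower/upper control limits: mean +/- k standard deviations of the group."""
    vals = groups[(wd, h)]
    m, s = mean(vals), stdev(vals)
    return m - k * s, m + k * s

lo, up = limits(2, 14)                  # e.g. Wednesday, 2 p.m.
print(round(lo, 1), round(up, 1))
```

A new observation for a given weekday and hour would then be compared only against the limits of its own (weekday, hour) group, not against the whole day.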

(You could find more details in my last paper. Links to some papers related to this subject including my papers can be found in this blog )

BTW, in response to his publication, I did a special analysis to see how far from normal the data used by SEDS is, and some results of this research have been published in one of my papers. My opinion is that some data is close to normal and some indeed is not so close; it depends on the metrics, subsystems, environment (prod/non-prod), and how the data is grouped.

The key is what type of threshold a SEDS-like product uses to establish a baseline. It could be very simple (a static one, or one based on standard deviations), but it could also be more complex, such as a combination of static thresholds (based on expert experience) and simple statistical ones (based on standard deviations). SEDS uses that combination, and SEDS has several tuning parameters to tune it to capture meaningful exceptions. I believe this approach is valuable (and cheap) for practical usage, and several successful implementations of SEDS prove that.
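A minimal sketch of such a combined threshold (my own illustration with made-up numbers, not actual SEDS code): the static limit, set from expert experience, catches "hot" issues immediately, while the statistical limit gives an early warning.

```python
# Illustrative combination of a static (expert) threshold and a statistical one.

STATIC_LIMIT = 95.0          # empirical "hot issue" ceiling, e.g. % CPU busy

def classify(value, baseline_mean, baseline_sd, k=3.0):
    """Return the alert level for one observation against both thresholds."""
    if value > STATIC_LIMIT:
        return "hot"          # run-away / leak territory: alert immediately
    if value > baseline_mean + k * baseline_sd:
        return "warning"      # statistically unusual: early warning
    return "ok"

print(classify(97.0, 50.0, 5.0))   # above the static ceiling
print(classify(70.0, 50.0, 5.0))   # above mean + 3 sd, below the ceiling
print(classify(55.0, 50.0, 5.0))   # within the normal range
```

The two thresholds play different roles: the statistical one adapts to each metric's history, while the static one never misses an absolute capacity problem even when the baseline itself has drifted high.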

But for a more accurate analysis of the data, especially if it is far from a normal distribution, other, more advanced statistical techniques could be applied, and it looks like this product implements that. For me, it is just another (more sophisticated) threshold calculation for baselining. Anyway, I continue improving my approach and will keep thinking about what they and others do in this area.

Another interesting observation I got from the Integrien tool live presentation:
the rate of dynamic threshold exceedances is so large that they had to add an additional (static???) threshold, on the assumption that some number of exceptions is kind of normal, just noise that should be ignored. That means that if the number of exceptions is bigger than that threshold, a smart alert is issued. I did not get how this threshold is set or calculated, but it is very high: HUNDREDS (!!!) of exceptions per interval. I believe the reason is that they apply the “anomaly” detector to too-granular data. As I stated in my last paper, better results can be achieved by doing the statistical analysis after some summarization (SEDS does that mostly after averaging the data to hourly values).
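The effect of that summarization can be sketched as follows (illustrative data and limit only): averaging raw minute-level samples into hourly points smooths out short spikes, so far fewer points cross the limit and far fewer raw "exceptions" are generated.

```python
# Illustrative effect of summarization: count threshold exceedances on raw
# (minute-level) data vs. hourly averages. Data and limit are made up.
from statistics import mean

limit = 60.0
# one day of minute samples: flat load of 50 with a 2-minute spike every hour
raw = [(90.0 if m < 2 else 50.0) for _h in range(24) for m in range(60)]

raw_exceptions = sum(1 for v in raw if v > limit)
hourly = [mean(raw[h * 60:(h + 1) * 60]) for h in range(24)]
hourly_exceptions = sum(1 for v in hourly if v > limit)

print(raw_exceptions, hourly_exceptions)
```

In this toy example the minute-level data generates dozens of exceedances per day, while the hourly averages generate none, which is the point of detecting exceptions only after aggregation.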

BTW, SEDS uses an original meta-metric to detect only meaningful exceptions (EV, or Exception Value - see my last paper), which allows SEDS to keep its false-positive rate very low.

Saturday, January 24, 2009

CMG'08 Trip Report

Visualization and Analysis of Performance Data using R
Jim Holtman

Summary:
I did not attend this one, but it is about the free statistical and graphical tool (the “R” tool and the “S” language, http://www.r-project.org/). Note: there is a SAS dataset interface in an open library: http://lib.stat.cmu.edu/S/dataset
It provides functions that define and manipulate S "dataset" objects. A dataset is a matrix whose columns (variables) may be of different data types. Though motivated by a need to interface with SAS, they are useful in any data analysis. There is also a function related to SPC: JohnsonSystem (http://lib.stat.cmu.edu/S/JohnsonSystem.q)
In 2004 he published a CMG paper about R usage: The Use of R for System Performance Analysis. See also the lecture Graphing in R (http://www.ats.ucla.edu/stat/r/library/lecture_graphing_r.htm) or http://ieee.cincinnati.fuse.net/R_IEEE_V2.pdf

Major takeaways: It might be a good SAS/Graph replacement. I am also thinking about writing an "S" program to build SEDS-type control charts to illustrate how that works; for instance, it could be used for a workshop similar to the one Mr. Holtman gave.


Automating Process Pathology Detection – Rule Engine Design Hints
Ron Kaminski

Summary:
This is about an analytical approach to capturing pathologies like run-aways and memory leaks. BTW, Ron referenced my papers as an example of a different (statistical) approach to doing the same. This is a continuation of his previous work in this field: http://www.cmg.org/proceedings/2003/3027.pdf
In a private conversation he actually expressed some interest in putting both approaches together to see how that works from different angles... I am open to it.

CMG-T: Modeling and Forecasting
Speaker: Dr. Michael A. Salsburg

Summary:
Just a good overview and tutorial of queuing theory and simulation-based modeling and forecasting, versus the statistical modeling/forecasting approach I presented in my paper.

eBay - the Shape of Infrastructure to Come
Speaker: Paul Strong

Summary:
Cloud computing is “Outsourcing 2.0”; sooner or later even banks will use this approach to get capacity on demand from the cloud instead of having their own computer farm….

Exception Based Modeling and Forecasting
Speaker: Dr. Igor A. Trubin

Summary:
This was my presentation; it was successful and attracted more than 60 attendees. There were a lot of questions and comments before and during this session. Positive comments were received from Mark Friedman (after I clarified for him the 3-D concept of weekly control charts... my bad, I was probably not very clear presenting it...) and from Ron Kaminski, who expressed some interest in my EV algorithm for capturing recent bad trends, as it solves some problems of the workload pathology recognition he has been working on recently.

So You Want to Manage Your z-Series MIPS? Then Detect & Control Application Workload Variance!
Speaker: John S. Van Wagenen, Caterpillar

Summary:
Unfortunately, I could not attend this session, as I presented mine at the same time. But this paper is about a SEDS-like approach to managing mainframe capacity! And the presentation got the prestigious Mullen award!
There is a similar paper written by the same author last year: Performance Monitoring Process for Out of Standard Applications
Major takeaways:
The SEDS approach is valid, and our mainframe implementation might be adjusted using this paper's methodology.

Predicting the Relative Performance of CPU
Speaker: Debbie Sheetz

Summary:
I used a similar approach in the past (see my 1st CMG paper and the 1st figure in my last paper) and know how challenging it is to apply SPEC or other benchmarks to real servers with different configurations.
Major takeaways:
This paper could be helpful in some consolidation projects.

Panel: Michelson Panel - Visualization
Speaker: Jeff Buzen

Summary:
It was interesting to see different ways to present data visually. During the panel discussions I realized that my weekly control charts, and especially the 3-D version of them, are kind of unique. I even approached Dr. Buzen with my comments about that…

Mainstream NUMA and the TCP/IP stack
Speaker: Mark B. Friedman

Summary:
This is a brilliant but very scary paper. Two scary points:
A. For multicore servers, the speed of memory access can be unpredictable and sometimes deadly slow because of NUMA (non-uniform memory access). And there are no metrics or tools to measure that!
B. A high-performance network (1-10 Gb and higher) cannot be fully utilized, because it might consume all CPU cycles just to process network-related interrupts.

Major takeaways:
It’s OK if network interface bandwidth utilization is low. And we should be careful when using modern multicore processors (8 and more cores).


Performance and Capacity Management in an Outsourced Environment
Speaker: Jeff Hammond

Summary:
This is very useful information about what we can expect when working with an outsourced service (people), or if we get outsourced ourselves. It confirms my own experience.
Action Items: Be prepared just in case!