
Monday, December 30, 2013

"Review of IT Control Chart" - my new paper in Journal of Emerging Trends in Computing and Information Sciences

Try building IT-Control Charts with the free Perfomalist.com web tool:

Full Text: at RG
Author: Igor Trubin
ISSN: 2079-8407
Pages: 857-868
Volume No.: 4
Issue No.: 11
Issue Date: December 01, 2013
Publishing Date: December 01, 2013
Keywords: Control Chart, Six Sigma Tools, SPC


Abstract

The Control Chart is one of the main Six Sigma tools used to optimize business processes. After some adjustments it is now used as a visualization tool in IT Capacity Management, especially in “behavior learning” products, to highlight performance and capacity usage anomalies. This review answers the following questions: What is the Control Chart, how is it read and where is it used? Which performance tools use it? What are the Control Chart types (MASF charts vs. classical SPC), and what is the IT-Control Chart for IT application performance control? How can a Control Chart be built using Excel for interactive analysis, or with R scripting to do it automatically?
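For readers who want to try the R route mentioned in the abstract, here is a minimal sketch of the idea (not code from the paper itself; the data frame layout with `timestamp` and `cpu_util` columns is an assumption for illustration): group a metric by hour of week, compute a MASF-style baseline mean and ±3-sigma control limits, and overlay the most recent data.

```r
# Minimal MASF-style IT-Control Chart sketch in base R.
# Assumed input: data frame 'df' with a POSIXct 'timestamp' and a numeric
# 'cpu_util' column, one row per hour (illustrative assumptions only).
build_control_chart <- function(df, weeks_back = 4) {
  df$hour_of_week <- (as.integer(format(df$timestamp, "%u")) - 1) * 24 +
                     as.integer(format(df$timestamp, "%H"))
  cutoff   <- max(df$timestamp) - weeks_back * 7 * 24 * 3600
  baseline <- df[df$timestamp <  cutoff, ]   # historical reference set
  actual   <- df[df$timestamp >= cutoff, ]   # most recent data to compare

  stats <- aggregate(cpu_util ~ hour_of_week, baseline,
                     function(x) c(mean = mean(x), sd = sd(x)))
  stats <- data.frame(hour_of_week = stats$hour_of_week,
                      mean = stats$cpu_util[, "mean"],
                      sd   = stats$cpu_util[, "sd"])
  stats$ucl <- stats$mean + 3 * stats$sd            # upper control limit
  stats$lcl <- pmax(stats$mean - 3 * stats$sd, 0)   # lower control limit

  plot(stats$hour_of_week, stats$mean, type = "l", ylim = c(0, 100),
       xlab = "Hour of week", ylab = "CPU %", main = "IT-Control Chart (sketch)")
  lines(stats$hour_of_week, stats$ucl, lty = 2)
  lines(stats$hour_of_week, stats$lcl, lty = 2)
  points(actual$hour_of_week, actual$cpu_util, pch = 20)  # actual data on top
  invisible(stats)
}
```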


Friday, December 20, 2013

I can be seen at G+, t, f, in, VK, YouTube and finally at ResearchGate!

I like to be seen. I am here on Blogger, and you can also find me at:
- Google+
- Twitter,
- Facebook,
- LinkedIn,
- VKontakte
- YouTube

and finally I have found a place where my research writings in particular can be seen:
- ResearchGate

I welcome you to join, subscribe and follow!

BTW, there are many more networks out there:
...
I need to consider being there too.... If you are already somewhere where I am not - INVITE!

Friday, December 13, 2013

CMG’13 paper about VMware memory over-commitment, Memory State performance counter, Ballooning, Swapping and Memory Reservations. A few citations.

I attended Mark B. Friedman's presentation "Performance Management in the Virtual Data Center: Virtual Memory Management" and learned a lot. I would like to share here a few of the most informative (in my opinion) citations from Mark's paper about

-        Memory over-commitment
-        Memory State performance counter,
-        Ballooning, Swapping and
-        Memory Reservations.

Introduction.
“…This paper explores the strategies that VMware ESX employs to manage machine memory, focusing on the ones that are designed to support aggressive consolidation of virtual machine guests on server hardware..”

Memory over-commitment
“…Allowing applications to collectively commit more virtual memory pages than are actually present in physical memory, but biasing the contents of physical memory based on current usage patterns, permits operating systems that support virtual memory addressing to utilize physical memory resources very effectively…”

Memory State performance counter
“…VMware’s current level of physical memory contention is encapsulated in a performance counter called Memory State. This Memory State variable is set based on the amount of Free memory available. Memory state transitions trigger the reclamation actions reported in Table 1:

State | Value | Free Memory Threshold | Reclamation Action
High  | 0     | > 6%                  | None
Soft  | 1     | < 6%                  | Ballooning
Hard  | 2     | < 4%                  | Swapping to Disk or Pages compressed
Low   | 3     | < 2%                  | Blocks execution of active VMs > target allocations
..”
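Just to make the table concrete, here is a toy R sketch of the state-transition logic from Table 1 (my own illustration, not actual VMware code):

```r
# Toy illustration of the Memory State transitions in Table 1 (not VMware code).
memory_state <- function(free_pct) {
  if (free_pct > 6)      list(state = "High", value = 0, action = "None")
  else if (free_pct > 4) list(state = "Soft", value = 1, action = "Ballooning")
  else if (free_pct > 2) list(state = "Hard", value = 2, action = "Swapping to disk / page compression")
  else                   list(state = "Low",  value = 3, action = "Block active VMs above target allocation")
}

memory_state(5.2)  # "Soft": ballooning kicks in once free memory drops below 6%
```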

Ballooning
“…ballooning occurs when the VMware Host recognizes that there is a shortage of machine memory and must be replenished using page replacement. Since VMware has only limited knowledge of current page access patterns, it is not in a position to implement an optimal LRU page replacement strategy. Ballooning attempts to shift responsibility for page replacement to the guest machine OS, which presumably can implement a more optimal page replacement strategy than the VMware hypervisor.
…  Using ballooning, VMware reduces the amount of physical memory available for internal use within the guest machine.

In Windows, when VMware’s vmmemsty.sys balloon driver inflates, it allocates physical memory pages and pins them in physical memory until explicitly released. To determine how effective ballooning works to relieve a shortage of machine memory condition, it is useful to drill into the guest machine performance counters and look for signs of increased demand paging and other indicators of memory contention….

…ballooning successfully transforms the external contention for machine memory that the VMware Host detects into contention for physical memory that the Windows guest machine needs to manage internally...”

Swapping
“… VMware has recourse to steal physical memory pages granted to a guest OS at random, which VMware terms swapping, to relieve a serious shortage of machine memory. When free machine memory drops below a 4% threshold, swapping is triggered..”

Memory Reservations.

“In VMware, customers do have the ability to prioritize guest machines so that all tenants sharing an over-committed virtualization Host machine are not penalized equally when there is a resource shortage. The most effective way to protect a critical guest machine from being subjected to ballooning and swapping due to a co-resident guest is to set up a machine memory Reservation. A machine memory Reservation establishes a floor guaranteeing that a certain amount of machine memory is always granted to the guest. With a Reservation value set, VMware will not subject a guest machine to ballooning or swapping that will result in the machine memory granted to the guest falling below that minimum…”

Wednesday, November 20, 2013

MSDN Blog post: "Statistical Process Control Techniques in Performance Monitoring and Alerting" by M. Friedman

I met Mark Friedman again at CMG'13 and also attended his session (I will post my impressions later).

Mark is my teacher, and I respect him very much. Once I attended his Windows Capacity Management class in Chicago. I always try to go to his presentations, read his books and follow his online activities. Just today, checking his activities online, I ran into his 2010 post on the MSDN Blog that relates to my (this) blog very much:

MSDN Blogs Developer Division Performance Engineering blog > Statistical Process Control Techniques in Performance Monitoring and Alerting

I very much appreciate that he mentioned my blog and my name (with a little misprint...):

".... a pointer to Igor Trobin's work, which I believe is very complementary. Igor writes an interesting blog called “System Management by Exception.” In addition, Jeff Buzen and Annie Shum published a very influential paper on this subject called “MASF: Multivariate Adaptive Statistical Filtering” back in 1995. (Igor’s papers on the subject and the original Buzen and Shum paper are all available at www.cmg.org.)... "

This post of Mark's was a response to a critique of a Charles Loboz CMG paper made by Uriel Carrasquilla, a Microsoft performance analyst. I attended that presentation and had some doubts too, which I expressed during the presentation. BTW, I have commented on another of Charles's CMG papers in my blog: Quantifying Imbalance in Computer Systems: CMG'11 Trip Report. In my opinion that CMG'11 paper was much better!
(Normalized Imbalance Coefficient, from the paper)

BTW, I have also made comments on Mark Friedman's CMG'08 paper: Mainstream NUMA and the TCP/IP stack. His presentation was, as usual, very influential! See details in my CMG'08 Trip Report.

And I am about to comment on his CMG'13 presentation. Check the next post!

CMG’13 workshops: "Application Profiling: Telling a story with your data"

The subject was introduced by R. Gilmarc (CA) in his CMG'11 paper: IT/EV-Charts as an Application Signature: CMG'11 Trip Report, Part 1. This time he showed us some additional development of the idea, such as "BIFR":

What is in our Application Profile?
• Workload – description of transaction arrival pattern
• Infrastructure – subset of infrastructure supporting our application
• Flow – server-to-server workflow
• Resource – CPU and I/O consumed per transaction at each server


Why is an Application Profile useful?
• Prerequisite for application performance analysis and capacity planning
• Directs & focuses application performance tuning efforts
• Building block for data center capacity planning
• Serves as input to a model

Some modeling approaches were included in the Application Profile idea (e.g. CPU% vs. business transactions), and the flow is presented as a diagram from the HyPerformix tool, which is now a CA tool.
I see the BIFR profile is suitable as input for a predictive model run on the Performance Optimizer part of HyPerformix.

Also interesting is the attempt to use BIFR for virtual server (LPAR) consolidation, which includes TPP (Total Processing Power) benchmarks. Most interesting is the usage of a “Composite Resource Usage Index” to identify LPARs that have high resource usage across all three dimensions: TPP Percent, I/O Percent and Memory Percent. It looks like it allows LPARs to be combined optimally on different physical hosts in a ”tetris” way.

I appreciate that he mentioned my name in the slides (in the “related work” section), and during his presentation there was some discussion about IT-Control Charts. I still believe that an IT-Control Chart without the actual data plotted (see below a copy from my old post), built for the main server resource usage metrics (CPU, memory and I/O) plus the main business transactions and response time (the same IT-Control Charts should be built for those – I published a couple of examples in my other papers), could be a perfect representation of any application and can also be treated as an application profile!


For consolidation or workload placement exercises they can be condensed to a few numbers per application, for instance the maximum of the weekly upper limits for each chart. Those numbers could be treated as application profile parameters and then used for placing/moving (in a cloud) purposes, for example to be analyzed by some statistical clustering algorithm; a sketch of that idea follows below. By the way, other cloud management tools (e.g. CiRBA) already do similar profiling for this.
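Here is a hedged R sketch of that condensation idea (the input layout and column names are assumptions for illustration): keep the maximum weekly upper control limit per application and resource as the profile parameters, then feed them to a simple clustering algorithm for placement analysis.

```r
# Illustrative sketch only: condense IT-Control Chart upper limits into an
# application profile and cluster applications for placement analysis.
# Assumed input: data frame 'limits' with columns app, resource (cpu/mem/io),
# hour_of_week and ucl (the weekly upper control limit for that hour).
profile <- aggregate(ucl ~ app + resource, data = limits, FUN = max)

# One row per application, one profile parameter per resource.
profile_wide <- reshape(profile, idvar = "app", timevar = "resource",
                        direction = "wide")
rownames(profile_wide) <- profile_wide$app

features <- scale(profile_wide[, -1])              # normalize profile parameters
placement_groups <- kmeans(features, centers = 3)  # 3 groups chosen arbitrarily
split(rownames(profile_wide), placement_groups$cluster)  # apps grouped for placement
```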

Another interesting idea which was also presented in the workshop is “Application invariants”. I may discuss that in another post…




Tuesday, November 12, 2013

HP techreport: "Statistical Techniques for Online Anomaly Detection in Data Centers". My critique.

SOURCE: HPL-2011-8 Statistical Techniques for Online Anomaly Detection in Data Centers - Wang, Chengwei; Viswanathan, Krishnamurthy; Choudur, Lakshminarayan; Talwar, Vanish; Satterfield, Wade; Schwan, Karsten



The subject of the paper is extremely good, and this blog is the place to discuss that type of matter, as you can find here numerous discussions about tools and methods that solve basically the same problem. Below is the introductory paragraph with the key assumptions of the paper that I have some doubts about:


1. MASF uses a reference set as a baseline, based on which the statistical thresholds (UCL, LCL) are calculated. Originally the suggestion was to keep that baseline static (not changing) over time, so the baseline is always the same. Developing my SETDS methodology, I modernized the approach, and now SETDS mostly uses a baseline that slides from the past towards the present, ending just where the most recent “actual data” starts (and the mean is actually a moving average!). So it is still a MASF-like way to build thresholds, but they change over time, self-adjusting to pattern changes. I call that “dynamic thresholding”; a minimal sketch of this sliding-baseline idea is shown after these notes. BTW, after SETDS, some other vendors implemented this approach, as you can see here: Baselining and dynamic thresholds features in Fluke and Tivoli tools

2. A few years ago I had an intensive discussion about the data “normality” assumption with the founder of the Alive (Integrien) tool (now part of VMware vCOPS): Real-Time Statistical Exception Detection. So vCOPS now has the ability to detect real-time anomalies applying a non-parametric statistical approach. SETDS also has the ability to detect anomalies (my original term is statistical exceptions) in a real-time manner if applied to near-real-time data: Real-Time Control Charts for SEDS
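Here is a minimal R sketch of the sliding-baseline “dynamic thresholding” idea from note 1 (my illustration only; it assumes an evenly sampled metric vector): the reference set always ends right where the most recent data starts, so the limits self-adjust to pattern changes.

```r
# Sliding-baseline (dynamic) thresholds, illustrative sketch.
# x: numeric vector of a metric sampled at regular intervals;
# baseline_len: how many points form the moving reference set (default: 4 weeks of hours).
dynamic_limits <- function(x, baseline_len = 4 * 168, k = 3) {
  n <- length(x)
  ucl <- lcl <- center <- rep(NA_real_, n)
  for (i in (baseline_len + 1):n) {
    ref <- x[(i - baseline_len):(i - 1)]   # baseline slides up to "now"
    center[i] <- mean(ref)                 # effectively a moving average
    ucl[i] <- center[i] + k * sd(ref)      # self-adjusting upper limit
    lcl[i] <- center[i] - k * sd(ref)      # self-adjusting lower limit
  }
  data.frame(x = x, center = center, ucl = ucl, lcl = lcl)
}
```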

The other part of the paper mentions the usage of a multiple-time-dimension approach, which is not really new. I explored a similar one during my IT-Control Chart development by treating the data as a data cube with at least two time dimensions – weeks and hours – and also comparing a historical baseline with the most recent data; see details in the most popular post of this blog: One Example of BIRT Data Cubes Usage for Performance Data Analysis:

Section III of the paper describes a way of using the “Tukey” method, which is definitely valid as a non-parametric way to calculate UCL and LCL (I should try to do that). I usually just use percentiles (e.g. UCL = 95th and LCL = 5th) if the data are apparently not normally distributed.
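Both non-parametric ways to set the limits fit in a few lines of R (a sketch, not the paper's exact procedure):

```r
# Two non-parametric ways to set UCL/LCL for a metric vector x (sketch).
tukey_limits <- function(x, k = 1.5) {
  q1  <- quantile(x, 0.25, na.rm = TRUE, names = FALSE)
  q3  <- quantile(x, 0.75, na.rm = TRUE, names = FALSE)
  iqr <- q3 - q1
  c(lcl = q1 - k * iqr, ucl = q3 + k * iqr)   # Tukey fences
}

percentile_limits <- function(x)
  quantile(x, c(0.05, 0.95), na.rm = TRUE)    # LCL = 5th, UCL = 95th percentile
```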

Part B of Section III in the paper is about “windowing approaches”. It is interesting as it compares collections of data points and how well they fit a given distribution. It reminds me of another CMG paper that had a similar approach of calculating the entropy of different portions of the performance data. See my attempt to use an entropy-based approach to capture some anomalies here: Quantifying Imbalance in Computer Systems

Finally, the results of some tests are presented at the end of the paper. It is a really interesting comparison of different approaches, though I am not sure they used MASF, and it would also be interesting to compare the results with SETDS… But in the “related work” part of the paper I unfortunately did not notice any of the recent, well-known and widely used implementations of anomaly detection techniques (except MASF) that are very well presented in this blog (including SEDS/SETDS).

Tuesday, November 5, 2013

Enjoying CMG'13 conference in La Jolla, CA. Detailed report is coming...

Sunday, October 6, 2013

Forget cloud computing... Soon we will be lost in FOG COMPUTING!

Re-posting from my Facebook friend:
"Forget cloud computing. According to Yahoo's white paper, the crux of the new offering is a technology void of any datacenters, drawing instead on the untapped resources that exist virtually everywhere. Those resources can range from unused space on smartphones and other wireless devices to onboard computers and dashboard systems in automobiles to the underutilized brain power of America's teenage population. FOG COMPUTING!"

Thursday, October 3, 2013

I have introduced the SETDS methodology to the following IT organizations.


The SETDS (Statistical Exception & Trend Detection) idea was born at Capital One and was first published in 2001 in my first www.CMG.org white paper:

Then, over the last 12 years, through participation in various projects I have introduced, and in some cases partially implemented, the SETDS methodology for the following companies:

- IBM,
- SunTrust,
- Coca Cola,
- WellPoint,
- ING,
- JP Morgan Chase,
- State Farm

Tuesday, September 17, 2013

Performance and Capacity 2013 by CMG.org - I GO!

Wednesday, September 11, 2013

CMG 2014/2015 Board of Directors Election - VOTE FOR ME!

Final update:


2015 UPDATE: Thanks to all who voted for me last year! I am resubmitting my nomination again for this year.
_______
I have recently updated the following post: CMG Board of Directors Nomination
because this year I am nominated. If you are a CMG member, VOTE FOR ME!!!
Voting in CMG's annual election is occurring now.  The voting deadline is this Friday, September 13.  Please take part in shaping CMG's future by voting.
To vote, go to CMG's website. Click on "Members Center Login" in the top right of the screen. (If you don't know your password, you can request that it be sent to you.) Next, click on "CMG 2014 Board of Directors Election" and the rest is self-explanatory.

Please do this now.  Lots of things are going on in CMG and your vote counts!  

THANK YOU! 

Saturday, June 8, 2013

No obvious thresholds for a metric? - Analyze EV meta-metric to capture saturation point!

Some data (metrics) do not have obvious thresholds – for instance, the overall disk I/O rate for a server. You have to analyze each particular disk to find hotspots, using the I/O rate in conjunction with the disk busy metric, but that is very labor-intensive work which is hard to automate. How to do that I explained in my CMG’03 paper “Disk Subsystem Capacity Management, Based on Business Drivers, I/O Performance Metrics and MASF”.

In that paper I also suggested using Control Limits as a Dynamic Threshold. I also suggested collecting and analyzing the EV meta-metric, called there “Extra IOs”, because for the I/O rate parent metric the EV (Exception Value) has a physical meaning (the additional and unusual number of I/Os the system processed).

The EV meta-metric behaves like the 1st derivative of the parent metric. If the metric stays constant (between limits), EV = 0; if it grows linearly, EV = CONSTANT > 0; if it goes down linearly, EV = -CONSTANT < 0; and so on.

That fact can help to automatically identify some important patterns in the data history using the very simple and universal threshold EV = 0. Analyzing the EV trend could even help to predict some future states of the system.

If EV is positive and then becomes mostly zero, that could be an indication of saturation. And saturation usually indicates some capacity issue.
If EV was mostly zero and starts to be positive, that could be an indication of a trend beginning.

To illustrate that, I am using the population growth (logistic) curve to simulate a trend starting and a saturation point being reached, as that curve naturally has both points (S-curve).
Plus, I have randomized that curve by adding some random component to simulate volatility. See in the picture how EV behaves, indicating both events:
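Here is a minimal R sketch of that simulation (my illustration; the limits are simple percentile-based control limits over a sliding baseline): generate a noisy logistic curve, compute EV per window as the excess above the UCL minus the deficit below the LCL, and watch EV go positive when the trend starts and fall back towards zero at saturation.

```r
# Illustrative sketch: EV behaviour on a noisy logistic (S-shaped) curve.
set.seed(42)
time_idx <- 1:300
x <- 100 / (1 + exp(-(time_idx - 150) / 20)) + rnorm(300, sd = 3)  # growth + noise

# EV for each window of "actual" data against percentile-based limits
# computed from the preceding (sliding) baseline window.
ev_series <- function(x, baseline_len = 50, window = 10) {
  idx <- seq(baseline_len + window, length(x), by = window)
  sapply(idx, function(i) {
    ref    <- x[(i - baseline_len - window + 1):(i - window)]  # sliding baseline
    actual <- x[(i - window + 1):i]                            # most recent window
    ucl <- quantile(ref, 0.95, names = FALSE)
    lcl <- quantile(ref, 0.05, names = FALSE)
    sum(pmax(actual - ucl, 0)) - sum(pmax(lcl - actual, 0))    # EV = excess above UCL minus deficit below LCL
  })
}

ev <- ev_series(x)
plot(ev, type = "h", xlab = "window", ylab = "EV",
     main = "EV > 0 while the trend grows, back to ~0 at saturation")
abline(h = 0, lty = 2)
```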
 
Of course an experienced analyst can see just by eyeballing when the saturation started, but in the "big data" era we have to deal with dozens of thousands of systems, so we have to automate this type of pattern capture!

And EV could help.

Thursday, May 30, 2013

CMG papers: Knee detection vs. EV based trends detection (SETDS)

The CMG’12 paper “A Note on Knee Detection“ (J. Ferrandiz, A. Gilgur) presented a method of “system phase change” detection by using the piece‐wise linear model against data with any supply‐demand relationship, e.g. CPU vs. transactions, load vs. traffic.


The weakness of the approach is the following underlying assumption in the methodology: most of the data points are in the low-load region. But all in all, it is a relatively simple and effective way to capture the fact that the data constantly exceeds some threshold (a confidence level, e.g. 95%) beyond the detected “knee”.
I see some similarity to my method of detecting system phase changes (trend detection implemented by SETDS).

Based on my CMG'08 paper “Exception Based Modeling and Forecasting”, I use the EV (Exception Value) meta-metric to detect pattern changes in the data. The phases in the data should be separated by the roots of the EV = 0 equation, because for EV > 0 the data mostly exceeds the upper control limit, for EV < 0 it is mostly below the lower control limit, and the data is stable where EV = 0. A small sketch of that rule follows below.
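That phase-separation rule can be sketched in a few lines of R (illustrative only), given an EV series with one value per window, e.g. from the ev_series() sketch above: the phase boundaries are simply the points where the EV series changes sign, i.e. the roots of EV = 0.

```r
# Illustrative sketch: split a time series into phases at the roots of EV = 0.
# 'ev' is a numeric EV series (one value per window).
ev_phases <- function(ev, tol = 0.05 * max(abs(ev))) {
  s <- ifelse(abs(ev) < tol, 0, sign(ev))   # -1 mostly below LCL, 0 stable, +1 mostly above UCL
  boundaries <- which(diff(s) != 0) + 1     # indices where the EV sign changes
  phase_id <- cumsum(seq_along(ev) %in% boundaries) + 1
  split(seq_along(ev), phase_id)            # windows grouped into detected phases
}
```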


But I have used my method only for time-series data (EV = f(t)), so the detected phases are separated by points in time. Is it possible to apply my EV-based approach to non-time-series data? I am not sure. But knowing that EV is just the difference between the actual data and the control limits (e.g. percentile based), the above-mentioned knee detection algorithm could be seen as some kind of EV-based approach applied to non-time-series data…
And I believe the EV-based approach is free from the assumption that the data should be somewhat imbalanced, and it can also detect multiple “knees” in both directions (up and down). Only one caveat still exists (the same as with the knee detection algorithm): too many phases get detected, but that could be tuned by changing the limits, e.g. from the 95th percentile to the 99th, and by grouping the data (like aggregating minutes into hours for time-series data).


Tuesday, May 21, 2013

"Is your Capacity Available?" - A topic for CMG'13 Conference Paper

2016 UPDATE: Finally the paper was written, presented and published at www.CMG.org.
Here are the presentation slides:

_____
Capacity Management and Availability Management are two interconnected services. That connection is getting more important in the current era of virtualization and clustering, and especially for cloud computing. Obviously, IT customers not only want sufficient capacity for their applications, but, even more importantly, they want that capacity to be highly available.

I have had to deal with that combination recently and firmly believe it could be a great topic for the upcoming CMG'13 conference. I plan to attend the conference, but unfortunately, despite having a topic, the title "Is your Capacity Available?" and some related materials already published in this blog, I am not able to write the paper for this year's CMG conference.

Maybe somebody could pick that idea up and share their experience in this mixed area of Capacity and Availability? I would be extremely happy!

Here is the list of my posts related to this subject:




BTW, the last post has a suggestion to estimate each node's (component's) availability (which is needed for a cluster availability calculation) by just looking at the incident record history and using the MTTR from there. Why not? If you have good Incident Management, that could be a very cheap solution! I would suggest calculating different degrees of estimated availability, such as an "Absolute" availability estimation based on up-time completely free of any incidents, or an "N-degree availability" number if only incidents with severity <= N are taken into account, or, filtering down to only the incidents related to a particular component's capacity, a "Capacity Availability". Sure, if the Incident Management service is not mature enough, it will provide incorrect input for the estimation, so you may consider the other mechanisms I mentioned in that post... But on the other hand, that would encourage you (maybe via CSI) to improve your Incident Management service! A back-of-the-envelope sketch of that estimation follows below.
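Here is a back-of-the-envelope R sketch of that estimation (the incident table layout and column names are assumptions for illustration):

```r
# Back-of-the-envelope availability estimate from incident records (sketch).
# Assumed input: data frame 'incidents' with columns node, severity and
# downtime_hours (the repair time actually observed for each incident).
estimated_availability <- function(incidents, node, period_hours = 24 * 365,
                                   max_severity = Inf) {
  inc  <- incidents[incidents$node == node &
                    incidents$severity <= max_severity, ]
  down <- sum(inc$downtime_hours)
  1 - down / period_hours                 # fraction of the period the node was up
}

# "Absolute" availability counts every incident; "N-degree" filters by severity.
# estimated_availability(incidents, "node01")                    # absolute
# estimated_availability(incidents, "node01", max_severity = 2)  # 2-degree
```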





Tuesday, May 7, 2013

CMG'12: "Time‐Series: Forecasting + Regression"

I continue sharing my impressions of last year's (2012) international www.CMG.org conference.


(see previous: HERE and HERE).

This post is about an Alex Gilgur paper. I first met Alex at the CMG'06 conference and put a note about his 2006 paper HERE. We met at several other CMG conferences and talked a lot, mixing high matters (philosophy, physics and math) with our Capacity Management needs.... One of those discussions inspired him to write the following paper:
Time-Series: Forecasting + Regression: “And” or “Or”?

In the agenda announcement he mentioned that fact and I really appreciate it:
"At CMG’11, I had a fascinating discussion with Dr. I.Trubin. We talked about Uncertainty, Second Law of Thermodynamics, and other high matters in relation to IT. That discussion prompted this paper..."

Reading the paper, I was impressed by how he combined trending analysis with the business driver data correlation technique. I do that quite often; one example was published in my CMG'12 paper as well, and the summary slide can be seen here:


But in his work the technique is expressed in a very elegant mathematical way, and he also used Little's Law to fight the most unpleasant statisticians' rule: "Correlation Does Not Imply Causation".
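The plain-vanilla version of combining a trend with a business driver, the way I usually do it, can be sketched in a couple of R calls (an illustration of the general idea, not Alex's method; the column names are assumptions):

```r
# Illustrative sketch: trend + business-driver regression for a capacity metric.
# Assumed input: data frame 'perf' with columns date (Date), cpu_util and transactions.
fit <- lm(cpu_util ~ as.numeric(date) + transactions, data = perf)
summary(fit)   # shows how much the time trend vs. the business driver explains

# Forecast: plug in future dates and an assumed 10% growth of the business driver.
future <- data.frame(date = max(perf$date) + 1:90,
                     transactions = max(perf$transactions) * 1.10)
predict(fit, newdata = future)
```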


In the next post I will put some comments about another of his CMG'12 papers, called:
"A Note on Knee Detection"


Friday, May 3, 2013

SCMG report: Business and Application Based Capacity Management

Another presentation at our April 2013 SCMG meeting was very interesting. Ann Dowling presented the results of a large project she led to improve the capacity management of a big IT organization. Here are the title and the link:

Business and Application Based Capacity Management

Actually, I was involved in that project too. I was there at the beginning of the effort to build a custom Cognos-based reporting process. We tried to use BIRT reporting and then successfully switched to Cognos.

Interestingly, the presentation mentioned “Dynamic thresholds”. I thought it was something like what control charts do (and by the way, during my work on that project I found "out-of-the-box" BIRT-based control chart reports in the TCR, and I suggested using them or building better ones using COGNOS), but it looks like by “Dynamic thresholds” they mean “The ability to change (manually?) to meet customer department adjustments. The threshold metrics can be at the prompt page.”…

I prefer using that term for thresholds that a reporting system automatically changes based on a behavioral learning process.

Much later I played with BIRT and COGNOS to build control charts. See examples below:
BIRT:
COGNOS:

Wednesday, May 1, 2013

SCMG report: Jobs Scheduling in Cloud and Hadoop

2015 UPDATE: Now, having access to HADOOP, I am thinking of how to use Map-Reduce to speed up SETDSing against big performance data. The following thesis could be very helpful for that:
Distributed Anomaly Detection and Prevention for Virtual Platforms by Ali Imran Jehangiri

Last week we ran our Richmond SCMG meeting following the agenda published HERE (links to the presentations are there too, including mine). The 1st presentation was titled "Some Workload Scheduling Alternatives for High Performance Computing Systems" and was presented by Jim McGalliard, a frequent CMG presenter and our friend. He mentioned an old topic he had already presented in the past – supercomputer batch job optimization by categorizing and scheduling jobs. Then, after a brief description of MapReduce


( “method for simple implementation of parallelism in a program..”)

he explained how HADOOP
(“Designed for very large (thousands of processors) systems using commodity processors, including grid systems, Hadoop is a specific open source implementation of the MapReduce framework written in Java and licensed by Apache” )

does job scheduling using MapReduce and some other means.

That presentation led me to another task to consider – job scheduling in the cloud. Ironically, just before the meeting I had read an interesting article about it (BTW, it was recommended reading from my current manager, as we are also going to the cloud… What about you?). Here is the link to that article from one of the authors' webpages (Asit K Mishra) and the title:

”Towards Characterizing Cloud Backend Workloads: Insights from Google Compute Clusters”

I firmly believed that workload characterization was going away due to virtualization - each workload/app can have a separate virtual server now. Right? But based on the article, it looks like job categorization could be useful for optimizing job schedules in the cloud and maybe in HADOOP…

Saturday, April 27, 2013

Modeling the Online Application Migration from Sparc to AIX Platform



The complete set of slides for the presentation I made at the SCMG Richmond meeting can be seen here: http://itrubin.blogspot.com/2013/04/aix-frame-and-lpar-level-capacity.html
The presentation itself was recorded and published here:

Thursday, April 25, 2013

I. Trubin: AIX frame and LPAR level Capacity Planning. User Case for Online Banking Application









[1] Bob Chan: “Unix Server Sizing – Or What to do When There are No MIPS”, Proceedings of the Computer Measurement Group, 2000.
[2] Ray White and Igor Trubin: “System Management by Exception, the Final Part”, Proceedings of the Computer Measurement Group, 2007.
[3] Linwood Merritt and Igor Trubin: “Disk Subsystem Capacity Management, Based on Business Drivers, I/O Performance Metrics and MASF”, Proceedings of the Computer Measurement Group, 2004.
[4] Igor Trubin: “Exception Based Modeling and Forecasting”, Proceedings of the Computer Measurement Group, 2008.
[5] Igor Trubin: “IT-Control Chart”, Proceedings of the Computer Measurement Group, 2009.
[6] Jeffrey Buzen and Annie Shum: “MASF - Multivariate Adaptive Statistical Filtering”, Proceedings of the Computer Measurement Group, 1995, pp. 1-10.
[7] Igor Trubin: “How To Build IT-Control Chart - Use the Excel Pivot Table!”, “System Management by Exception” tech. blog, www.itrubin.blogspot.com.
[8] Linwood Merritt: “A Capacity Planning Partnership with the Business”, Proceedings of the Computer Measurement Group, 2004.