Tuesday, December 28, 2010

Tim Browning: a review of the cloud computing article "Optimal Density of Workload Placement"


Bottom line: a cloud computing resource is really a data center with virtualized components.  A GUI-frontend to an outsourcing arrangement.
           
Maybe the only true “cloud computing” takes place in aircraft. Although, that is debatable.





The Cloud Hype in the paper:

The author proclaims that cloud computing “is not simply the re-branding and re-packaging of virtualization”…then proceeds to show that it is just that. He also states that capacity planning’s use of “trend-and-threshold” analytics is not useful in the cloud infrastructure, yet he defines ‘strategic optimization’ as “proactive, long-term placement of resources based on detailed analysis of supply and demand (compacting)”. I assume he does not understand that ‘supply’ is a threshold – we only have a finite amount of ‘supply’ – and that ‘long-term’ is a trend?

He also states

“Rather than the trend-and-threshold model of planning that is typically employed in legacy physical environments, this new form of planning [my emphasis] is based on discrete growth models (at the VM and/or workload level) and the use of permutations and combinations to determine when to rebalance, when to add or remove capacity, and how the environment will respond to different growth, risk and change scenarios.” 

So, I ask myself,  what’s new about ‘discrete growth models’? Where does he get the “growth, risk and change scenarios” -- (wait, don’t tell me…from trend-and-threshold thinking)?  Maybe he is being discreet about the discrete models (thus avoiding being discreetly indiscrete)?

Permutations and combinations say nothing about end-state solutions relative to (long- or short-term) time-series load patterns. They are time-static, so ‘when to add or remove’ is not part of those computational functions. Perhaps what he means is which arrangement is ‘best’? Perhaps he is thinking of ‘on demand’ capacity, wherein capacity planning is replaced by ‘instant’ capacity in response to ‘change’? Which is to say, there is no planning…just rapid and efficient deployment of some kind of limitless unseen capacity?

What is ‘new’ about combinations and permutations? The newest development I know of in this area is perhaps combinatorial optimization, which consists of finding the optimal solution to a mathematical problem in which each solution is associated with a numerical “cost”. It operates on the domain of optimization problems, in which the set of feasible solutions is discrete or can be reduced to discrete (in contrast to continuous), and in which the goal is to find the best solution (lowest cost). (Developed in the early 70’s as linear and integer programming in operations research and similar to the root mean square error criteria for evaluating competing forecast models using neural networks or statistical methods).
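To make the point concrete, here is a toy sketch (my own illustration, not anything from the reviewed paper) of what combinatorial optimization for workload placement actually looks like: each feasible assignment of workloads to hosts gets a numerical cost, and the solver searches the discrete set of assignments for the lowest-cost one. The workload demands and host capacities are invented example numbers.

```python
from itertools import product

# Hypothetical demands (CPU units) and host capacities for the example.
workloads = {"w1": 30, "w2": 45, "w3": 20, "w4": 25}
hosts = {"h1": 100, "h2": 100}

def cost(assignment):
    """Cost = highest host utilization; infeasible placements cost infinity."""
    load = {h: 0 for h in hosts}
    for w, h in assignment.items():
        load[h] += workloads[w]
    if any(load[h] > hosts[h] for h in hosts):
        return float("inf")
    return max(load[h] / hosts[h] for h in hosts)

# Exhaustive search over all assignments (fine for a toy problem; real
# solvers use integer programming or heuristics, exactly as in the
# operations-research work of the early 70's mentioned above).
names = list(workloads)
best = min(
    (dict(zip(names, combo)) for combo in product(hosts, repeat=len(names))),
    key=cost,
)
print(best, round(cost(best), 2))
```

Note what this does and does not answer: it says which static arrangement is cheapest, but nothing by itself about *when* to rebalance over time.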

So, knowing how many ways you can combine 887 disks on the same I/O path (combinatorics) tells me when to add or remove some if referenced to a discrete growth model? Wow.. yes, that is so NEW…well, for 1968, maybe. 

 Subsequently, he states

 “the natural changes in utilization over time caused by organic growth will tend to push the limits on the configured capacity. Furthermore, the ability to configure capacity is relatively new to IT, and there are typically no existing processes in place to catch misallocation situations.” 

Perhaps the ability to “configure capacity” is new to him; it is in no way new to enterprise IT.  So, trend – a legacy term – is not, per the author, ‘changes in utilization over time’, and ‘configured capacity’ is not a threshold? There are ‘no existing processes in place’ to catch misallocation situations? What? None? I suppose by ‘misallocation situation’ he means that a capacity shortfall isn’t a capacity issue, it’s an “allocation issue”. Somewhere – over the rainbow – there is capacity going to waste, but it’s not available for some reason. It’s just been ‘misallocated’. Sort of…misplaced. We must go find it. Instantly.

OK….So do I like anything about this paper?

Some ideas in the paper I DO like:

Workload density – the degree of consolidation of work into one image (of the OS) - is a cool concept where ‘contention for resources’ is a boundary condition for ‘workload placement’. How is this done? “Contention probability analysis”, which involves analyzing the operational patterns and statistical characteristics of running workloads in order to determine the risk of workloads contending for resources. The author uses the phrase, “Patterns and statistical characteristics”. So, in effect, ‘contention probability analysis’ is a ‘trend-and-threshold’ technique (although he thinks it isn’t). I am surprised he didn’t rebrand ‘statistical cluster analysis’ as also something new and revolutionary just hot from computer science labs - yet another form of blessed combinatorics optimization. Where this idea has been usefully applied at KC: SAP Batch Workload time density – the degree of consolidation of batch work into the same time intervals.  In this case a boundary condition for workload ‘time placement’ would examine workload (demand) leveling and distribution to avoid unnecessary spikes for time-movable workloads.
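As a rough sketch of what “contention probability analysis” can amount to in practice (my reading of the idea, not the author's algorithm; all demand numbers are invented): compare the historical per-interval demand of two workloads and estimate how often their combined demand would exceed a shared capacity if they were co-placed.

```python
# Estimate the risk of two workloads contending for a shared resource,
# using their historical per-interval demand patterns.

def contention_probability(demand_a, demand_b, capacity):
    """Fraction of observed intervals where co-placed demand exceeds capacity."""
    overlaps = [a + b > capacity for a, b in zip(demand_a, demand_b)]
    return sum(overlaps) / len(overlaps)

web_cpu   = [20, 35, 80, 90, 85, 40, 25, 20]   # peaks during business hours
batch_cpu = [70, 60, 10,  5, 10, 55, 75, 70]   # peaks off-hours

# Anti-correlated demand patterns rarely collide, so they pack well together:
print(contention_probability(web_cpu, batch_cpu, capacity=100))  # → 0.0
```

Which is, of course, exactly a trend-and-threshold computation: the demand series are the trend, and the capacity is the threshold.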

Another idea I like:

He suggests that workloads are best characterized by their statistical properties, rather than “up front descriptions of their demand characteristics”. Thus workloads are ‘placed’ using segmentation of the resource demand profiles (to avoid imbalance, etc.). Which is to say, workloads are aggregations of activity with  common ‘demand characteristics’. In queuing theory, the classification of incoming transactions into resource-based profiles which are used for priority dispatching protocols against an array of appropriately resource mapped servers will always produce a more optimal process model in terms of throughput and average response times in contrast to a queuing network where transactions are not classified based on resource requirements. This was the basis for batch initiator job class definitions in the mainframe world of the 1970’s. It worked then also. It will work for ‘clouds’ too. 

The only ‘up front descriptions of demand characteristics’ that I know of would be the results of demand/performance modeling and/or LoadRunner-type benchmarking. This is still useful for ‘start-state’ sizing of the target landscape.

So…bottom line: interesting concepts or ‘new ways of conceptualizing’ the functional parametric states of virtualized landscapes. Suggestions (but no concrete explanations) that combinatorial optimization techniques can be utilized for capacity planning (implying it is not now being used). Interesting and useful applications for event densities and statistical profiling.

It seems so important, especially to vendor environments, to reinvent the wheel – a legacy object -   by their services or products, and suggest that they have superior knowledge of all things new and different and these new and different things are not ‘legacy’. After all, in vendor gadget technology what isn’t ‘new’ is ‘bad’ and ‘if it works, it’s out of date’. Thus, legacy means ‘bad’ because it’s  not ‘new’ (even if it uses new components) and, most importantly, it’s not what they are selling.

Just because “2 + 2 = 4” is legacy math, i.e. old, and thus bad, it doesn’t mean that it’s no longer true in cloud math.  It is still true, but needs to be repackaged.

So, in the interest of actionable market relevance, here is a new, fresh, cloud hyped- up version of “2+2=4”:

“It has been newly (re)discovered that ‘2 + 2 is optimally 4 and exceptionally relevant for business purposes.  The scope of this process is enhanced for sufficiently configured integer values of {2,4} in a dynamic web-enabled hi definition virtual presence wherein it has locality of reference within the set of all integer number segments of the arithmetic cloud infrastructure. This will provide a competitive edge to your business as newly revealed by the appropriate cloud-centric data mining tools (c1, c2, … cn, ) - with price guarantees, if you act now! -  at current release, version and maintenance levels in dynamic optimal adaptive combination. This fabulous offering is expertly administered under the guidance of cloud certified  analysts, at an attractive hourly rate, who are not now, nor ever have been, legacy experts and thus ‘new’ and ‘fresh’ with exciting social networking added value potential. (Please join us on the Facebook group “I like integer addition with cloud computing”).”
                                   

Of course, I might be preaching to the choir (rather than to the clouds) on this one. It seems, nevertheless, that corporate IT vendors demonstrate a kind of ‘math neurosis’:

A math-psychotic does NOT believe that 2+2=4.
A math-neurotic knows that 2+2=4 is true, but hates it. It must be repackaged for resale and aggressively marketed with a customer focused strategy.

If mathematics is the art of giving the same name to different things (J. H. Poincare), then IT marketing is the art of giving a new name to the same things and using pretty charts.

  
THE theologically orthodox  axiom for information technology services/product vendors:

"Absolutum Obsoletum"
(TimLatin translated: "If it works,  it’s out of date").









How to make 3 mice out of 2 mice by making 2 = 1:

















(Posted with Tim Browning's permission)

Friday, December 10, 2010

The Exception Value Concept to Measure Magnitude of Systems Behavior Anomalies


The Exception Value concept was introduced in my first CMG paper in 2001 (see the last link in the first post of this blog). I later found that this EV approach can also be used for recognizing trends and separating them out of the historical data, as described in my 2008 paper: Exception Based Modeling and Forecasting.

Then I noticed that some other vendors started using a similar concept (see my post from last year about that: Exception Value (EV) and OPNET Panorama)...

The latest news about that concept is the following.

At the CMG'10 conference I met BMC Software specialist Dima Seliverstrov, and he mentioned that he references my first CMG'01 paper in his CMG presentation (scheduled to be presented TODAY!). I looked at his paper "Application of Stock Market Technical Analysis Techniques to Computer System Performance Data" (abstract is linked here), and indeed he shows an interesting way to use my EV technique to evaluate stock market deviations and automate some brokerage processes! Here is the paragraph from his paper about it:

"Buy or sell signals are generated when the daily value moves outside of the error bars. It’s not only important to identify which systems have buy and sell signals, but which systems to look at first. A useful approach to rank the signals from multiple sources is to calculate the area outside the error bars and rank based on the area [4].  For example if one systems disk space exceeded area is 100 Gbytes outside and another system is 1 Kbyte you would look at the system with a larger area first. Another useful technique for CPU Utilization is to normalize the area outside the envelope by converting to SPECint...

By the way, I remember that my first paper also suggested a similar normalization, but based not on the SPECint benchmark (I know that metric is used by BMC as the main sizing factor, and that is fine) but on the more effective, though harder to obtain, TPC (http://www.tpc.org/) benchmark.
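A minimal sketch of the EV computation and ranking described in the quoted paragraph (my own illustration; the data and limits are invented): the EV is the area of the excursions outside the control limit, and systems are looked at in order of that area.

```python
# Exception Value (EV): sum the area of excursions above the control limit,
# then rank systems so the largest exceeded area is examined first.

def exception_value(actual, upper_limit):
    """Sum of excursions above the control limit (the 'area' outside it)."""
    return sum(max(a - u, 0) for a, u in zip(actual, upper_limit))

systems = {
    "sysA": ([55, 60, 95, 98], [70, 70, 70, 70]),
    "sysB": ([71, 72, 69, 70], [70, 70, 70, 70]),
}
ranked = sorted(systems, key=lambda s: exception_value(*systems[s]), reverse=True)
print(ranked)  # the system with the larger exceeded area comes first

# To compare differently sized servers, the EV can be normalized by a
# capacity rating (SPECint or TPC): exception_value(...) / rating.
```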
Here is the figure from my CMG'01 paper (sorry for the bad quality...)



Anyway I am pleased that my idea is alive!

Below are some of my other postings discussing the EV idea:
Feb 28, 2009
Dec 29, 2009
Jan 24, 2009
Jun 21, 2010

Wednesday, December 1, 2010

Cloud Computing Capacity Management

Interesting that a couple of years ago I was interviewed by Google for a Program Manager position, and in the last phone interview I was asked how to do capacity management for cloud computing. I did not really know...
(I had not dealt with that yet - only CMG-based knowledge - see
C. Molloy's  presentation:

Capacity Management for Cloud Computing

... but I tried to tell them that the generic approach should apply, treating a cloud as just a highly virtualized infrastructure with a very high mobility feature to satisfy any additional capacity demand almost on the fly. Cloud is just the next level of virtualization. Right?

And my favorite smart alerting approach (based on dynamic thresholds) could automate finding the moment when additional capacity needs to be allocated. (I think I mentioned that in one of my papers.) As far as I know, currently it is done based on strictly static thresholds.
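A minimal sketch of that smart-alerting idea, with invented numbers (this is an illustration of the general dynamic-threshold principle, not my SEDS code): the alert limit is learned from history as mean + 3 sigma of the same time slot in past weeks, instead of being a fixed constant.

```python
# Dynamic (learned) threshold instead of a static one: the upper control
# limit comes from past observations of the same hour-of-week slot.
from statistics import mean, stdev

def dynamic_upper_limit(history):
    """Upper control limit learned from past observations of the same slot."""
    return mean(history) + 3 * stdev(history)

same_hour_past_weeks = [40, 42, 38, 41, 39]    # CPU% for, say, Mondays 10:00
limit = dynamic_upper_limit(same_hour_past_weeks)

current = 62
if current > limit:
    print("smart alert: additional capacity may need to be allocated")
```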

The figure is a chart capturing a capacity usage change that happened in a VMware environment (the control chart is for a VM, the trend is for a host) - it is from my new CMG'10 paper: IT-CONTROL CHART

BTW, I failed and did not get the offer from Google, but my family was not really ready to change coasts anyway, so I just decided that it was a test for Google and they failed, not me!

(see a more recent post about cloud computing here:

Tim Browning: a review of the cloud computing article "Optimal Density of Workload Placement")

Monday, November 15, 2010

My CMG'10 presentation - "IT-Control Charts"

I will go to the CMG conference this time for only one day, just to present my paper "IT-Control Charts" on Wednesday, December 8th at 10:30 - you are WELCOME!

Check it in the CMG conference agenda  - http://www.cmg.org/cgi-bin/agenda_2010.pl?action=more&token=5030

For Russian readers (information in Russian is here) I made a posting about that event in my Russian mirror blog: http://ukor.blogspot.com/2010/11/cmg10_15.html

Friday, November 5, 2010

CMG'09: Performance Data Statistical Exceptions Analysis (Review)

  1. The best CMG'09 conference (www.CMG.org) paper award was granted to the following paper:
Survival Analysis In Computer Performance Analysis by
Brian Barnett, Perry Gibson, and Frank Bereznay

That paper has a deep discussion of the normality of performance data, showing examples where the MASF approach does not work. Survival Analysis, which does not require any knowledge of how the data are distributed, was suggested for those cases.

  2. Sunday workshops
a. I ran my workshop there (see My CMG'09 Sunday Workshop) with good attendance (~20 attendees) and interest expressed by the audience. This year the CMG'10 conference will publish my new paper ("IT-Control Charts"), which is based on my CMG'09 workshop.

b. Another interesting workshop I attended was “R – An Environment for Analyzing and Visualizing (Performance) Data” by Jim Holtman, who also published and presented his paper at the same conference: The Use of R for System Performance Analysis. That was an excellent topic, as R is a free tool that could replace expensive ones like SAS. (See my example R code for building control charts here: Power of Control Charts.)

  3. Other interesting papers
    a. The most interesting paper related to this blog's subjects was:
“How ‘Normal’ is your IT data?”  by  Mazda A. Marvasti, Ph.D., CISSP Integrien Corporation

I had already published some information about Integrien tools (see Real-Time Statistical Exception Detection). The paper was a good illustration and explanation of why a performance tool needs the ability to work correctly with non-normally distributed data: “the behavior of IT data, across a variety of collection sources and data types, does not resemble normal distribution….”

    b. The following author references my work in his paper:
“Lean Monitoring Framework For eBusiness Applications” by  Ramapantula Udaya Shankar

Monday, October 18, 2010

Statistical Process Control to Improve IT Services - one more CMG'10 paper related to this blog subject

Using Statistical Process Control to Improve the Quality and Delivery of IT Services
Nathan Shiffman
Armin Roeseler, Townsend Analytics
Mike Pecak
This session presents a framework for the delivery of IT services based on Continuous Quality Improvement (CQI). Starting with the Capability Maturity Model (CMM), we develop a process oriented approach based on Statistical Process Control (SPC). We apply the framework to the Change Management process of a large IT environment for a trading software firm, and show how failure-rates of the Change Management process were reduced dramatically.

Monday, September 13, 2010

SEDS elements in the Fluke VPM (Application Performance Management tool)

I have just received two days of training on the Fluke VPM tool. I have already mentioned it in my other posting:
"Baselining and dynamic thresholds features in Fluke and Tivoli tools". Below are my additional comments about the tool.
  • They have the same approach as our SEDS - the application performance status provides a list of the business applications with the most unusual response times. And they use a heat chart for that, similar to what SEDS used (see the CMG'07 trip report and the tree-map).
  • Smart alerts and the statistical filtering are used only for response time metric and the alert is issued only based on dynamic upper-limits.
  • The learning period (baseline) is "sliding", just like in the main SEDS mode, but it is based on only three weeks of raw data history, and it looks like it is not grouped by hour-weekday (like SEDS does) but maybe grouped by work and off hours (need to check).
  • I suggested to the VPM trainer that the statistical filtering could be applied to the transaction volume metric as well, and that not only upper limits but also lower limits should be used, as an unusually low transaction rate needs to be captured as a potentially bad issue.


All in all I was impressed by the way they implemented basic SEDS principles to filter application performance metrics. I suggested doing that in my 2006 CMG paper "SYSTEM MANAGEMENT BY EXCEPTION, PART 6", and finally it was done! (The paper can be found in the posting:


That paper also suggested using a heat chart (tree-map) for network metrics too:

"...For Network devices, the bandwidth utilization can be tree-mapped. Figure shows an example of a Network tree-map. Color coding in this report could be based on exceeding constant thresholds or statistical control limits (SEDS based). Each small box represent a device (size could be indicative of relative capacity, e.g. 1 GB or 100 MB network) and a big outline box could represent a particular application or site (e.g. building)..."



Tuesday, July 27, 2010

My new CMG'10 paper "IT-Control Charts" was accepted to be published and presented

My new (10th) CMG paper "IT-Control Charts" was accepted and will be presented in Orlando on Wednesday, December 8th at 10:30. It is a paper version of my workshop "Power of Control Chart: How to Read, How to Build, How to Use", which I ran several times last year.

    IT-Control Charts
    The Control Chart originally used in Mechanical Engineering has become one of the main Six Sigma tools to optimize business processes, and after some adjustments it is used in IT Capacity Management especially in “behavior learning” products. The paper answers the following questions. What is the Control Chart and how to read it? Where is the Control Chart used? Review of some performance tools that use it. Control chart types: MASF charts vs. SPC; IT-Control Chart for IT Application performance control. How to build a Control Chart using Excel for interactive analysis and R to do it automatically?

Wednesday, June 30, 2010

IT-Control Chart against Network Traffic Data vs. Process Level Data Chart


The first implementation of applying the SEDS methodology to network traffic data was published in my CMG'07 presentation (SYSTEM MANAGEMENT BY EXCEPTION:
The Final Part):



A process-level data chart can also be used together with the control charts to find which particular process is responsible for unusual spikes. The figure above shows how the CPU-usage-by-process chart can be used to show that the incremental daily back-up causes the small daily spikes on the control chart of network traffic (NIC level). The full back-up caused one big spike per week, expanding activity into work hours, where it could dangerously interfere with the other DB2 on-line workload on that server.



Monday, June 21, 2010

Industrial Robot Grasping Processes Research. EV prototype was there!

I have published in my Russian blog here the abstract of my PhD dissertation (1986):

Research of Industrial Robot Grasping Processes 


The main idea of that work was to find a way to calculate a set of initial grasping (or assembling) object coordinates that would guarantee a successful grasping (or assembling) process (operation). I called that the "Area of Normal Functioning" - ANF (Область Нормального Функционирования - ОНФ). If the grasping or assembling process starts with parameters (or coordinates) that are not in that area (ANF), the process will fail. The area defines the coordinates where passive (natural) adaptation would work. Interestingly, that robotics subject is still active - see the following link


Underactuated hand with passive adaptation

Rereading that old work of mine, I suddenly realized that my recent idea of the Exception Value (EV - the area between the statistical limits and the actual values that just happened) is very similar to my very old idea of calculating limits for successful assembling or robot-grasping processes!


Apparently my mind works very consistently.... 


   

Wednesday, May 26, 2010

CMG'10 Interesting Paper Abstracts

CMG.org published the following interesting paper abstracts for the upcoming 2010 national conference that look like they are related to the subject of this blog:

    IT-Control Charts
    The Control Chart originally used in Mechanical Engineering has become one of the main Six Sigma tools to optimize business processes, and after some adjustments it is used in IT Capacity Management especially in “behavior learning” products. The paper answers the following questions. What is the Control Chart and how to read it? Where is the Control Chart used? Review of some performance tools that use it. Control chart types: MASF charts vs. SPC; IT-Control Chart for IT Application performance control. How to build a Control Chart using Excel for interactive analysis and R to do it automatically?

Effective Proactive Service Capacity Management using Adaptive Thresholds
Proactive Service Capacity Management (SCM) is essential to deliver desired service level to its users. It enables service to be available as per the SLA with agreed service performance. For proactive SCM, monitoring and proper thresholds are basic ingredients. Often service usage is seasonal in nature and setting fixed thresholds for alerts can’t take into account the variability of usage and significance of alerts. This paper is aimed at introducing the concept of adaptive thresholds and discusses how this should be utilized to proactively manage e2e service perf and capacity aspects. 
Using Statistical Process Control to Improve the Quality and Delivery of IT Services
This paper presents a framework for the delivery of IT services based on Continuous Quality Improvement (CQI). Starting with the Capability Maturity Model (CMM), we develop a process oriented approach based on Statistical Process Control (SPC). We apply the framework to the Change Management process of a large IT environment for a trading software firm, and show how failure-rates of the Change Management process were reduced dramatically. 


Thursday, May 20, 2010

Baselining and dynamic thresholds features in Fluke and Tivoli tools

I have just attended two demo sessions about Fluke VPM and Tivoli Monitoring tools and both presentations included the following new features related to performance Exception Detection technology:

1. Fluke (NetFlow Tracker - http://www.skomplekt.com/pdf/PerformanceAndScalability(fnet).pdf)
- Baselined alarms trigger when normal usage is exceeded.
- Automatically choose a threshold using a baseline, or manually specify.
- Tracker baselines individual elements of a report, not just the total.
- Baseline can be static (learn once) or update weekly (learn every week).

2. Tivoli Monitoring Version 6.2.2 ( http://publib.boulder.ibm.com/infocenter/tivihelp/v15r1/index.jsp?topic=/com.ibm.itm.doc_6.2.2/new_version622.htm)
- The bar chart, plot chart, and area chart have a new "Add Monitored Baseline" tool for selecting a situation to compare with the current sampling and anticipated values. The plot chart and area chart also have a new "Add Statistical Baseline" tool with statistical functions. In addition, the plot chart has a new Add Historical Baseline tool for comparing current samplings with a historical period.
- Situation overrides for dynamic thresholding.

That is interesting, as I personally discussed with the Tivoli team the possibility of adding SEDS-like features to Tivoli tools just before I left IBM (about 3 years ago). It looks like they have finally implemented some elements of that technology!

I will be playing with all those features soon and plan to add more comments about how they work.

Tuesday, April 13, 2010

Disk Subsystem Capacity Management - my CMG'03 paper - "Health Index" metric and Dynamic Thresholds

Here is the link to my CMG'03 paper:  http://www.cmg.org/proceedings/2003/3099.pdf
(Free download but registration is required)
Presentation slides are freely available here:
Disk Subsystem Capacity Management, Based on Business ... - CMG

1. The paper showed an interesting way to report disk space usage via BMC Perceive:


2. The paper also has an example of using an interesting "Health Index" metric. I just took it from the Concord performance data collector (now a CA product, I believe) as one of many performance metrics.


Based on the Concord eHealth tool documentation:

“System Health Index” is the sum of five components (variables):
–SYSTEM, which reports a CPU imbalance problem;
–MEMORY, which is exceeding some memory utilization threshold or reflects some paging and/or swapping problems;
–CPU, which is exceeding some utilization threshold;
–COMM., which reports network errors or exceeding some network volume thresholds;
–And STORAGE, which might be a combination of
a. Exceeding user partition utilization threshold;

b. Exceeding system partition utilization threshold;

c. File cache miss rate, Allocation failures and

d. Disk I/O faults problem that can add additional points to this Health Index component.

I used that long ago; currently I do not have that collector in my environment.
But I have started calculating my own kind of "health index", which is based on the numbers and types of exceptions (e.g. hot ones are defects like run-aways; warning ones are just severe deviations from statistical norms; the number of hours/days with exceptions also matters). Filtered by application (using the CMDB), it gives you an idea of how stable each application is. There are some elements of that approach in my other papers.
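A minimal sketch of that home-grown health index (the weights and counts below are invented illustrations, not my production values): weight exceptions by severity, so hot defects count more than warnings, and add the number of hours with exceptions.

```python
# Toy "health index": higher score = less healthy application.

SEVERITY_WEIGHT = {"hot": 5, "warning": 1}

def health_index(exceptions, hours_with_exceptions):
    """Severity-weighted exception count plus hours showing exceptions."""
    score = sum(SEVERITY_WEIGHT[sev] for sev in exceptions)
    return score + hours_with_exceptions

app_a = health_index(["hot", "warning", "warning"], hours_with_exceptions=6)
app_b = health_index(["warning"], hours_with_exceptions=1)
print(app_a, app_b)   # app_a is the less stable application
```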

2011 update: Another important idea in the paper is the suggestion to use dynamic thresholds, since for high-level I/O-related metrics there are no natural thresholds. Dynamic thresholds have recently become popular, but I introduced them long ago!

Capturing Workload Pathology By SEDS - my CMG'05 paper

The paper can be found here: https://www.researchgate.net/publication/221447101_Capturing_Workload_Pathology_by_Statistical_Exception_Detection_System
Here is the summary:
Problem definition: Server workload pathologies (defects) such as run-away processes and memory leaks capture spare server resources and cause the following issues:
- being a parasitic type of workload, they compete for resources with the real workload and cause performance degradation;
- they mimic a capacity issue, but they are not a real capacity problem; they just spoil the historical sample and cause wrong capacity trends, as seen in the figure below:

To fight this problem I developed a way to capture those defects, report on them, and then remove them from the historical sample to see the real capacity trends. That was implemented as part of the SEDS application. Detailed explanations are in my CMG'05 paper "Capturing Workload Pathology by Statistical Exception Detection System".
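A rough sketch of that capture-and-remove idea (a simplified invention for illustration, not the SEDS implementation itself): a memory-leak-like defect shows up as steady monotonic growth, and once flagged, those samples are excluded from the history used for capacity trending.

```python
# Flag leak-like series and drop them from the historical sample so the
# capacity trend is computed from healthy data only.

def looks_like_leak(samples, min_growing_fraction=0.9):
    """Flag a series where almost every step increases (monotonic growth)."""
    steps = [b > a for a, b in zip(samples, samples[1:])]
    return sum(steps) / len(steps) >= min_growing_fraction

daily_memory = {
    "day1": [100, 102, 99, 101],           # normal fluctuation
    "day2": [100, 120, 141, 160, 181],     # leak-like monotonic climb
}
clean_history = {d: s for d, s in daily_memory.items() if not looks_like_leak(s)}
print(sorted(clean_history))   # only the healthy day feeds the capacity trend
```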

Another good result of implementing this resolution was a dramatic reduction in the number of incidents related to run-away and memory-leak defects. The chart below shows a 2x+ reduction over 2 years:


Other work in this area was done by Ron Kaminski. See his CMG paper here:

Automating Process and Workload Pathology Detection


presentation slides:  

Automating Process and Workload Pathology