Tuesday, December 27, 2016

How Bayesian inference works. Can it improve SPC?

What about using that approach for SPC? At least one article about it is HERE:
Bayesian Statistical Process Control
and some debate about it is HERE:

Why isn't bayesian statistics more popular for statistical process control 

Note there is a "Bayesian statistical process control chart"...

In my experience we often do not have enough data for reliable mean, UCL, and LCL calculations, so an approach based on "An Application to Bayesian Methods in SPC" could help:
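As a toy illustration (my own sketch, not taken from the article): with a conjugate normal-normal model, a prior belief about the process mean can compensate for having only a handful of samples when setting control limits. The function name and all numbers here are hypothetical.

```python
import numpy as np

# Hypothetical sketch: normal-normal conjugate update of the process mean
# when only a few samples are available for control-limit calculation.
def bayes_control_limits(samples, prior_mean, prior_var, noise_var):
    """Posterior mean of the process mean, plus Bayesian 3-sigma limits."""
    n = len(samples)
    post_var = 1.0 / (1.0 / prior_var + n / noise_var)
    post_mean = post_var * (prior_mean / prior_var + np.sum(samples) / noise_var)
    sigma = np.sqrt(noise_var + post_var)  # predictive std for a new point
    return post_mean, post_mean - 3 * sigma, post_mean + 3 * sigma

# Only three samples, but the prior keeps the limits sensible.
mean, lcl, ucl = bayes_control_limits([9.8, 10.3, 10.1], prior_mean=10.0,
                                      prior_var=1.0, noise_var=0.25)
print(mean, lcl, ucl)
```

With more samples the posterior is dominated by the data, so the chart smoothly converges to the classic frequentist limits.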

I plan to explore this more....

Wednesday, December 7, 2016

The BANK should be a tech company to win the market

I work for Capital One bank as an IT SME, and naturally I support the direction it is going. Friends keep asking me how a bank could be a tech company like Netflix. The person best able to answer that question is the CIO of Capital One. HERE IT IS:

Capital One rides the cloud to tech company transformation

I am finally getting used to the new style of workplace I have now:
The picture from the article linked above

Friday, December 2, 2016

CMG'16 (#imPACt) aftershocks: Could we WAC the #Cloud? Or how to build a cube of cubes

One of my findings from attending the CMG'16 conference and speaking with vendors' representatives is 6fusion's way to measure a system's overall status by the WAC - Workload Allocation Cube.

"The tool measures the resource consumption of AWS compute (EC2) instances, along with Elastic Block Storage (EBS) volumes. It does this via 6fusion's "Workload Allocation Cube" (WAC) technology, which works by measuring datapoints that include CPU utilization, disk utilization, storage capacity, and disk, WAN, and LAN IOPS. This information is aggregated through its WAC technology to output a single value that reflects the performance and resource use of an app." http://www.theregister.co.uk/2013/07/29/6fusion_workload_allocation_cube/
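6fusion's exact WAC formula is proprietary, so the following is only a rough sketch of the idea: normalize each of the measured dimensions by its capacity and collapse them into a single score. The metric names, equal weighting, and numbers are my own assumptions.

```python
# Illustrative only: 6fusion's actual WAC math is not public; this sketch just
# shows collapsing the six measured dimensions into one 0-100 score.
def wac_score(metrics, capacities):
    """Average utilization across the measured dimensions, as a 0-100 score."""
    utils = [metrics[k] / capacities[k] for k in capacities]
    return 100.0 * sum(utils) / len(utils)

metrics = {"cpu": 12.0, "mem": 24.0, "storage": 400.0,
           "disk_iops": 150.0, "lan_iops": 90.0, "wan_iops": 20.0}
capacities = {"cpu": 16.0, "mem": 64.0, "storage": 1000.0,
              "disk_iops": 500.0, "lan_iops": 300.0, "wan_iops": 100.0}
print(round(wac_score(metrics, capacities), 2))  # one number per server/app
```

The point of such a composite is comparability: any two workloads, on any infrastructure, reduce to a single value that can be tracked or billed.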

My comment so far is the following:

Long ago I tried to use the "System Health Index" (from the Concord eHealth performance monitor) to estimate system usage based on measures of 5 main subsystems, and I published my thoughts about it in my CMG'03, '06, and '07 papers; e.g., see some details in the following post.

Disk Subsystem Capacity Management - my CMG'03 paper - "Health Index" metric and Dynamic Thresholds

Having that metric recorded, I suggested (in my CMG'07 paper) using a tree-map report against it to get one overall health-check picture of numerous systems.

I have also applied my anomaly detection technique (SEDS) to that Health Index metric to detect at once any abnormal cases across all 5 main subsystems:

I published a few examples in my CMG'06 paper SYSTEM MANAGEMENT BY EXCEPTION, PART 6
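For readers unfamiliar with the underlying idea, here is a minimal MASF-style sketch (the baseline technique that SEDS extends): build a reference set per hour-of-week from history and flag new readings outside mean ± 3 sigma. The data and function names are illustrative only.

```python
import statistics
from collections import defaultdict

# Minimal MASF-style baseline: group history by hour-of-week, then flag
# readings that fall outside mean +/- k*sigma for that hour.
def build_baseline(history):
    """history: list of (hour_of_week, value) pairs."""
    groups = defaultdict(list)
    for hour, value in history:
        groups[hour].append(value)
    return {h: (statistics.mean(v), statistics.pstdev(v)) for h, v in groups.items()}

def is_exception(baseline, hour, value, k=3.0):
    mean, sd = baseline[hour]
    return abs(value - mean) > k * sd if sd > 0 else value != mean

history = [(9, v) for v in [40, 42, 41, 43, 39]] + [(3, v) for v in [5, 6, 4, 5, 5]]
baseline = build_baseline(history)
print(is_exception(baseline, 3, 45))   # nighttime spike vs. quiet-hour baseline
print(is_exception(baseline, 9, 42))   # ordinary business-hour load
```

The same value (e.g. 45% CPU) can be normal at 9am and an exception at 3am, which is exactly why the baseline is keyed by hour-of-week rather than global.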

Anyway, I see some similarity between the modern "WAC" and the good old "Health Index". Do you see it as well?

P.S. One more thing: WAC is a cube, and the SEDS data (the SEDS profile) in the picture above is also a data cube that represents a "signature" of the object (a server in this case). Actually, if the Health Index were a cube as well, the SEDS Health Index profile would be a cube of cubes!

Anyway, here is the discussion of how that SEDS profile data cube can be built using open source tools (my most visited post in this blog, btw):

One Example of BIRT Data Cubes Usage for Performance Data Analysis 

Tuesday, November 29, 2016

Invitation to the Advanced Software in #Robotics conference in LIEGE (Belgium, 1983) to present my paper

I was a co-author of the paper "Mathematical simulation of tasks of robot operation accuracy and readability". See session 4 in the agenda above. (Note: my initial is misspelled as C.A. TRUBIN; it should be I.A. TRUBIN.)

That was during the USSR era, and although we were invited and even sent our paper translated into English, some communist functionary went there instead of us and did not even show up at the conference... So it was not really published...
So far I have found the following online documents related to this:
- http://ieeexplore.ieee.org/document/4336367/
- http://dl.acm.org/citation.cfm?id=577664
- https://www.amazon.com/Advanced-Software-Robotics-International-Proceedings/dp/0444868143 (where the proceedings can be bought!)

Saturday, November 12, 2016

Help us generate content for the CMG blog - LinkedIn CMG group discussion

Be the hero of CMG, write a blog post! - Renato Bonomini

At the #CMGimPACt conference, did you learn something that you'd like to share? - Todd Minnella

Share your learnings, they'll make a great blog post! Did you leave with more questions than answers? Share your questions, they will be our content ideas!

Do you have a crazy idea? Let us know! - Melanie Heimer

Share your ideas in the comments, or contact me or Igor Trubin directly!

Sunday, November 6, 2016

Sitting on the Board of Directors (www.CMG.org)


I am at the CMG'2016 conference (imPACt) right now and am going to attend the following session:


by Alexander Gilgur, Steve Politis (Facebook, Inc.)

We often talk about performance and capacity as one thing, and indeed they complement each other in a powerful balancing loop: higher capacity improves performance, decreased performance indicates insufficient capacity, which needs to be provisioned for. However, we often miss the fact that measurement and aggregation approaches that are used in performance monitoring are not always useful for capacity planning, while approaches that we use in capacity planning are often meaningless for performance analysis. This paper explores this gap and discusses ways to reconcile the two tasks.
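The aggregation gap the abstract describes can be seen in a tiny, made-up example: an hourly average (a typical capacity-planning aggregation) hides the short burst that dominates what users actually experienced.

```python
import statistics

# Illustrative numbers only: one hour of minute-level utilization samples
# containing a brief five-minute burst.
samples = [20] * 55 + [95] * 5

hourly_mean = statistics.mean(samples)               # capacity view: "headroom"
p95 = sorted(samples)[int(0.95 * len(samples)) - 1]  # performance view: the burst

print(hourly_mean)  # looks comfortably low
print(p95)          # shows the near-saturation users actually hit
```

Both numbers are "correct"; they just answer different questions, which is why the same metric stream needs different aggregations for the two disciplines.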

I really appreciate that they mentioned the following work of mine in their paper:

Trubin, I. (2006) System Management by Exception. Presented at the Annual International Conference of the Computer Measurement Group (CMG 2006). Reno, NV. December 2006.

But I feel like my later publication was more relevant to the subject:

Exception Based Modeling and Forecasting - CMG'2008


How often does the need arise for modeling and forecasting? Should it be done manually, ad hoc, by project request, or automatically? What tools and techniques are best for that? When is a trending forecast enough, and when is a correlation with business drivers required? The answers to these questions are presented in this session. The capacity management system should automatically provide a small list of resources that need to be modeled or forecasted; a simple spreadsheet tool can be used for that. This method is already implemented in the author's environment with thousands of servers.
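A minimal sketch of that idea (my own illustration, not the paper's actual implementation): fit a simple linear trend per server and automatically shortlist only those projected to cross a utilization threshold within the forecast horizon. Server names and data are hypothetical.

```python
# Exception-based forecasting sketch: only servers whose projected trend
# crosses the threshold make the shortlist for real modeling work.
def linear_trend(history):
    n = len(history)
    xs = list(range(n))
    x_mean, y_mean = sum(xs) / n, sum(history) / n
    slope = (sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, history))
             / sum((x - x_mean) ** 2 for x in xs))
    return slope, y_mean - slope * x_mean

def needs_modeling(history, horizon=12, threshold=80.0):
    slope, intercept = linear_trend(history)
    projected = intercept + slope * (len(history) - 1 + horizon)
    return projected > threshold

servers = {"app01": [50, 52, 55, 57, 60, 62],   # steady growth
           "app02": [30, 31, 30, 29, 31, 30]}   # flat
shortlist = [name for name, h in servers.items() if needs_modeling(h)]
print(shortlist)
```

Out of thousands of servers, only the handful on the shortlist would then get manual attention or a spreadsheet model, which is the "by exception" part.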
I am looking forward to meeting the presenter (my long-time CMG friend Alex) to discuss this...

Tuesday, November 1, 2016

Southern CMG meeting presentations are published (#AppDynamics, #Cirba, #IBM, #DataKinetics, #MetLife, #Fidelity and me) - #CMGnews

The Southern Computer Measurement Group (SCMG) held a very successful meeting with 40 attendees.

Main presentations (including mine) were uploaded to the site:

- AppDynamics: "The differing ways to monitor and instrument" LINK TO PRESENTATION
- DataKinetics: "Best Practices - Populating Big Data Repositories from DB2, IMS and VSAM" LINK TO PRESENTATION
- Cirba: "Capacity Management for Hybrid IT" LINK TO PRESENTATION
- Igor Trubin: "Is Your Capacity Available?" LINK TO PRESENTATION
- MetLife: "Capacity Management Is Still Relevant" LINK TO PRESENTATION
- Fidelity: "Too Big to Test: Breaking a production brokerage platform without causing financial devastation" LINK TO PRESENTATION

Full Agenda is here

Friday, October 21, 2016

Interesting 1992 conference paper about SPC (Shewhart charts) and machine learning.

Interpreting statistical process control (SPC) charts using machine learning and expert system techniques

Conference Paper · June 1992

Statistical process control (SPC) charts are one of several tools used in quality control. The SPC quality control tool has been under-utilized due to the lack of experienced personnel able to identify and interpret patterns within the control charts. The Special Projects Office of the Center for Supportability and Technology Insertion (CSTI) has developed a hybrid machine-learning and expert-system software tool which automates the process of constructing and interpreting control charts. The software tool draws control charts, identifies various chart patterns, advises what each pattern means, and suggests possible corrective actions. The application is easily modifiable for process-specific applications through simple modifications to the knowledge base portion using any word processing software. The authors discuss control charts, software functionality, software design, machine learning, and the expert system.

Thursday, October 20, 2016

SCMG meeting in Cary, NC

I enjoy listening and presenting!

Monday, October 10, 2016

Interesting paper about "Adaptive Anomaly Detection in Cloud"

Adaptive Anomaly Detection in Cloud using Robust and Scalable Principal Component Analysis 


This paper proposes a novel and scalable model for automatic anomaly detection in a large system such as a cloud. Anomaly detection issues early warning of unusual behavior in dynamic environments by learning system characteristics from normal operational data. Anomalies in large systems are difficult to detect due to heterogeneity, dynamicity, scalability, hidden complexity, and time limitations. To detect anomalous activity in the cloud, we need to monitor the datacenter and collect cloud performance data. In this paper, we propose an adaptive anomaly detection mechanism which investigates principal components of the performance metrics. It transforms the performance metrics into a low-rank matrix and then calculates the orthogonal distance using the Robust PCA algorithm. The proposed model updates itself recursively, learning and adjusting the new threshold value in order to minimize reconstruction errors. This paper also investigates robust principal component analysis in distributed environments using Apache Spark as the underlying framework, specifically addressing cases in which a normal operation might exhibit multiple hidden modes. The accuracy and sensitivity of the model are tested on Google data center traces and Yahoo! datasets. The model achieves an 87.24% accuracy.
MY COMMENT: By the way, the paper references the MASF technique, which I have enhanced and have been using (check my SETDS methodology) for years to capture anomalies (exceptions) and sudden short-term trends across huge server farms (20,000+ servers), including private and public clouds. Note that my approach is much, much simpler, and although MASF indeed has a high rate of false positives, SETDS has a way to handle that well.
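To make the paper's core mechanism concrete: the sketch below uses ordinary PCA (via SVD) rather than the Robust PCA algorithm the authors use, but it demonstrates the same idea of scoring samples by their orthogonal (reconstruction) distance from the principal subspace. The synthetic data and injected anomaly are my own.

```python
import numpy as np

# Orthogonal-distance anomaly scoring with plain PCA (a simplification of the
# paper's Robust PCA approach): samples far from the principal subspace are
# candidate anomalies.
def orthogonal_distances(X, n_components=1):
    Xc = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    V = Vt[:n_components].T                 # basis of the principal subspace
    residual = Xc - Xc @ V @ V.T            # component outside the subspace
    return np.linalg.norm(residual, axis=1)

rng = np.random.default_rng(0)
# Mostly rank-1 performance data (correlated metrics) plus small noise...
X = rng.normal(size=(100, 3)) @ np.array([[1.0, 2.0, 0.5]]).T @ np.ones((1, 3))
X += rng.normal(scale=0.05, size=X.shape)
X[42] += np.array([0.0, 3.0, -3.0])         # ...with one injected anomaly
d = orthogonal_distances(X, n_components=1)
print(int(np.argmax(d)))                    # the anomalous row stands out
```

Thresholding these distances (the paper does it adaptively) turns the score into an alarm, which is conceptually close to the exception lists SETDS produces.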

Friday, October 7, 2016

Southern CMG meeting in Cary, NC on October 20th - Final Agenda

The SCMG is proud to announce our Fall 2016 Meeting, an all-day event on October 20, 2016.

MetLife – Grace Hopper Auditorium (Bldg. 1 (MET 1), Floor 01, Room 600)
101 MetLife Way
Cary NC, 27513

HURRY: REGISTER NO LATER THAN OCT. 14!! This covers Breakfast and Lunch. We need to confirm the number of Registrants so that we can properly plan the catering. We also need your Registration information in order to get a list to MetLife Security so that we will have visitor badges ready for you the morning of Oct. 20 when you arrive in the MET1 lobby.

To REGISTER, use the  registration page.  You can use a PayPal account or credit card.   Your registration payment through the PayPal button logs your registration.


8:00-8:45 ET
Registration / Breakfast
Speaker BIO
Breakfast provided with Sponsor Session:  AppDynamics
Rick Weaver

“Best Practices - Populating Big Data Repositories from DB2, IMS and VSAM”
Over the past 25 years, Rick Weaver has become a well-known mainframe expert specializing in database protection, replication, recovery and performance. Because of his vast expertise, he has authored numerous articles, whitepapers and other valuable pieces on database technologies, and frequently spoken on the subjects of database recovery and performance at conferences, symposiums and user groups.
Ann Dowling

“Capacity Management Is Still Relevant”
Ann joined MetLife in April 2016 as the Director of Capacity & Forecast Engineering.  In this role Ann will build on her extensive background working at IBM in various disciplines including capacity planning, process architecture, performance engineering, and offering management.  Her professional passion is Capacity Planning in support of the business and how it drives the applications that consume resources on the IT infrastructure.  Ann’s most rewarding work has been leading teams to consolidate toward a common ‘best practices’ approach to capacity management.  She did so for a series of consolidations of independent data centers within IBM which evolved into her role as the global Capacity Management process owner.  That work grounded Ann’s move to consulting services with external, non-outsourced customers to evaluate their capacity management capabilities, identify strengths and gaps to then build a roadmap for improvement.   The next step was working with a specific, large account to lead a team on the implementation phases of the roadmap that gave Ann a more hands-on role working directly with the engineering and operations teams and management.  She has been on the planning committee and speaker for various IBM and CMG technical conferences.  She was an instructor for IBM’s Architecting for Performance class and author of a four-part series on “Exploring Analytics to enable the Business and Service Value of Capacity Planning”.
Kyle Parrish
CMG 2015 Mullen Award Winner

“Too Big to Test: Breaking a production brokerage platform without causing financial devastation”
Kyle currently works as a Director of Technology Risk in the FI Information Security group at Fidelity Investments.  Kyle joined Fidelity in January of 2011 as a Director of Performance Architecture charged with driving end-to-end testing of the Fidelity Brokerage systems.  Prior to joining Fidelity, Kyle worked as a consultant for over 13 years, after a career in both the private sector and a university research setting.  Kyle’s roles have spanned everything from program management to performance engineering to security, across industries as varied as airlines, financial services, manufacturing, retail, pharmaceuticals, and state government.  
LUNCH provided with Sponsor Session:  Cirba
Igor Trubin

“Is Your Capacity Available?”
I started my career in 1979 as an IBM/370 system engineer. In 1986 I got my PhD in Robotics at St. Petersburg Technical University (Russia) and then worked there as a professor for about 12 years, teaching CAD/CAM, Robotics, and Computer Science. I published 30 papers and gave several presentations at international conferences in the Robotics, Artificial Intelligence, and Computer fields. In 1999 I moved to the US and worked at Capital One bank in Richmond as a Capacity Planner. My first CMG paper was written and presented in 2001. The next one, "Global and Application Level Exception Detection System Based on MASF Technique," won a Best Paper award at CMG 2002 and was presented again at UKCMG 2003 in Oxford, England. My CMG 2004 paper about applying the MASF technique to mainframe performance data was republished at the IBM z/Series Expo. I have also presented my papers at the Central Europe CMG conference and at numerous US regional meetings. I continue to enhance my exception detection methodologies. After working more than 2 years as the Capacity Management team lead for IBM, I worked for SunTrust Bank for 3 years and then went back to IBM, holding a Sr. IT Architect position for 2+ years. Currently I work for Capital One bank as IT Manager for the IT Capacity Management group. In 2015 I was elected to the CMG (http://www.cmg.org) board of directors. Blog: www.Trub.in
Shawn Lundvall

“zBNA: Theory and Overview”
I started my IBM career in 2001 in Poughkeepsie in the Systems Architecture group, writing the Principles of Operation. In 2005 I got the opportunity to do hardware design of the fixed point unit. In 2007 I moved to Richmond, supporting clients as a Client Technical Specialist. In 2013 I joined the Washington Systems Center as a Software Engineer and am now a developer for zBNA and zPCR.
Ken Christiance

“IBM GTS Cirba Case Study “
Ken Christiance is a Distinguished Engineer with 28 years' experience at IBM. He has been working in the Strategic Outsourcing field since 1993; his experience spans service management architectures, virtualization/server management, and analytics. Ken is currently a member of the Technology, Innovation and Automation team that supports architecture and solution design for system automation, virtualization, and distributed server management. Ken holds patents and publications for technologies that provide usage accounting and billing, policy-based automation, network design, virtualization, and service management tooling.
SCMG Committee Meeting
Optional:  a breakout room will be reserved for folks who would like to hold an impromptu BOF or post SCMG informal opportunity to network.

Friday, September 9, 2016

41st International Performance & Capacity Conference

Make an imPACt in 2016. Join us in La Jolla, CA this November

My presentation "Is Your Capacity Available?" is scheduled to be presented at the International CMG conference "imPACt 2016" in La Jolla, CA on Monday, November 7th, 5-6pm. The abstract can be seen HERE

Please consider attending!

My new presentation is scheduled at the Southern CMG meeting in Cary, NC on October 20th

I am glad to announce that our next SCMG meeting will be held on October 20th in Cary, NC.
See agenda and location details at the SCMG web page:

Tentative AGENDA:

8:00-9:00 ET   Registration / Breakfast
9:00-9:30      Sponsor Session
9:30-10:30     Andrew Armstrong: "tbd"
10:30-11:30    Ann Dowling: "Capacity Management Is Still Relevant"
11:30-12:30    Kyle Parrish: "Too Big to Test: Breaking a production brokerage platform without causing financial devastation" (CMG 2015 Mullen Award Winner)
12:30-1:30     LUNCH with Sponsor Session: Cirba
1:30-2:30      Igor Trubin: "Is Your Capacity Available?"
2:30-3:30      Shawn Lundvall: "zBNA: Theory and Overview"
3:30-4:30      Ken Christiance: "tbd"
4:30-5:30      SCMG Committee Meeting

Note that I am presenting my new white paper there; it is also scheduled to be published and presented at the International CMG conference "imPACt 2016" in La Jolla, CA on Monday, November 7th, 5-6pm. The abstract can be seen HERE

Please consider attending both events!

Wednesday, August 24, 2016

CMG #imPACt and #Velocity conferences cover the Capacity Management shift

I am analyzing the content of the upcoming CMG conference (imPACt 2016).
This one is interesting for Capacity Planners:

Speaker: Ann Dowling
Company: MetLife
Session Title: A shift in who does capacity sizings 

Session Abstract:

The accelerating advance of hybrid infrastructures with promotion of self-service requests for capacity is causing a significant shift in who is responsible for sizing capacity requirements. The shift is moving out of the direct management by IT infrastructure capacity planners out to the end user or consumer - often application owners and development teams. This can be viewed as a positive shift that activates the linkage between infrastructure and application teams. The challenge is to ensure the requestors have the skills and tools to adequately size their requirements for cpu, memory, and storage for both their immediate needs and with an understanding of workload growth patterns along with the financial implications. This presentation will focus on the organizational and cultural shifts that result from the emergence and popularity of self-service capacity sizings and requests.

Even more interesting, the upcoming Velocity conference also covers this subject:

Speaker: Kevin McLaughlin
Company: Capital One
Session Title: Is capacity management still needed in the public cloud?

Session Abstract: 

The cloud holds the promise of bottomless capacity, available instantly. Recently, Capital One has been shifting a significant portion of its workload to the public cloud. Kevin McLaughlin explores what capacity management looks like in the cloud, which old concepts still apply, which should be retired, and what new metrics become important and covers the importance of performance management. Kevin also outlines what needs to be monitored as workloads transition to the cloud and what to monitor once a workload is fully in the cloud, as well as considerations for ensuring the legacy environment maintains sufficient capacity during the transition.

BTW, I know both presenters very well. Follow this post for details!

Monday, August 1, 2016


The 11th International Conference on Availability, Reliability and Security (“ARES”) will bring together researchers and practitioners in the area of dependability. ARES will highlight the various aspects of security - with special focus on the crucial linkage between availability, reliability and security.

Interesting... but I do not see any topics about the interconnection between availability and capacity in this forum...

Thursday, July 28, 2016

Adrian Cockcroft's Blog: My CMG paper on Crunching Data In the Cloud is pub...

Adrian Cockcroft's Blog: My CMG paper on Crunching Data In the Cloud is pub...: The slides are also available at http://www.slideshare.net/adrianco/crunch-your-data-in-the-cloud-with-elastic-map-reduce-amazon-emr-hadoop ...

Friday, July 8, 2016

Performance Problem Diagnosis in Cloud Infrastructures

I have been notified by RG about a new reference to my paper "Capturing Workload Pathology by Statistical Exception Detection System". The following interesting thesis referenced my work:

 Umeå University

Cloud datacenters comprise hundreds or thousands of disparate application services, each having stringent performance and availability requirements, sharing a finite set of heterogeneous hardware and software resources. The implication of such complex environment is that the occurrence of performance problems, such as slow application response and unplanned downtimes, has become a norm rather than exception resulting in decreased revenue, damaged reputation, and huge human-effort in diagnosis. Though causes can be as varied as application issues (e.g. bugs), machine-level failures (e.g. faulty server), and operator errors (e.g. mis-configurations), recent studies have attributed capacity-related issues, such as resource shortage and contention, as the cause of most performance problems on the Internet today. As cloud datacenters become increasingly autonomous there is need for automated performance diagnosis systems that can adapt their operation to reflect the changing workload and topology in the infrastructure. In particular, such systems should be able to detect anomalous performance events, uncover manifestations of capacity bottlenecks, localize actual root-cause(s), and possibly suggest or actuate corrections.
This thesis investigates approaches for diagnosing performance problems in cloud infrastructures. We present the outcome of an extensive survey of existing research contributions addressing performance diagnosis in diverse systems domains. We also present models and algorithms for detecting anomalies in real-time application performance and identification of anomalous datacenter resources based on operational metrics and spatial dependency across datacenter components. Empirical evaluations of our approaches shows how they can be used to improve end-user experience, service assurance and support root-cause analysis.

Wednesday, June 22, 2016

Scryer: Netflix’s Predictive Auto Scaling Engine (repost with my comment)

This type of pattern is not a big deal to predict; unfortunately, we usually have to deal with much noisier data, to which we have been successfully applying the SETDS methodology.

Sunday, May 22, 2016

VMware uses Cloud Infrastructure Anomalies and Trends Detection Approaches

I have a few posts related to Mazda Marvasti's work, for instance:

CMG'09: Performance Data Statistical Exceptions Analysis 

I actually still use, from time to time, the product that is based on his work (the former Alive tool), and I also see that he has recently been publishing a lot of papers and patents. The following VMware paper, published on RG, has a long list of references to his work. That is very interesting, and I should look at them more closely. Interestingly, the paper declares that Mazda's anomaly and trend detection techniques are now used for cloud-based infrastructure!



Wednesday, April 27, 2016

Neural Network in simple words

"This both confuses what a neural network actually is, and makes some people question their merits because they expect them to act like brains, when they are really a fancy type of function.
The best way to understand a neural net is to move past the name. Don't think of it as a model of a brain... it's not... that was the intention in the 1960s, but it's 2011 and they are used all the time for machine learning and classification.
A neural network is actually just a mathematical function. You enter a vector of values, those values get multiplied by other values, and a value or vector of values is output. That is all it is.
They are very useful in problem domains where there is no known function for approximating the given features (or inputs) to their outputs (classification or regression). One example would be the weather - there are lots of features to the weather - type, temperature, movement, cloud cover, past events, etc - but nobody can say exactly how to calculate what the weather will be 2 days from now. A neural network is a function that is structured in a way that makes it easy to alter its parameters to approximate weather prediction based on features.
That's the thing... it's a function and has a nice structure suited to "learning". One would take the past five years of weather data - complete with the features of the weather and the condition of the weather 2 days in the future, for every day in the past five years. The network weights (multiplying factors which reside in the edges) are generated randomly, and the data is run through. For each prediction, the NN will output values that are incorrect. Using a learning algorithm based in calculus, such as back-propagation, one can use the output error values to update all the weights in the network. After enough runs through the data, the error levels will reach some lowest point (there is more to that, but I won't get into it here - most important is overfitting). The goal is to stop the learning algorithm when error levels are at their best point. The network is then fixed, and at this point it is just a mathematical function that maps input values into output values just like any old equation. You feed new data in and trust that the output values are a good approximation.
To those who claim they are failed: they aren't. They are extremely useful in many domains. How do you think researchers figure out correlations between genes and diseases? NNs, as well as other learning algorithms, are used in bioinformatics and other areas. They have been shown to produce extremely good results. NASA now uses them for space station routines, like predicting battery life. Some people will say that support vector machines, etc. are better... but there is no evidence of that; other algorithms are just newer.
It is really too bad people still make the claim that neural networks are failed because they are much simpler than the human brain --- neural networks are no longer used to model brains --- that was 50 years ago..."

Source - http://programmers.stackexchange.com/questions/72093/what-is-a-neural-network-in-simple-words
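The quoted answer's point, that a neural network is just a function whose weights get adjusted by back-propagation, can be demonstrated in a few lines. This is my own toy sketch learning XOR; the architecture, seed, and learning rate are arbitrary choices.

```python
import numpy as np

# A neural network as "just a function": two matrix multiplies with sigmoids,
# trained by back-propagation to approximate XOR.
rng = np.random.default_rng(1)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([[0], [1], [1], [0]], dtype=float)

W1 = rng.normal(size=(2, 8)); b1 = np.zeros(8)
W2 = rng.normal(size=(8, 1)); b2 = np.zeros(1)
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

lr = 0.5
for _ in range(5000):
    h = sigmoid(X @ W1 + b1)                 # forward pass: function evaluation
    out = sigmoid(h @ W2 + b2)
    grad_out = (out - y) * out * (1 - out)   # error pushed back through...
    grad_h = grad_out @ W2.T * h * (1 - h)   # ...the chain rule
    W2 -= lr * h.T @ grad_out; b2 -= lr * grad_out.sum(axis=0)
    W1 -= lr * X.T @ grad_h;   b1 -= lr * grad_h.sum(axis=0)

print(out.ravel())  # outputs should approach [0, 1, 1, 0]
```

Once training stops, the weights are frozen and the whole thing really is just a fixed mathematical function mapping input vectors to output values, exactly as the quote says.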