
Showing posts with label anomaly detection.

Monday, March 28, 2016

#Cloud of #containers, #dockers and #microservices requires Management by Exception

From "The Challenge of Monitoring Containers at Scale"

...."Monitoring systems generally rely on the operator to define ‘normal’. With the rate of change in today’s dynamic environments being driven by auto-scaling and scheduled infrastructures, defining normality becomes a challenge. So far the monitoring community has done a great job of focusing on automating metrics collection and alerting on those predefined thresholds. We now need to focus on algorithmically detecting faults or anomalies and alerting on them"

...."key requirement is anomaly detection. Due to the massive scale nobody can look at all these numbers manually. So monitoring systems have to learn normal behaviour and indicate when system behaviour is not normal any more.

Wednesday, October 24, 2012

Non-MASF-Based Statistical Techniques (Entropy-Based) for Anomaly Detection in Data Centers (and Clouds)

The following papers, published on Mendeley, criticize the MASF Gaussian assumption and offer other methods (Tukey and Relative Entropy) to detect anomalies statistically. (BTW, I tried to use entropy analysis to capture performance anomalies - check my other post.)

1. Statistical techniques for online anomaly detection in data centers
by Chengwei Wang, Krishnamurthy Viswanathan, Lakshminarayan Choudur, Vanish Talwar, Wade Satterfield, Karsten Schwan
  
Abstract
Online anomaly detection is an important step in data center management, requiring light-weight techniques that provide sufficient accuracy for subsequent diagnosis and management actions. This paper presents statistical techniques based on the Tukey and Relative Entropy statistics, and applies them to data collected from a production environment and to data captured from a testbed for multi-tier web applications running on server class machines. The proposed techniques are lightweight and improve over standard Gaussian assumptions in terms of performance.
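For intuition only, here is a minimal R sketch of the two ingredients named in the abstract - Tukey-style fences on raw samples and relative entropy between a current window and a baseline. This is an illustration of the general idea, not the paper's actual algorithm, and all function and variable names are made up.

# Tukey-fence style scoring: flag samples far outside the interquartile range.
tukey_anomalies <- function(x, k = 3) {
  q <- quantile(x, c(0.25, 0.75), na.rm = TRUE)
  fence <- k * (q[2] - q[1])
  which(x < q[1] - fence | x > q[2] + fence)   # indices of suspicious samples
}

# Relative entropy (KL divergence) of the current window's histogram
# against a baseline histogram built on the same breaks.
relative_entropy <- function(current, baseline, bins = 20) {
  breaks <- seq(min(c(current, baseline)), max(c(current, baseline)),
                length.out = bins + 1)
  p <- hist(current,  breaks = breaks, plot = FALSE)$counts
  q <- hist(baseline, breaks = breaks, plot = FALSE)$counts
  p <- (p + 1) / sum(p + 1)   # add-one smoothing avoids log(0)
  q <- (q + 1) / sum(q + 1)
  sum(p * log(p / q))         # a large value means the distribution has shifted
}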

2. Online detection of utility cloud anomalies using metric distributions
by Chengwei Wang, V. Talwar, K. Schwan, P. Ranganathan

Abstract

The online detection of anomalies is a vital element of operations in data centers and in utility clouds like Amazon EC2. Given ever-increasing data center sizes coupled with the complexities of systems software, applications, and workload patterns, such anomaly detection must operate automatically, at runtime, and without the need for prior knowledge about normal or anomalous behaviors. Further, detection should function for different levels of abstraction like hardware and software, and for the multiple metrics used in cloud computing systems. This paper proposes EbAT - Entropy-based Anomaly Testing - offering novel methods that detect anomalies by analyzing for arbitrary metrics their distributions rather than individual metric thresholds. Entropy is used as a measurement that captures the degree of dispersal or concentration of such distributions, aggregating raw metric data across the cloud stack to form entropy time series. For scalability, such time series can then be combined hierarchically and across multiple cloud subsystems. Experimental results on utility cloud scenarios demonstrate the viability of the approach. EbAT outperforms threshold-based methods with on average 57.4% improvement in accuracy of anomaly detection and also does better by 59.3% on average in false alarm rate with a `near-optimum' threshold-based method.
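As a rough, hypothetical illustration of the entropy-time-series idea (not the EbAT implementation, which aggregates metrics across the whole cloud stack), one can bin a raw metric per time window, compute each window's Shannon entropy, and then watch that series for unusual jumps. Names below are illustrative only.

# Entropy of a metric's value distribution, computed per time window.
window_entropy <- function(metric, window = 60, bins = 10) {
  breaks <- seq(min(metric), max(metric), length.out = bins + 1)
  groups <- ceiling(seq_along(metric) / window)
  sapply(split(metric, groups), function(w) {
    counts <- tabulate(cut(w, breaks = breaks, include.lowest = TRUE), nbins = bins)
    p <- counts / sum(counts)
    p <- p[p > 0]
    -sum(p * log(p))                          # Shannon entropy of the window
  })
}

# e <- window_entropy(cpu_samples)            # 'cpu_samples' is a made-up vector
# which(abs(e - mean(e)) > 3 * sd(e))         # windows whose entropy jumped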

Saturday, October 20, 2012

Theory of Anomaly Detection: Stanford University Video Lectures

These are part of the Machine Learning lectures: https://class.coursera.org/ml/lecture/preview/index.

XV. Anomaly Detection (Week 9)
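The core recipe taught there is density estimation: fit a Gaussian to each feature on mostly-normal training data, multiply the per-feature densities for a new example, and call it an anomaly when the product falls below a threshold epsilon chosen on a labeled validation set. A tiny R sketch (the names are mine, not the course's):

# X: matrix with examples in rows and features in columns.
fit_gaussians <- function(X) {
  list(mu = colMeans(X), sigma2 = apply(X, 2, var))
}

# Joint density of one example under the fitted per-feature Gaussians.
p_of_x <- function(x, model) {
  prod(dnorm(x, mean = model$mu, sd = sqrt(model$sigma2)))
}

is_anomaly <- function(x, model, epsilon = 1e-4) {
  p_of_x(x, model) < epsilon                  # TRUE means flag as anomalous
}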




Thursday, August 2, 2012

SEDS-Lite: Using Open Source Tools (R, BIRT, MySQL) to Report and Analyze Performance Data - my new CMG'12 paper

2020 UPDATE: The SEDS-Lite web app is about to be released!
_________________________________________________________
I wrote this paper with some help from Shadi G. (from Dublin, also an IBMer).
The paper is based on my blog postings:
SEDS-Lite Presentation at Southern CMG Meeting in the SAS Institute
SEDS-Lite Introduction
How To Build IT-Control Chart - Use the Excel Pivot Table!
BIRT based Control Chart

HERE IS THE VIDEO PRESENTATION
Below is the abstract:
Statistical Exception Detection (SEDS) is one of the variations of the learning-behavior-based performance analysis methodology developed, implemented and published by the author. This paper takes the main SEDS tools – the IT-Control Chart and the Exception (Anomaly) Detector – and shows how they can be built with open-source BI tools such as R, BIRT and MySQL, or just with a spreadsheet. The paper includes source code, tool screenshots and report input/output examples to allow the reader to build a light version of SEDS.
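The real source code is in the paper itself; purely as an illustration of the underlying MASF-style idea, here is a rough base-R sketch. It assumes performance data frames with illustrative columns 'weekhour' (1..168) and 'value': for every hour of the week the historical weeks give a baseline mean and standard deviation, and actual values outside mean +/- 3*sd are reported as exceptions.

detect_exceptions <- function(history, actual, k = 3) {
  mu   <- tapply(history$value, history$weekhour, mean)   # baseline profile
  sdev <- tapply(history$value, history$weekhour, sd)
  h <- as.character(actual$weekhour)
  upper <- mu[h] + k * sdev[h]                             # upper control limit
  lower <- mu[h] - k * sdev[h]                             # lower control limit
  cbind(actual,
        lower = lower,
        upper = upper,
        exception = actual$value > upper | actual$value < lower)
}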
-------------------------
The presentation of this paper is scheduled for Wednesday, December 5th, 2012, 2:45 PM - 3:45 PM in Las Vegas, Nevada
-------------------------

THIS IS MY SECOND CMG'12 PAPER. THE FIRST ONE WAS ANNOUNCED HERE:

AIX frame and LPAR level Capacity Planning. Use Case for Online Banking Application

Friday, February 17, 2012

Forrester’s “APM and BTM” about CEP - Complex Event Processing


Continuing the subject of the previous post, I looked at another APM study, published a bit earlier (2010) by Forrester Research, Inc. and called “Competitive Analysis: Application Performance Management And Business Transaction Monitoring”. The research can be downloaded here.

I found that this research also acknowledges the importance of “self-learning” techniques for APM and treats them as a part of CEP - Complex Event Processing.

Based on the research:

“...The Next Step: APM, BTM, BPM, And CEP Converge. Complex event processing (CEP) is most probably the first step in the evolution of application performance management. All products reviewed are using some form of statistical-based analysis to distinguish normal from abnormal behavior of applications and transactions. Nastel seems to have taken this analysis one step further by adding a level of inference to its solution. Progress Software has already made the jump into CEP by combining its expertise in BTM and BPM. OpTier recently acquired a solution and announced its intention to enter the advanced field of CEP. SL Corporation, based on its process control automation past, has provided event correlation for a long time, and further integrates with major CEP vendors...”

Below are the vendors that Forrester’s research mentioned as having some CEP features:

BMC
 BPPM Application, Database, and Middleware Monitoring with Analytics monitors transactions running through Web application servers and messaging middleware as well as packaged applications like SAP, Oracle Applications, PeopleSoft, and Siebel CRM. Data collected is automatically integrated with a self-learning analytics engine.

NetIQ
AppManager Performance Profiler is a self-learning, continuously configuring, and continuously adapting technology that profiles dynamic application behavior and sends Trusted Alarms that help troubleshoot system incidents.

IBM
..(Tivoli) proactively defines autothresholds based on normal behavior.

Nastel Technologies.
AutoPilot CEP integrates events from AutoPilot and third-party monitoring solutions to provide a predictive analysis of application and transaction behavior (normal versus abnormal) and provides a role-based dashboard.

SL Corporation.
RTView Historian allows for persistence of performance metrics via relational databases. The historical data is used for predictive analysis of trends in component and application behavior; historical data provides the ability to create trusted alerts triggered not against fixed thresholds but against dynamically calculated baselines that take into account typical loads during different periods of the workday.

Correlsense
SharePath builds a transaction model for each transaction type to show how it typically utilizes the infrastructure and then creates automatic baselines to provide alerting capabilities and information about a deviation from normal operating tolerances.

Progress Software.
Progress Apama (also part of the RPM Suite) can take information from Actional and perform complex pattern detection activities around it, looking for anomalies that Actional might not otherwise detect. This might include, for example, detecting a cross-correlation between different transactions that might be the root cause of an issue.

Monday, January 23, 2012

Quantifying Imbalance in Computer Systems: CMG'11 Trip Report, Part 2

UPDATE 2018:
The technique was successfully tested in SonR (a SEDS-based anomaly detection system), as described in the following post:

"My talk, "Catching Anomaly and Normality in Cloud by Neural Net and Entropy Calculation", has been selected for #CMGimpact 2019"

_______________________________________________________  original post:
As I promised in CMG'11 Trip Report, Part 1, here are my comments and some follow-up analysis of the following paper: Quantifying Imbalance in Computer Systems, written and presented at CMG'11 by Charles Loboz from Windows Azure.

The idea is to calculate the imbalance of a system by using the entropy property, which is well known in physics, economics and information theory.

In my other past posting I raised the following question:
"Can information theory (entropy analysis) be applied to performance exception detection?"

It looks like the idea from the mentioned CMG paper of applying entropy calculation to system performance data could lead to an answer to that question!

 Here is the quote from the paper: 



"...Theil index is based on entropy - it describes the excess entropy in a system. For a data set xi,
i=1..n the Theil index is given by:

where n is the number of elements in the data set and xavg is the average value of all elements in the data set. To underline the application of the Theil index to measure  imbalance in computer systems we call it henceforth the Imbalance Coefficient (IC). 

Examining closer the IC formula above we can derive several properties:
  • (1) the ratio x_i/x_avg describes how much element i is above or below the average for the whole set. Thus IC involves only the ratio of each element against the average, not the absolute values of the elements.
  • (2) IC is dimensionless – thus it allows comparing imbalance between sets of substantially different quantities, for example when one set contains disk utilizations and another disk response times.
  • (3) The minimum value of IC is zero - when all elements of the data set are identical. The maximum value of the Imbalance Coefficient is log(n) - when all elements but one are equal; the maximum IC thus depends on the set size.
  • (4) We can view the Imbalance Coefficient as a description of how concentrated the use of some resource is – large values mean fewer users use most of the resource, small values mean more equal sharing.

We also define, for convenience, the Normalized Imbalance Coefficient (nIC) as

nIC = IC / log(n)

to account for both the imbalance within the set and the maximum entropy in that set. The nIC value ranges from 0 to 1, thus enabling comparison of imbalance between data sets with differing numbers of elements..."

The author applied this to the analysis of utilization across multiple disks, but he mentioned that the approach could be used for measuring the imbalance of other computer subsystems. So I decided to try calculating the imbalance of CPU utilization during a day (24 hours) and a week (168 hours), because the imbalance of capacity usage during a day or week is a pretty common concern. Also, using my way of grouping baseline vs. actual data, I applied it twice to compare an "average" weekly/daily utilization vs. the last week/days of actual utilization.
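Here is a small R sketch of the IC and nIC from the quote above, applied the way just described: compute nIC once for the averaged (baseline) hourly profile and once for the actual hours, then compare the two. (The calculation in this post was actually done in a spreadsheet; the vector names below are illustrative.)

# IC = (1/n) * sum( (x_i / x_avg) * ln(x_i / x_avg) )
imbalance_coefficient <- function(x) {
  n <- length(x)
  r <- x / mean(x)
  r <- r[r > 0]                               # treat 0 * log(0) as 0
  sum(r * log(r)) / n
}

# Normalized to 0..1 by dividing by the maximum possible value log(n).
nIC <- function(x) imbalance_coefficient(x) / log(length(x))

# baseline_cpu and actual_cpu: 168 hourly utilization values each
# nIC(actual_cpu) - nIC(baseline_cpu)         # a large difference suggests a pattern change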

The raw data is the same as for the last Control Charting exercise I published here in the series of posts (see EV-Control Chart as an example), where the actual data (in black) vs. historical averages (in green) are shown below:

Here is the result of calculating the actual vs. averaged nIC imbalance difference for all 168 hours and for each weekday (7 days by 24 hours):


You can see that on the day when the CPU usage anomaly started - Wednesday - the imbalance was significantly different, and the overall weekly imbalance was significantly different too! So indeed this metric can be used to capture some performance metric anomalies (pattern changes).

FYI: Here is the spreadsheet snapshot with actual calculation I used: 

How much better this method of checking imbalance change is compared with more traditional ways of doing it (e.g., based on deviations) is hard to say. My personal preference is still the EV-concept. Anyway, someone needs to try it against more data...

BTW I have found another paper which relates to that topic:

Quantifying Load Imbalance on Virtualized Enterprise Servers by
Emmanuel Arzuaga and David R. Kaeli

That paper contains a clear statement about imbalance: "A typical imbalance metric based on the resource utilization of physical servers is the standard deviation of the CPU utilization".

Still, I believe entropy is an interesting system property that should give us an additional good source of information for pattern recognition. For instance, the balance of capacity usage of large frames with many LPARs (AIX p7s or VMware hosts) could be monitored with the nIC metric, possibly driving an automatic way of rebalancing capacity usage via partition mobility or vMotion technologies.

Friday, January 20, 2012

Control Chart usage in "Automated Analysis of Load Testing Results"


Searching again in http://academic.research.microsoft.com, I found that CMG papers are not the only ones discussing anomaly detection/control charting subjects in the Systems Capacity Management field. Below are a few examples:

1. Automated Analysis of Load Testing Results, Zhen Ming Jiang, published in: International Symposium on Software Testing and Analysis (ISSTA), pp. 143-146, 2010


From the Abstract of the paper: "...This dissertation proposes automated approaches to detect functional and performance problems in a load test by mining the recorded load testing data (execution logs and performance metrics)..."

The paper references three others (see below) that relate to the subject of this blog, I believe:



- I. A. Trubin and L. Merritt. Mainframe Global and Workload Level Statistical Exception Detection System, Based on MASF. In 2004 CMG Conference, 2004

Here is the context in which my paper was referenced:
"... It is di cult for humans to interpret raw performance
metrics, as it is not clear how to categorize these raw met-
ric values into performance categories (e.g. high, medium
and low). Furthermore, some data mining algorithms (e.g.
Navie Bayes Classi er) only take discrete values as input.
We are currently exploring generic approaches to classify
performance metrics into discrete performance categories us-
ing techniques like control charts [Trubin's CMG'04 paper] to facilitate our future
work in performance analysis...."
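As a hypothetical illustration of what that quote describes - turning raw metric values into discrete categories with control-chart limits so that a classifier can consume them - here is a small R sketch (the column and category names are mine, not the cited paper's):

# Map raw metric values to "low" / "normal" / "high" using baseline control limits.
categorize_metric <- function(values, baseline, k = 2) {
  mu   <- mean(baseline)
  sdev <- sd(baseline)
  cut(values,
      breaks = c(-Inf, mu - k * sdev, mu + k * sdev, Inf),
      labels = c("low", "normal", "high"))
}

# Example: categorize_metric(this_week_mips, last_year_mips)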

BTW, here is a slide with the MIPS control chart from that paper's presentation:



2. L. Cherkasova, K. Ozonat, N. Mi, J. Symons, and E. Smirni. Anomaly? Application Change? or Workload Change? Towards Automated Detection of Application Performance Anomaly and Change. In IEEE International Conference on Dependable Systems and Networks, 2008.


3. B. Anton, M. Leonardo, and P. Fabrizio. AVA: Automated Interpretation of Dynamically Detected Anomalies. In Proceedings of the Eighteenth International Symposium on Software Testing and Analysis, 2009.

I plan to find and read the last two papers and maybe report something here....