System Management by Exception: Performance Problem Diagnosis in Cloud Infrastructures

Friday, July 8, 2016

Performance Problem Diagnosis in Cloud Infrastructures

I have been notified by RG about a new reference to my paper "Capturing Workload Pathology by Statistical Exception Detection System". The following interesting thesis referenced my work:

Performance Problem Diagnosis in Cloud Infrastructures

Umeå University

Abstract

Cloud datacenters comprise hundreds or thousands of disparate application services, each having stringent performance and availability requirements, sharing a finite set of heterogeneous hardware and software resources. The implication of such a complex environment is that the occurrence of performance problems, such as slow application response and unplanned downtimes, has become the norm rather than the exception, resulting in decreased revenue, damaged reputation, and huge human effort in diagnosis. Though causes can be as varied as application issues (e.g. bugs), machine-level failures (e.g. faulty server), and operator errors (e.g. mis-configurations), recent studies have attributed capacity-related issues, such as resource shortage and contention, as the cause of most performance problems on the Internet today. As cloud datacenters become increasingly autonomous there is a need for automated performance diagnosis systems that can adapt their operation to reflect the changing workload and topology in the infrastructure. In particular, such systems should be able to detect anomalous performance events, uncover manifestations of capacity bottlenecks, localize actual root-cause(s), and possibly suggest or actuate corrections.
This thesis investigates approaches for diagnosing performance problems in cloud infrastructures. We present the outcome of an extensive survey of existing research contributions addressing performance diagnosis in diverse systems domains. We also present models and algorithms for detecting anomalies in real-time application performance and identifying anomalous datacenter resources based on operational metrics and spatial dependency across datacenter components. Empirical evaluations of our approaches show how they can be used to improve end-user experience and service assurance and to support root-cause analysis.

Igor Trubin

He started in 1979 as an IBM/370 system engineer. In 1986 he received his PhD in Robotics at St. Petersburg Technical University (Russia) and then worked as a professor teaching CAD/CAM and Robotics for 12 years. He published 30+ papers and made several presentations at conferences related to the Robotics and Artificial Intelligence fields. In 1999 he moved to the US and worked at Capital One bank as a Capacity Planner. His first CMG.org paper was written and presented in 2001. The next one, "Exception Detection System Based on MASF Technique," won a Best Paper award at CMG'02 and was presented at UKCMG'03 in Oxford, England. He made other technical presentations at IBM z/Series Expo, SPEC.org, and Southern and Central Europe CMG, and ran several workshops covering his original method of Anomaly and Change Point Detection (Perfomalist.com). He is the author of the “Performance Anomaly Detection” class (at CMG.com). He worked for 2 years as the Capacity team lead for IBM, for SunTrust Bank for 3 years, and then at IBM for 3 years as Sr. IT Architect. Now he works for Capital One bank as IT Manager in Cloud Engineering, and since 2015 he has been a member of the CMG.org Board of Directors. He runs the UT channel iTrubin.

