System Management by Exception
This blog relates to experiences in the Systems Capacity and Availability areas, focusing on statistical filtering, pattern recognition, and BI analysis and reporting techniques (SPC, APC, MASF, 6-SIGMA, SEDS/SETDS and others)
Thursday, March 24, 2022
Our poster presentation "SPEC Research — Introducing the #PredictiveAnalytics Working Group" is scheduled at #ICPE2022 #ICPEconf Poster & Demo (Monday - April 11, 2022, 5:15pm)

Wednesday, March 16, 2022
I am happy to co-author 2 papers for #ICPE2022 #ICPEconf
The online conference program (https://icpe2022.spec.org/program_files/schedule/) lists the following presentations of ours:
Poster & Demo (Monday - April 11, 2022, 5:15pm)
André Bauer, Mark Leznik, Md Shahriar Iqbal, Daniel Seybold, Igor Trubin, Benjamin Erb, Jörg Domaschka and Pooyan Jamshidi. SPEC Research — Introducing the Predictive Data Analytics Working Group
Data Challenge (Tuesday - April 12, 2022, 4:15pm - 4:55pm)
Md Shahriar Iqbal, Mark Leznik, Igor Trubin, Arne Lochner, Pooyan Jamshidi and André Bauer. Change Point Detection for MongoDB Time Series Performance Regression

Monday, February 28, 2022
"Change Point Detection (#ChangeDetection) for MongoDB Time Series Performance Regression" paper for ACM/SPEC ICPE 2022 Data Challenge Track
The ACM/SPEC ICPE 2022 - Data Challenge Track Committee has decided to ACCEPT our article:
TITLE: Change Point Detection for MongoDB Time Series Performance Regression
AUTHORS: Md Shahriar Iqbal, Mark Leznik, Igor Trubin, Arne Lochner, Pooyan Jamshidi and André Bauer
CPD - Change Point Detection (#ChangeDetection) is implemented in the free web tool Perfomalist

Wednesday, February 9, 2022
My Cloud Optimization team at #CapitalOne bank won the CMG.org #Innovation Award (#CMGNews)

Thursday, February 3, 2022
My publications in RG got 5000+ reads

Friday, January 21, 2022
Panel Discussion: Roadmap for Cultivating Performance-Aware Software Engineers

"#CloudServers Rightsizing with #Seasonality Adjustments" - my presentation at CMG IMPACT conference (#CMGnews)

Thursday, January 6, 2022
"Performance Anomaly and Change Point Detection for Large-Scale System Management" - my paper published at Springer
Intelligent Sustainable Systems, pp 403-407
Performance Anomaly and Change Point Detection for Large-Scale System Management
Abstract
The presentation starts with the short overview of the classical statistical process control (SPC)-based anomaly detection techniques and tools including Multivariate Adaptive Statistical Filtering (MASF); Statistical Exception and Trend Detection System (SETDS), Exception Value (EV) meta-metric-based change point detection; control charts; business driven massive prediction and methods of using them to manage large-scale systems such as on-prem servers fleet or massive clouds. Then, the presentation is focused on modern techniques of anomaly and normality detection, such as deep learning and entropy-based anomalous pattern detections.
Keywords
Anomaly detection, Change point detection, Business driven forecast, Control chart, Deep Learning, Entropy analysis
References
1. Trubin, I.: Exception based modeling and forecasting. In: Proceedings of Computer Measurement Group (2008)
2. Buzen, J., Shum, A.: MASF - multivariate adaptive statistical filtering. In: Proceedings of Computer Measurement Group (1995)
3. Trubin, I.: Review of IT control chart. CIS J. 4(11), 2079–8407 (2013)
4. Perfomalist Homepage, http://www.perfomalist.com. Last accessed on 10 June 2021
5. Trubin, I., et al.: Systems and methods for modeling computer resource metrics. US Patent 10,437,697 (2016)
6. Trubin, I.: Capturing workload pathology by statistical exception detection. In: Proceedings of Computer Measurement Group (2005)
7. Loboz, C.: Quantifying imbalance in computer systems. In: Proceedings of Computer Measurement Group (2011)

Thursday, December 2, 2021
Dynamics of Anomalies or Phases in a Dynamic Object Life
A dynamic object may go through the following phases in its lifetime:
1. Initial phase to set a norm - anomalies cannot be detected because no baseline sample has been established yet. This phase could be treated later as an outlier.
2. Stable period without any anomalies.
3. Unstable period when anomalies appear, either suddenly or at a gradually increasing rate.
4. Anomalies introduce a new norm, and the rate of anomalies gradually decreases.
5. =>2. The next stable period.
6. =>3. … and so on.
To detect those dynamic object phases one can use Anomaly and Change Point detection methods. One of them is SETDS (described in this blog), which is now implemented as the www.Perfomalist.com tool.
Here is an example of how the Perfomalist test data (Download Input Data Sample) is used to detect stable and unstable periods.
The data covers 28 weeks. To see the dynamics and to catch when anomalies started appearing, the data was divided into 23 data sets.
- The 1st one has 4 initial weeks (initial baseline or reference/learning set) plus the following week (1st "current" week).
- The 2nd one has 5 initial weeks as the next baseline (one week longer) and the following week as the next "current" week.
- The 3rd one and all the rest follow the same mechanism.
Then www.Perfomalist.com was applied 23 times (this could be automated using the Perfomalist APIs) and the results were combined into a spreadsheet.
The table and daily summarized charts are below. The result clearly shows the 2nd (stable) and 3rd (unstable) phases.
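The splitting scheme described above can be sketched in a few lines of Python. This is only an illustration: the function name and parameters are hypothetical, and the sketch assumes the 23 data sets start with a 4-week baseline that grows by one week per set. It does not call Perfomalist; it only lists which weeks form each baseline and which week is "current".

```python
from typing import List, Tuple


def expanding_baseline_splits(
    n_sets: int = 23, initial_baseline: int = 4, n_weeks: int = 28
) -> List[Tuple[range, int]]:
    """Return (baseline_weeks, current_week) pairs; weeks are numbered from 1.

    Hypothetical helper: each set uses a baseline one week longer than the
    previous one, and the week right after the baseline is the "current" week.
    """
    splits = []
    for i in range(n_sets):
        baseline_len = initial_baseline + i
        current_week = baseline_len + 1
        if current_week > n_weeks:
            raise ValueError("not enough weeks of data for the requested number of sets")
        splits.append((range(1, baseline_len + 1), current_week))
    return splits


splits = expanding_baseline_splits()
for baseline, current in splits[:3]:
    print(f"baseline weeks {baseline.start}-{baseline.stop - 1}, current week {current}")
```

Each pair could then be passed to whatever anomaly detection step is in use, with one result row per "current" week.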

Tuesday, November 23, 2021
Join me with CMG – your technology community – at #CMGIMPACT22. Use code Trubin at cmgimpact.com/ for 50% off IMPACT tickets cmgimpact.com/register/ #cmgnews #technology #InformationTechnology #ITconference #ContinuingEducation #ProfessionalDevelopment
When the cloud server rightsizing algorithm calculates the baseline level of an application server's usage for the current year, a seasonal adjustment needs to be calculated and applied by adding the anticipated change, which could increase or decrease the capacity usage. We describe the method and illustrate it against real data.
A cloud server rightsizing recommendation based on seasonality adjustments reflects the seasonal patterns and either prevents potential capacity issues or reduces excess capacity.
Keeping multi-year historical data on the capacity usage of the 4 main subsystems of application servers opens the opportunity to detect seasonality changes and estimate additional capacity needs for CPU, memory, disk I/O, and network. A multi-subsystem approach is necessary, as very often the application is not CPU-intensive but I/O-, memory-, or network-intensive.
Applying the method daily allows downsizing correctly once the peak season passes and the available capacity should be decreased, which is a good way to achieve cost savings.
In the session, the seasonality adjustment method is described in detail and illustrated against real data. The method is based on the author's SETDS methodology, which treats the seasonal variation as an exception (anomaly) and calculates adjustments as variations from a linear trend.
Key Takeaways
- How to build seasonal adjustments into the cloud rightsizing
- To get familiar with cloud objects rightsizing techniques
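The idea of calculating seasonal adjustments as variations from a linear trend can be illustrated with a minimal Python sketch. It is not the SETDS implementation: the monthly granularity, the averaging of deviations per calendar month, the function names and the synthetic data are all assumptions made for the example.

```python
from statistics import mean


def linear_trend(values):
    """Ordinary least-squares line through (0, v0), (1, v1), ...; returns (slope, intercept)."""
    xs = range(len(values))
    x_bar, y_bar = mean(xs), mean(values)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, values))
    slope = sxy / sxx
    return slope, y_bar - slope * x_bar


def seasonal_adjustments(values, period=12):
    """Average deviation from the linear trend for each position in the season."""
    slope, intercept = linear_trend(values)
    deviations = [y - (intercept + slope * i) for i, y in enumerate(values)]
    return [mean(deviations[p::period]) for p in range(period)]


# Synthetic two-year monthly CPU usage (%): slow growth plus a peak every 12th month.
usage = [50 + 0.5 * i + (10 if i % 12 == 11 else 0) for i in range(24)]
adjustments = seasonal_adjustments(usage)
peak_month = max(range(12), key=lambda p: adjustments[p])
print(f"largest seasonal adjustment: month index {peak_month}")
```

A rightsizing baseline for a given month could then be raised or lowered by that month's adjustment before the recommendation is generated.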

Monday, November 22, 2021
The Change Point Detection SETDS based method is implemented as a Perfomalist API. Everybody is welcome to test!
How to use it is explained here:
https://www.trutechdev.com/2021/11/the-change-points-detection-perfomalapi.html

Saturday, October 30, 2021
My presentation "Cloud Servers Rightsizing with Seasonality Adjustments" has been accepted for CMG IMPACT 2022. #CMGnews

Tuesday, September 21, 2021
Got my 1st #AWScertification

Friday, July 30, 2021
"Performance #Anomaly and #ChangePointDetection For Large-Scale System Management" for WorldS4 2021 - my presentation slides deck is available on RG

Friday, July 23, 2021
I'm excited to present my paper "Performance #Anomaly and Change Point Detection for Large-Scale System Management" at 5th World Conference on Smart Trends in Systems, Security and Sustainability

Tuesday, July 20, 2021
Presenting in London - "Performance Anomaly and Change Point Detection For Large-Scale System Management"
I will be presenting my paper at the WorldS4 conference in London:
"Performance Anomaly and Change Point Detection For Large-Scale System Management"
(https://www.researchgate.net/publication/340926055_Performance_Anomaly_and_Change_Point_Detection_For_Large-Scale_System_Management)
Time slot in London time: 04:30 - 06:00 on 29th July 2021

Friday, June 11, 2021
Cloud Capacity Management Explained by CMG.org - #cmgnews
CMG publications about Cloud Capacity Management (some links accessible only for CMG members)
Cloud Capacity Management (PDF doc from Metron-Athene)
8 Things You Need to Know About Capacity Planning for the Cloud (helpsystem)
How to Do Capacity Management in the Cloud (helpsystem).
(in UT) How to do Capacity Management in the Cloud Text is HERE (TeamQuest)

Monday, June 7, 2021
How am I doing? LinkedIn recommendations (this year)

Thursday, March 25, 2021
SEDS based "CLOUD RESOURCES WORKLOAD PROFILING"
Based on the SEDS method, workload profiling of the main cloud objects (AWS EC2, RDS and EBS) is implemented at my current work.
Next Tuesday 3/30 at 12:30 pm EST I will be sharing my experience of building and using this method at the Data Centers and Cloud Infrastructure virtual CMG.org
ABSTRACT: How can one be sure that a cloud object's (e.g., AWS EC2, RDS or EBS) workload fits the rightsized resources (compute, RAM, I/O and network traffic)? It is very difficult to do using raw system performance data from monitoring tools. The best way to do that is to use a weekly workload profile, which is a graphical visualization in the form of a MASF IT-Control chart. This chart shows the stability of the workload and reveals recent anomalies, such as run-aways, memory leaks or, especially important for cloud objects, an unusual number of hours the object is down, all compared with the usual weekly pattern.
This presentation will describe how to build, read, and use workload profiles using real data examples and demonstrate how cloud capacity scaling could be verified.
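A weekly workload profile of this kind can be sketched in Python as follows. This is a simplified illustration rather than the implementation from the presentation: it assumes hourly data (168 values per week), control limits of the mean plus or minus three standard deviations per hour of the week, and made-up data and function names.

```python
from statistics import mean, stdev

HOURS_PER_WEEK = 168


def weekly_profile(history, k=3.0):
    """Build (lower, mean, upper) control limits for each hour of the week.

    history: list of past weeks, each a list of 168 hourly metric values.
    """
    profile = []
    for hour in range(HOURS_PER_WEEK):
        samples = [week[hour] for week in history]
        m, s = mean(samples), stdev(samples)
        profile.append((m - k * s, m, m + k * s))
    return profile


def find_exceptions(current_week, profile):
    """Return the hours of the week whose value falls outside the control limits."""
    return [
        hour
        for hour, value in enumerate(current_week)
        if not (profile[hour][0] <= value <= profile[hour][2])
    ]


# Made-up history: 4 weeks with a daily pattern and a small week-to-week drift.
history = [[40 + (h % 24) + w for h in range(HOURS_PER_WEEK)] for w in range(4)]
# Made-up current week: follows the usual pattern except for a spike at hour 10.
current = [40 + (h % 24) + 1.5 for h in range(HOURS_PER_WEEK)]
current[10] = 95.0

profile = weekly_profile(history)
print("hours outside the control limits:", find_exceptions(current, profile))
```

Plotting the three limit series together with the current week would give the control chart view; the list of flagged hours is the exception part of it.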

Thursday, January 21, 2021
"Performance problem diagnosis in cloud infrastructures" (#CloudComputing #AnomalyDetection #ControlChart)
I found this interesting thesis (2016) by Olumuyiwa Ibidunmoye, which references my 2004 paper (Capturing Workload Pathology by Statistical Exception Detection)
Abstract
Cloud datacenters comprise hundreds or thousands of disparate application services, each having stringent performance and availability requirements, sharing a finite set of heterogeneous hardware and software resources. The implication of such complex environment is that the occurrence of performance problems, such as slow application response and unplanned downtimes, has become a norm rather than exception resulting in decreased revenue, damaged reputation, and huge human-effort in diagnosis. Though causes can be as varied as application issues (e.g. bugs), machine-level failures (e.g. faulty server), and operator errors (e.g. mis-configurations), recent studies have attributed capacity-related issues, such as resource shortage and contention, as the cause of most performance problems on the Internet today. As cloud datacenters become increasingly autonomous there is need for automated performance diagnosis systems that can adapt their operation to reflect the changing workload and topology in the infrastructure. In particular, such systems should be able to detect anomalous performance events, uncover manifestations of capacity bottlenecks, localize actual root-cause(s), and possibly suggest or actuate corrections. This thesis investigates approaches for diagnosing performance problems in cloud infrastructures. We present the outcome of an extensive survey of existing research contributions addressing performance diagnosis in diverse systems domains. We also present models and algorithms for detecting anomalies in real-time application performance and identification of anomalous datacenter resources based on operational metrics and spatial dependency across datacenter components. Empirical evaluations of our approaches shows how they can be used to improve end-user experience, service assurance and support root-cause analysis.
Control Charting Example from the paper

Anomalies Detection and Cloud Platform Selection During DevOps - #CMGimpact conference interesting session (#CMGnews #CloudComputing #AnomalyDetection)
The topic is interesting as it relates to two of my current interests: cloud computing and anomaly detection (AD). It is scheduled for this evening - https://cmgimpact.com/anomalies-detection-and-cloud-platform-selection-during-devops/
Here is the abstract:
"In this session, the presenters will review the challenges of anomaly detection during DevOps and discuss the methodology and use case of cloud platform selection for the application. There will be a focus on applying iterative modeling and gradient optimization to determine the minimum configuration and cost required to support new applications Service Level Goals in different clouds"

Wednesday, January 20, 2021
Enjoying Virtual CMG Impact conference. #CMGnews

Cloud Capacity Management (#CloudComputing #CapacityManagement)
I have created a LinkedIn group to discuss this. Please join: https://www.linkedin.com/groups/13935809/

Tuesday, January 19, 2021
My Agenda for CMG Impact 2021 conference
The conference has started - https://cmgimpact.com/home2021/
Here is my Agenda with sessions I plan to attend:
1/19
11:00 AM - 11:40 AM: Dynamic Capacity Management for Hybrid Multi-Cloud Environment
7:10 PM - 7:50 PM: Failing over without falling over
8:05 PM - 8:45 PM: Automated Data Visualization
1/20
9:10 AM - 9:50 AM: A Guide to Event-driven SRE-inspired DevOps: The end of your monolithic release process
10:05 AM - 10:35 AM: Managing application SLAs using Traces and Metrics
9:00 PM - 9:40 PM: Cost-to-Serve: Computing Transaction Efficiency
1/21
10:05 AM - 10:45 AM: Capacity and performance before and after a major infrastructure transition
10:05 AM - 10:45 AM: MythBusters vs. Queuing Theory – Reality Trumps Theory
11:00 AM - 11:40 AM: Detection of Performance Anomaly in Mobile Network Node Entities in Evolved Packet Core Network Using Deep Embedded Self Organizing Map (DESOM)
7:10 PM - 7:50 PM: The Golden Era of AI and Machine Learning: The case for on-premise AI resources
8:05 PM - 8:45 PM: Anomalies Detection and Cloud Platform Selection During DevOps
8:45 PM - 9:00 PM: Q&A with Boris Zibitsker, Alex Lupersolsky, Pavel Pratasevich, Justin Bleuel
1/26
9:10 AM - 9:50 AM: Determining the Best Use of AI to Meet Your IT Ops Needs
10:05 AM - 10:45 AM: Overcoming Automation Fear in Infrastructure as Code
10:05 AM - 10:45 AM: Technical methods to solve performance issues
10:05 AM - 10:45 AM: Practical Ways to Leverage AI in your IT Operations
11:00 AM - 11:40 AM: Providing Business Value Through Observability
7:10 PM - 7:50 PM: Transforming humanity with Lean, Agile and DevOps methodologies during a global pandemic
8:05 PM - 8:45 PM: How Deep Learning Model Architecture Impacts Optimal Training Configuration in the Cloud
8:05 PM - 8:45 PM: The Past, Present, and Future of Performance Engineering
9:00 PM - 9:40 PM: Capacity Analysis Techniques for VMware VM I/O
9:00 PM - 9:40 PM: Using Machine Learning for Software Capacity Planning
1/27
9:10 AM - 9:50 AM: Keynote Presentation with Harry Moseley, CIO of Zoom Video Communications
10:05 AM - 10:45 AM: Performance Engineering - Fin Ops
11:00 AM - 11:40 AM: Concurrent Users - An analytical approach to proper workload simulation
7:10 PM - 7:50 PM: Everything I Need To Know About AIOps I Learned From My Rice Cooker
8:05 PM - 8:45 PM: Python Performance And Other Non-Functional Testing Techniques
1/28
9:05 AM - 10:05 AM: Digital Twins in a Pandemic: Use Simulation Data for Quadcopter Mission Planning
10:05 AM - 10:45 AM: Improving Forecasting for Capacity Management Using Segmented Regression
10:05 AM - 10:45 AM: When your site fails it can be great for business!
11:00 AM - 11:40 AM: AI Data Acquisition and Governance: Considerations for Success
8:05 PM - 8:45 PM: Cloud Resources Workload Profiling (THAT IS MY PRESENTATION)
8:45 PM - 9:00 PM: Q&A for Cloud Resources Workload Profiling
9:00 PM - 9:40 PM: Performance Engineering Maturity Model: The Four Stages of Performance Culture
THE FULL CALENDAR IS: https://calendar.google.com/calendar/u/0/embed?src=c_grtrojtjq0oiebd7fm2foj375o@group.calendar.google.com&ctz=America/New_York
