Tuesday, December 17, 2019

WOSP-C 2020 at #ICPE (#spec) #conference: Workshop on Challenges and Opportunities in Large Scale #Performance. Call for Contributions

Topic: Performance and System Measurement and Analytics
Dates: April 20th or 21st, 2020. Co-located with @ICPE2020, Edmonton, Alberta. wosp-c.github.io/wosp-c-20/

Invited Speakers 

  • Dr. Alberto Avritzer, esuLabs Solutions. Topic: Automated Scalability Assessment in DevOps Environments
  • Dr. Tomer Morad, Concertio, New York, USA. Topic: Leveraging Machine Learning to Automate Performance Tuning
  • Dr. Igor Trubin (this is me), Capital One Bank, Virginia, USA. Topic: Performance Anomaly and Change Point Detection for Large-Scale System Management
  • Dr. Boris Zibitsker, BEZNext, Illinois, USA. Topic: TBA


This workshop will include invited talks to stimulate ideas, contributed papers, and a discussion session on topics that would benefit from in-depth consideration. Parallel discussion sessions may be organized towards the end of the day depending on interests.


Submissions may take the form of two-page extended abstracts or six-page papers. All submissions will be reviewed by at least two members of the program committee. They should be submitted via EasyChair (WOSPC2020).
  • Submission deadline: January 12, 2020
  • Notifications to authors: February 2nd, 2020
  • Camera-ready copy (hard deadline): February 24th, 2020

Thursday, December 12, 2019

That is very sad #CMGnews - www.CMG.org member Dr. Bernard Domanski passed away. I will never forget his presentations at CMG conferences! One of them inspired me to become a blogger, and I am still at it (www.Trub.in); I will remember him in my future postings there.

Looking back to 2019 - CMG'19 relevant presentation #4: Catching Anomaly And Normality In Cloud By Neural Net And Entropy Calculation

It is really relevant as it is my presentation....

IMPACT 2019: Catching Anomaly And Normality In Cloud By Neural Net And Entropy Calculation – Igor Trubin, Capital One Bank

Looking back to 2019 - CMG'19 relevant presentation #3: Capacity Planning Under Cloudy Skies

It is relevant because this presentation's keynote speaker is my product owner!

IMPACT 2019: Capacity Planning Under Cloudy Skies – Brian Wong, Capital One

Much of the world is moving to cloud, and many significant parts of IT are changing at the same time. Capacity planning is far from obsolete, but practicing it in a rapidly changing environment will be very different from practicing it in the past. This session shows where the field is going.
To view the full video you must have an IMPACT 2019 video membership. Sign up today!

Looking back to 2019 - CMG'19 relevant presentation #2: The Motivations And Benefits Of Building A Home-Grown Capacity Management System

That is relevant because throughout my capacity management career I have dealt with home-grown capacity management systems (and created one of them myself: SETDS/SonR).
So, the title:

IMPACT 2019: The Motivations And Benefits Of Building A Home-Grown Capacity Management System – Len Wyatt, Blackbaud, Inc.

There are commercial capacity management tools out there, so why would anyone build their own? For Blackbaud, there were both technical and organizational reasons. There have been some surprising benefits. This talk surveys the motivations, the architecture, the benefits and the tradeoffs of creating your own Capacity Management Data Warehouse (CMDW).
The primary uses for CMDW include monitoring current systems for capacity concerns, trying to anticipate upcoming needs, forecasting resource needs as we move from physical servers to a virtualized data center and to a cloud-based infrastructure, and troubleshooting issues based on data that was not visible before. On the monitoring front, as the number of systems monitored has grown, we have moved toward a process of looking for statistical anomalies in the data and having people investigate data that was first uncovered by the statistics. Forecasting is still a spreadsheet-driven process, but the CMDW provides data that lets us project from current environments to future configurations.
This talk will outline the variety of data sources that feed into CMDW, dive into how a traditional data warehouse architecture using a relational database has worked as the core mechanism, and where we extended the concepts of a relational warehouse for synchronizing disparate data sources and for doing statistical analysis. A look at the varied uses of CMDW comes next: both the things we planned to do and the things that popped up once people saw the ability to collect and analyze data in new ways. At the end, we’ll “fess up” to some of the issues that came up as well.
To view the full video you must have an IMPACT 2019 video membership. Sign up today!

Looking back to 2019 - CMG'19 relevant presentation #1: The Fine Art Of Combining Capacity Management With Machine Learning

IMPACT 2019: The Fine Art Of Combining Capacity Management With Machine Learning – Charles W. Johnson Jr., Syncsort, Inc.

Capacity Management within the enterprise continues to evolve. In the past we were focused on the hardware; now we are focused on services. With that in mind, the amount of data available has increased significantly and has become difficult for the Capacity Manager to sort through.
To be successful with this discipline going forward, we need the machines to do more of the heavy lifting. This includes automatically creating reports, calling out anomalies, and producing forecasts. All this still requires the human computer to perform the sanity checks on the anomalies and forecasts. The intuition of the human computer is imperative to our success. The bonding of the human computer and physical machine has become critical in performing Capacity Management.
In this presentation, we will discuss Capacity Management with and without Machine Learning, provide examples of what Machine Learning can provide in the process, and demonstrate outcomes using the strengths of both to make Capacity Management a successful component within your organization.
To view the full video you must have an IMPACT 2019 video membership. Sign up today!

Tuesday, October 15, 2019


#CMGnews: Meeting of the Minds: AI, ML, and DL at Impact'20 conference - I GO! YOU?

Meeting of the Minds: AI, ML, and DL

Join Amy Peck and Igor Trubin for this birds-of-a-feather session on Artificial Intelligence (AI), Machine Learning (ML), and Deep Learning (DL). During this hour-long, casual session, participants will have the opportunity to engage in an open forum and share their own work and challenges related to AI, ML, and DL.
Topics: AI, ML, and DL
Time and Day: 5:15PM Monday, February 10

Friday, June 14, 2019


Re-posting an interesting article from Norm Matloff:


R vs. Python for Data Science

Norm Matloff, Prof. of Computer Science, UC Davis

Hello! As a professional computer scientist and statistician, I hope to shed some useful light on the perennial R-vs.-Python debates in the Data Science community. I have potential bias — I've written 4 R-related books, and currently serve as Editor-in-Chief of the R Journal — but I hope this analysis will be considered fair and helpful.


Elegance

Clear win for Python.
This is subjective, of course, but having written (and taught) in many different programming languages, I really appreciate Python's greatly reduced use of parentheses and braces. In Python:
if x > y:
   z = 5
   w = 8
versus in R:
if (x > y) {
   z <- 5
   w <- 8
}
Python is sleek!

Learning curve

Huge win for R.
To even get started in Data Science with Python, one must learn a lot of material not in base Python, e.g., NumPy, Pandas and matplotlib.
By contrast, matrix types and basic graphics are built into base R. The novice can be doing simple data analyses within minutes. Python libraries can be tricky to configure, even for the systems-savvy, while most R packages run right out of the box.
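As a minimal sketch of the point: even a simple column-means calculation in Python assumes the third-party NumPy library, whereas base R has colMeans() built in.

```python
# Column means of a small matrix. In base R this is just colMeans(x);
# in Python even this much requires importing NumPy.
import numpy as np

x = np.array([[1.0, 2.0],
              [3.0, 4.0],
              [5.0, 6.0]])
col_means = x.mean(axis=0)  # analogous to colMeans(x) in R
print(col_means)  # [3. 4.]
```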

Available libraries

Call it a tie.
CRAN has over 12,000 packages. PyPI has over 183,000, but it seems thin on Data Science.
For example, I once needed code to do fast calculation of nearest-neighbors of a given data point. (NOT code using that to do classification.) I was able to immediately find not one but two packages to do this. By contrast, just now I tried to find nearest-neighbor code for Python and at least with my cursory search, came up empty-handed; there was just one implementation that described itself as simple and straightforward, nothing fast.
The following searches in PyPI turned up nothing: log-linear model; Poisson regression; instrumental variables; spatial data; familywise error rate; etc.
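For context, a brute-force nearest-neighbor lookup is only a few lines in either language; what was hard to find was fast, specialized code. A hedged NumPy sketch of the brute-force version:

```python
import numpy as np

def nearest_neighbor(data, point):
    """Index of the row of `data` closest to `point` (Euclidean).
    Brute force: O(n) per query, no indexing -- the opposite of the
    fast k-d-tree-style code discussed above."""
    dists = np.linalg.norm(data - point, axis=1)  # distance to each row
    return int(np.argmin(dists))

data = np.array([[0.0, 0.0], [1.0, 1.0], [5.0, 5.0]])
print(nearest_neighbor(data, np.array([0.9, 1.2])))  # 1
```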

Machine learning

Slight edge to Python here.
The Pythonistas would point to a number of very finely-tuned models, e.g. AlexNet, for image recognition. Good, but R versions easily could be developed. The Python libraries' power comes from setting certain image-smoothing ops, which easily could be implemented in R's Keras wrapper, and for that matter, a pure-R version of TensorFlow could be developed. Meanwhile, I would claim that R's package availability for random forests and gradient boosting is outstanding.

Statistical correctness

Big win for R.
In my book, The Art of R Programming, I made the statement, "R is written by statisticians, for statisticians," which I'm pleased to see pop up here and there on occasion. It's important!
To be blunt, I find that the machine learning people, who mostly advocate Python, often have a poor understanding of, and in some cases even a disdain for, the statistical issues in ML. I was shocked recently, for instance, to see one of the most prominent ML people state in his otherwise outstanding book that standardizing the data to mean 0, variance 1 means one is assuming the data are Gaussian — absolutely false and misleading.
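A quick sanity check of that point, as a sketch using only the standard library: z-scoring a plainly non-Gaussian sample still yields mean 0 and standard deviation 1, because standardization makes no distributional assumption at all.

```python
# Standardizing (z-scoring) any sample gives mean 0 and standard
# deviation 1; no Gaussian assumption is involved. Demonstrated on a
# skewed, clearly non-Gaussian (exponential) sample.
import random
import statistics

random.seed(1)
data = [random.expovariate(1.0) for _ in range(10_000)]  # heavily skewed
mu = statistics.fmean(data)
sigma = statistics.pstdev(data)
z = [(v - mu) / sigma for v in data]
print(statistics.fmean(z), statistics.pstdev(z))  # approximately 0.0 and 1.0
```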

Parallel computation

Let's call it a tie.
Neither the base version of R nor that of Python has good support for multicore computation. Threads in Python are nice for I/O, but parallel computation using them is impossible, due to the infamous Global Interpreter Lock. Python's multiprocessing package is not a good workaround, nor is R's 'parallel' package. External libraries supporting cluster computation are OK in both languages.
Currently Python has better interfaces to GPUs.
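As a minimal sketch of the workaround in question: process-based parallelism via the standard multiprocessing module does sidestep the GIL for CPU-bound work, at the cost of process startup and pickling overhead.

```python
# Process-based parallelism sidesteps the GIL for CPU-bound work;
# threads would not help here. Standard library only.
from multiprocessing import Pool

def heavy(n):
    # Stand-in for some CPU-bound computation.
    return sum(i * i for i in range(n))

if __name__ == "__main__":
    with Pool(processes=4) as pool:
        results = pool.map(heavy, [10, 100, 1000])
    print(results)  # [285, 328350, 332833500]
```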

C/C++ interface

Slight win for R.
Though there are tools like SWIG etc. for interfacing Python to C/C++, as far as I know there is nothing remotely as powerful as R's Rcpp for this at present. The Pybind11 package is being developed.
In addition, R's new ALTREP idea has great potential for enhancing performance and usability.
On the other hand, the Cython and PyPy variants of Python can in some cases obviate the need for explicit C/C++ interface in the first place.

Object orientation, metaprogramming

Slight win for R.
For instance, though functions are objects in both languages, R takes that more seriously than does Python. Whenever I work in Python, I'm annoyed by the fact that I cannot print a function to the terminal, which I do a lot in R.
Python has just one OOP paradigm. In R, you have your choice of several, though some may debate that this is a good thing.
Given R's magic metaprogramming features (code that produces code), computer scientists ought to be drooling over R.

Language unity

Horrible loss for R.
Python is currently undergoing a transition from version 2.7 to 3.x. This will cause some disruption, but nothing too elaborate.
By contrast, R is rapidly devolving into two mutually unintelligible dialects, ordinary R and the Tidyverse. Sadly, this is a conscious effort by a commercial entity that has come to dominate the R world, RStudio. I know and admire the people at RStudio, but a commercial entity should not have such undue influence on an open-source project.
It might be more acceptable if the Tidyverse were superior to ordinary R, but in my opinion it is not. It makes things more difficult for beginners: the Tidyverse has many functions, some of them complex, that must be learned to do what are very simple operations in base R. Pipes, apparently meant to help beginners learn R, actually make it more difficult, I believe. And the Tidyverse is of questionable value for advanced users.

Linked data structures

Likely win for Python.
Classical computer science data structures, e.g. binary trees, are easy to implement in Python. While this can be done in R using its 'list' class, I'd guess that it is slow.
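For instance (a minimal sketch), a binary search tree takes only a few lines of Python, with plain object attributes serving as the links:

```python
# A classical linked structure: a binary search tree built from plain
# objects, with attributes acting as the pointers.
class Node:
    def __init__(self, key):
        self.key = key
        self.left = None
        self.right = None

def insert(root, key):
    """Insert `key` and return the (possibly new) root."""
    if root is None:
        return Node(key)
    if key < root.key:
        root.left = insert(root.left, key)
    else:
        root.right = insert(root.right, key)
    return root

def inorder(root):
    """Keys in sorted order via in-order traversal."""
    if root is None:
        return []
    return inorder(root.left) + [root.key] + inorder(root.right)

root = None
for k in [5, 2, 8, 1, 3]:
    root = insert(root, k)
print(inorder(root))  # [1, 2, 3, 5, 8]
```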

R/Python interoperability

RStudio is to be commended for developing the reticulate package, to serve as a bridge between Python and R. It's an outstanding effort, and works well for pure computation. But as far as I can tell, it does not solve the knotty problems that arise in Python, e.g. virtual environments and the like.
At present, I do not recommend writing mixed Python/R code.

Friday, May 3, 2019

CMG #THEXCHANGE #VirtualConference is all about #scalability and #reliability

The next CMG #VirtualConference has just been announced. This next event is all about
#scalability and #reliability. No need to step out of the office, CMG is bringing the conference to
you. Don't miss out on the #THEXCHANGE virtual conference coming to an iPad or computer
near you on June 18, 2019. Reserve your “virtual seat” here: http://bit.ly/2UDj4IR #cmgnews

Wednesday, May 1, 2019

I will moderate #CMGExpert Roundtable on #AI - please join!

Please join us for our Roundtable on AI. Participants will engage in a small group discussion on the topic, led by the moderator. After the session, CMG will publish a piece recapping it for the rest of the community. Participants have the potential to connect with some of our influential readers. Interested in attending CMG's New Event Series? For more information and to register, click here: http://bit.ly/2DIZuBk