
Tuesday, October 15, 2019


#CMGnews: Meeting of the Minds: AI, ML, and DL at Impact'20 conference - I GO! YOU?

Meeting of the Minds: AI, ML, and DL

Join Amy Peck and Igor Trubin for this birds-of-a-feather session on Artificial Intelligence (AI), Machine Learning (ML), and Deep Learning (DL). During this hour-long, casual session, participants will have the opportunity to engage in an open forum and share their own work and challenges related to AI, ML, and DL.
Topics: AI, ML, and DL
Time and Day: 5:15PM Monday, February 10

Tuesday, August 20, 2019

Friday, June 14, 2019


Re-posting interesting article from


R vs. Python for Data Science

Norm Matloff, Prof. of Computer Science, UC Davis; my bio

Hello! This Web page is aimed at shedding some light on the perennial R-vs.-Python debates in the Data Science community. As a professional computer scientist and statistician, I hope to shed some useful light on the topic. I have potential bias — I've written 4 R-related books, and currently serve as Editor-in-Chief of the R Journal — but I hope this analysis will be considered fair and helpful.


Elegance

Clear win for Python.
This is subjective, of course, but having written (and taught) in many different programming languages, I really appreciate Python's greatly reduced use of parentheses and braces:
if x > y:
   z = 5
   w = 8
vs.
if (x > y)
{
   z = 5
   w = 8
}
Python is sleek!

Learning curve

Huge win for R.
To even get started in Data Science with Python, one must learn a lot of material not in base Python, e.g., NumPy, Pandas and matplotlib.
By contrast, matrix types and basic graphics are built into base R. The novice can be doing simple data analyses within minutes. Python libraries can be tricky to configure, even for the systems-savvy, while most R packages run right out of the box.

Available libraries

Call it a tie.
CRAN has over 12,000 packages. PyPI has over 183,000, but it seems thin on Data Science.
For example, I once needed code to do fast calculation of nearest-neighbors of a given data point. (NOT code using that to do classification.) I was able to immediately find not one but two packages on CRAN to do this. By contrast, just now I tried to find nearest-neighbor code for Python and, at least with my cursory search, came up empty-handed; there was just one implementation that described itself as simple and straightforward, nothing fast.
The following searches in PyPI turned up nothing: log-linear model; Poisson regression; instrumental variables; spatial data; familywise error rate; etc.

Machine learning

Slight edge to Python here.
The Pythonistas would point to a number of very finely-tuned libraries, e.g. AlexNet, for image recognition. Good, but R versions easily could be developed. The Python libraries' power comes from setting certain image-smoothing ops, which easily could be implemented in R's Keras wrapper, and for that matter, a pure-R version of TensorFlow could be developed. Meanwhile, I would claim that R's package availability for random forests and gradient boosting is outstanding.

Statistical correctness

Big win for R.
In my book, The Art of R Programming, I made the statement, "R is written by statisticians, for statisticians," which I'm pleased to see pop up here and there on occasion. It's important!
To be blunt, I find the machine learning people, who mostly advocate Python, often have a poor understanding of, and in some cases even a disdain for, the statistical issues in ML. I was shocked recently, for instance, to see one of the most prominent ML people state in his otherwise outstanding book that standardizing the data to mean-0, variance-1 means one is assuming the data are Gaussian. That is absolutely false and misleading.
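The point about standardization can be checked numerically. The following is a minimal sketch using only the Python standard library; the helper names `standardize` and `skewness` and the choice of an exponential sample are illustrative assumptions, not something from the original article. It rescales a strongly skewed sample to mean 0 and variance 1 and shows that the skewness is unchanged: standardizing shifts and rescales the data, it does not make them Gaussian or require them to be.

```python
import random
import statistics


def standardize(values):
    """Rescale values to mean 0 and (population) variance 1."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return [(v - mean) / sd for v in values]


def skewness(values):
    """Population skewness: the mean cubed z-score (0 for symmetric data)."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return statistics.fmean([((v - mean) / sd) ** 3 for v in values])


random.seed(0)  # fixed seed so the run is repeatable
# Exponential data are strongly right-skewed (theoretical skewness is 2).
raw = [random.expovariate(1.0) for _ in range(10_000)]
scaled = standardize(raw)

print("mean after scaling:    ", round(statistics.fmean(scaled), 6))
print("variance after scaling:", round(statistics.pvariance(scaled), 6))
print("skewness before/after: ", round(skewness(raw), 3), round(skewness(scaled), 3))
```

The scaled sample has mean 0 and variance 1 up to rounding error, yet it is exactly as skewed as the raw sample.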

Parallel computation

Let's call it a tie.
Neither the base version of R nor that of Python has good support for multicore computation. Threads in Python are nice for I/O, but parallel computation using them is impossible, due to the infamous Global Interpreter Lock. Python's multiprocessing package is not a good workaround, nor is R's 'parallel' package. External libraries supporting cluster computation are OK in both languages.
Currently Python has better interfaces to GPUs.

C/C++ interface

Slight win for R.
Though there are tools like swig etc. for interfacing Python to C/C++, as far as I know there is nothing remotely as powerful as R's Rcpp for this at present. The Pybind11 package is being developed.
In addition, R's new ALTREP idea has great potential for enhancing performance and usability.
On the other hand, the Cython and PyPy variants of Python can in some cases obviate the need for explicit C/C++ interface in the first place.

Object orientation, metaprogramming

Slight win for R.
For instance, though functions are objects in both languages, R takes that more seriously than does Python. Whenever I work in Python, I'm annoyed by the fact that I cannot print a function to the terminal, which I do a lot in R.
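For readers who want to see what this looks like on the Python side, here is a small sketch; the function `square` is a made-up example. Printing a function object shows only its name and address. The standard `inspect.getsource` call can recover the source text when Python can locate it, but it is an extra step and can fail, so it is guarded here.

```python
import inspect


def square(x):
    """Return x squared."""
    return x * x


# Printing the function object shows only its name and memory address,
# not its body.
print(square)

# inspect.getsource() returns the source text when Python can locate it.
# It raises OSError when it cannot (for example, for code typed into some
# interactive sessions) and TypeError for built-ins, so the call is guarded.
try:
    print(inspect.getsource(square))
except (OSError, TypeError):
    print("source not available for", square.__name__)
```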
Python has just one OOP paradigm. In R, you have your choice of several, though some may debate whether this is a good thing.
Given R's magic metaprogramming features (code that produces code), computer scientists ought to be drooling over R.

Language unity

Horrible loss for R.
Python is currently undergoing a transition from version 2.7 to 3.x. This will cause some disruption, but nothing too elaborate.
By contrast, R is rapidly devolving into two mutually unintelligible dialects, ordinary R and the Tidyverse. Sadly, this is a conscious effort by a commercial entity that has come to dominate the R world, RStudio. I know and admire the people at RStudio, but a commercial entity should not have such undue influence on an open-source project.
It might be more acceptable if the Tidyverse were superior to ordinary R, but in my opinion it is not. It makes things more difficult for beginners. E.g. the Tidyverse has so many functions, some complex, that must be learned to do what are very simple operations in base R. Pipes, apparently meant to help beginners learn R, actually make it more difficult, I believe. And the Tidyverse is of questionable value for advanced users.

Linked data structures

Likely win for Python.
Classical computer science data structures, e.g. binary trees, are easy to implement in Python. While this can be done in R using its 'list' class, I'd guess that it is slow.
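As an illustration of how little code such a structure needs in Python, here is a minimal sketch of an unbalanced binary search tree. The names `Node`, `insert` and `in_order` are illustrative, and no claim is made about its speed relative to an R implementation.

```python
class Node:
    """One node of a binary search tree."""

    def __init__(self, key):
        self.key = key
        self.left = None
        self.right = None


def insert(root, key):
    """Insert key and return the (possibly new) root; duplicates are ignored."""
    if root is None:
        return Node(key)
    if key < root.key:
        root.left = insert(root.left, key)
    elif key > root.key:
        root.right = insert(root.right, key)
    return root


def in_order(root):
    """Yield the keys in ascending order."""
    if root is not None:
        yield from in_order(root.left)
        yield root.key
        yield from in_order(root.right)


root = None
for k in [5, 2, 8, 1, 9, 3]:
    root = insert(root, k)

print(list(in_order(root)))  # [1, 2, 3, 5, 8, 9]
```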

R/Python interoperability

RStudio is to be commended for developing the reticulate package, to serve as a bridge between Python and R. It's an outstanding effort, and works well for pure computation. But as far as I can tell, it does not solve the knotty problems that arise in Python, e.g. virtual environments and the like.
At present, I do not recommend writing mixed Python/R code.

Friday, May 3, 2019

CMG #THEXCHANGE #VirtualConference is all about #scalability and #reliability

The next CMG #VirtualConference has just been announced. This next event is all about #scalability and #reliability. No need to step out of the office, CMG is bringing the conference to you. Don't miss out on the #THEXCHANGE virtual conference coming to an iPad or computer near you on June 18, 2019. Reserve your “virtual seat” here: http://bit.ly/2UDj4IR #cmgnews

Wednesday, May 1, 2019

I will moderate #CMGExpert Roundtable on #AI - please join!

Please join us for our Roundtable on #AI. Participants will engage in a small group discussion on the topic, led by a moderator. After the session, CMG will publish a piece recapping the session. Participants have the potential to connect with some of our influential readers. Interested in attending CMG’s New Event Series? For more information and to register, click here: http://bit.ly/2DIZuBk

Friday, April 26, 2019

Under #cloud -y skies - #capacityPlanning in changing times ‐ Brian L Wong

Brian Wong is a Technology Fellow at Capital One and his talk was very much about the company’s transition to the cloud. Brian talked about how capacity planning is far from obsolete, but practicing it in a rapidly changing environment is entirely different from how it was practiced in the past. In his talk, Brian addressed containers, microservices, function-as-a-service, and fully managed services. He outlined the next-generation computing environment from an observability, capacity, and analytical perspective, and speculated on the form and value of capacity planning in these environments.

NB: I am proud to work with him now!

Wednesday, January 16, 2019

My CMG IMPACT conference presentation is scheduled for Wednesday, 2/20/19, at 1:30pm - "Catching Anomaly and Normality in Cloud by Neural Net and Entropy Calculation"

See details here: https://cmgimpact.com/timetable/event/catching-anomalies-in-the-cloud/ 

UPDATE: That was a successful presentation; slides and video will be available later for download.
By Igor Trubin, Capital One Bank
Part 1. The Neural Network (NN) is not a new machine learning method. About 12 years ago I was involved as a Capacity Planning resource in a project to build the infrastructure (servers) to run an NN for a fraud detection application. Now NNs get much more attention and popularity as a part of AI, mostly because computing power has increased dramatically and, correspondingly, more tasks can be done using NNs.
The goal of the presentation is to demystify the technique with some simple terms and examples, to show what it actually is and how it could be used for Capacity and Demand management. That is done by developing R code to recognize typical workload patterns, such as OLTP, in the daily profiles of time series performance data.
Part 2. A typical concern is detecting anomalies in short-lived objects, or in objects with very few measurements. Why? There can be many thousands of such objects, so it is important to single out the exceptional ones with anomalies for further investigation. These could be servers or customers that have just started being monitored, or public cloud objects (EC2s, ASGs) that usually have a very short lifespan. The suggested approach to detecting anomalous behavior in this type of object is to estimate the entropy of each object. If the entropy is low, everything should be in order and is most likely OK. If not, there is possible disorder or a mess there, and someone needs to check what is going on with the object. The method is implemented in a cloud-based application written in R that scans all cloud Auto Scaling Groups (ASGs) every hour to detect those that are imbalanced in terms of the number of EC2 instances in the group. That makes it possible to separate out a couple of hundred ASGs from hundreds of thousands of them.
This entropy-based method is well known, and it is described in detail in the following www.Trub.in blog post:
“Quantifying Imbalance in Computer Systems”, which is based on a CMG’12 paper.
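To make the entropy idea from Part 2 concrete, here is a minimal Python sketch. It rests on several assumptions: the post does not give the formula, so the code uses the Shannon entropy of the empirical distribution of a group's observed instance counts. The sample series and the 1-bit threshold are made up for illustration, and the definition in the CMG’12 paper may differ. The production application mentioned above is written in R; Python is used here only for illustration.

```python
from collections import Counter
from math import log2


def shannon_entropy(observations):
    """Shannon entropy (in bits) of the empirical distribution of observations."""
    counts = Counter(observations)
    total = len(observations)
    # Sum of p * log2(1/p) over the distinct observed values.
    return sum((c / total) * log2(total / c) for c in counts.values())


# Hypothetical hourly EC2 instance counts for two Auto Scaling Groups.
steady_asg = [4, 4, 4, 4, 4, 4, 4, 4]
erratic_asg = [1, 6, 2, 9, 4, 7, 3, 8]

THRESHOLD_BITS = 1.0  # illustrative cut-off, not taken from the post

for name, series in [("steady", steady_asg), ("erratic", erratic_asg)]:
    h = shannon_entropy(series)
    flag = "check" if h > THRESHOLD_BITS else "ok"
    print(name, round(h, 3), flag)
```

A group whose instance count never changes has entropy 0 and passes. A group whose count takes a different value every hour reaches the maximum entropy for eight observations (3 bits) and is flagged for a closer look.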