
Friday, June 14, 2019

#R-vs.-#Python-for-#DataScience

Re-posting interesting article from

R-vs.-Python-for-Data-Science

R vs. Python for Data Science

Norm Matloff, Prof. of Computer Science, UC Davis; my bio

Hello! This Web page is aimed at shedding some light on the perennial R-vs.-Python debates in the Data Science community. As a professional computer scientist and statistician, I hope to offer a useful perspective on the topic. I have a potential bias: I've written 4 R-related books, and currently serve as Editor-in-Chief of the R Journal. Still, I hope this analysis will be considered fair and helpful.

Elegance

Clear win for Python.
This is subjective, of course, but having written (and taught) in many different programming languages, I really appreciate Python's greatly reduced use of parentheses and braces:
if x > y: 
   z = 5
   w = 8
vs.
if (x > y)
{ 
   z = 5
   w = 8
}
Python is sleek!

Learning curve

Huge win for R.
To even get started in Data Science with Python, one must learn a lot of material not in base Python, e.g., NumPy, Pandas and matplotlib.
By contrast, matrix types and basic graphics are built into base R. The novice can be doing simple data analyses within minutes. Python libraries can be tricky to configure, even for the systems-savvy, while most R packages run right out of the box.

Available libraries

Call it a tie.
CRAN has over 12,000 packages. PyPI has over 183,000, but it seems thin on Data Science.
For example, I once needed code to do fast calculation of nearest-neighbors of a given data point. (NOT code using that to do classification.) I was able to immediately find not one but two packages to do this. By contrast, just now I tried to find nearest-neighbor code for Python and at least with my cursory search, came up empty-handed; there was just one implementation that described itself as simple and straightforward, nothing fast.
The following searches in PyPI turned up nothing: log-linear model; Poisson regression; instrumental variables; spatial data; familywise error rate; etc.
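To make the nearest-neighbor task mentioned above concrete, here is a minimal brute-force sketch in standard-library Python. It is not the fast kind of implementation the author was looking for, only an illustration of the problem: given a query point, return the closest data points (no classification). The function name `nearest_neighbors` and the sample data are hypothetical.

```python
import heapq
import math

def nearest_neighbors(points, query, k=1):
    """Return the k points closest to query by Euclidean distance (brute force)."""
    return heapq.nsmallest(k, points, key=lambda p: math.dist(p, query))

# Hypothetical 2-D data set
data = [(0.0, 0.0), (1.0, 1.0), (2.0, 2.0), (5.0, 5.0)]
print(nearest_neighbors(data, (1.2, 0.9), k=2))  # [(1.0, 1.0), (2.0, 2.0)]
```

Every query scans all points, so this approach is only reasonable for small data sets; tree-based or approximate methods are the usual route to speed.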

Machine learning

Slight edge to Python here.
The Pythonistas would point to a number of very finely-tuned libraries, e.g. AlexNet, for image recognition. Good, but R versions could easily be developed. The Python libraries' power comes from setting certain image-smoothing ops, which could easily be implemented in R's Keras wrapper, and for that matter, a pure-R version of TensorFlow could be developed. Meanwhile, I would claim that R's package availability for random forests and gradient boosting is outstanding.

Statistical correctness

Big win for R.
In my book, the Art of R Programming, I made the statement, "R is written by statisticians, for statisticians," which I'm pleased to see pop up here and there on occasion. It's important!
To be blunt, I find the machine learning people, who mostly advocate Python, often have a poor understanding of, and in some cases even a disdain for, the statistical issues in ML. I was shocked recently, for instance, to see one of the most prominent ML people state in his otherwise outstanding book that standardizing the data to mean-0, variance-1 means one is assuming the data are Gaussian, which is absolutely false and misleading.
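The point about standardization can be checked numerically. The sketch below (standard-library Python; the helper names `standardize` and `skewness` are hypothetical) standardizes a strongly skewed exponential sample. The result has mean 0 and variance 1, yet its skewness is unchanged, so the data are no closer to Gaussian than before: standardizing is just a shift and a rescale, and assumes nothing about the distribution.

```python
import random
import statistics

def standardize(values):
    """Rescale values to mean 0 and (population) variance 1."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return [(v - mean) / sd for v in values]

def skewness(values):
    """Population skewness: the mean cubed z-score (0 for symmetric data)."""
    return statistics.fmean([z ** 3 for z in standardize(values)])

random.seed(0)
# Exponential data: strongly right-skewed, clearly not Gaussian
raw = [random.expovariate(1.0) for _ in range(10_000)]
scaled = standardize(raw)

print(statistics.fmean(scaled))      # approximately 0
print(statistics.pvariance(scaled))  # approximately 1
print(skewness(raw), skewness(scaled))  # the same value, far from 0
```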

Parallel computation

Let's call it a tie.
Neither the base version of R nor that of Python has good support for multicore computation. Threads in Python are nice for I/O, but parallel computation using them is impossible, due to the infamous Global Interpreter Lock. Python's multiprocessing package is not a good workaround, nor is R's 'parallel' package. External libraries supporting cluster computation are OK in both languages.
Currently Python has better interfaces to GPUs.

C/C++ interface

Slight win for R.
Though there are tools like swig etc. for interfacing Python to C/C++, as far as I know there is nothing remotely as powerful as R's Rcpp for this at present. The Pybind11 package is being developed.
In addition, R's new ALTREP idea has great potential for enhancing performance and usability.
On the other hand, the Cython and PyPy variants of Python can in some cases obviate the need for explicit C/C++ interface in the first place.

Object orientation, metaprogramming

Slight win for R.
For instance, though functions are objects in both languages, R takes that more seriously than does Python. Whenever I work in Python, I'm annoyed by the fact that I cannot print a function to the terminal, which I do a lot in R.
Python has just one OOP paradigm. In R, you have your choice of several, though some may debate whether this is a good thing.
Given R's magic metaprogramming features (code that produces code), computer scientists ought to be drooling over R.

Language unity

Horrible loss for R.
Python is currently undergoing a transition from version 2.7 to 3.x. This will cause some disruption, but nothing too elaborate.
By contrast, R is rapidly devolving into two mutually unintelligible dialects, ordinary R and the Tidyverse. Sadly, this is a conscious effort by a commercial entity that has come to dominate the R world, RStudio. I know and admire the people at RStudio, but a commercial entity should not have such undue influence on an open-source project.
It might be more acceptable if the Tidyverse were superior to ordinary R, but in my opinion it is not. It makes things more difficult for beginners. E.g. the Tidyverse has so many functions, some complex, that must be learned to do what are very simple operations in base R. Pipes, apparently meant to help beginners learn R, actually make it more difficult, I believe. And the Tidyverse is of questionable value for advanced users.

Linked data structures

Likely win for Python.
Classical computer science data structures, e.g. binary trees, are easy to implement in Python. While this can be done in R using its 'list' class, I'd guess that it is slow.
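As an illustration of how little code a linked structure needs in Python, here is a minimal binary search tree sketch. The class and function names are hypothetical, and nothing here measures the speed difference against R that the author guesses at.

```python
class Node:
    """One node of a binary search tree, linked to its children by reference."""
    def __init__(self, key):
        self.key = key
        self.left = None
        self.right = None

def insert(root, key):
    """Insert key into the tree rooted at root; return the (possibly new) root."""
    if root is None:
        return Node(key)
    if key < root.key:
        root.left = insert(root.left, key)
    elif key > root.key:
        root.right = insert(root.right, key)
    return root  # duplicate keys are ignored

def in_order(root):
    """Yield the keys in ascending order."""
    if root is not None:
        yield from in_order(root.left)
        yield root.key
        yield from in_order(root.right)

root = None
for k in [5, 2, 8, 1, 9, 3]:
    root = insert(root, k)
print(list(in_order(root)))  # [1, 2, 3, 5, 8, 9]
```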

R/Python interoperability

RStudio is to be commended for developing the reticulate package, to serve as a bridge between Python and R. It's an outstanding effort, and works well for pure computation. But as far as I can tell, it does not solve the knotty problems that arise in Python, e.g. virtual environments and the like.
At present, I do not recommend writing mixed Python/R code.

Friday, May 24, 2019

#ControlCharts are becoming a common #visualization tool. Here is another example for my collection...

From ANADOT 

Friday, May 3, 2019

CMG #THEXCHANGE #VirtualConference is all about #scalability and #reliability

The next CMG #VirtualConference has just been announced. This next event is all about
#scalability and #reliability. No need to step out of the office, CMG is bringing the conference to
you. Don't miss out on the #THEXCHANGE virtual conference coming to an iPad or computer
near you on June 18, 2019. Reserve your “virtual seat” here: http://bit.ly/2UDj4IR #cmgnews
#THEXCHANGE





Wednesday, May 1, 2019

I will moderate #CMGExpert Roundtable on #AI - please join!

Please join us for our Roundtable on #AI. Participants will engage in a small group discussion on the topic led by moderator of . After the session, CMG will publish a piece recapping the session for the rest of the . Participants have the potential to connect with some of our influential , , and readers. Interested in attending CMG’s New Event Series? For more information and to register, click here: http://bit.ly/2DIZuBk

Friday, April 26, 2019

Under #cloud -y skies - #capacityPlanning in changing times ‐ Brian L Wong



Brian Wong is a Technology Fellow at Capital One and his talk was very much about the company’s transition to the cloud. Brian talked about how capacity planning is far from obsolete, but practicing it in a rapidly changing environment is entirely different from how it was practiced in the past. In his talk, Brian addressed containers, microservices, function-as-a-service, and fully managed services. He outlined the next-generation computing environment from an observability, capacity, and analytical perspective, and speculated on the form and value of capacity planning in these environments.

NB: I am proud to work with him now!

Thursday, March 7, 2019

My YouTube channel "iTrubin" got 2000 subscribers!! Subscribe!! UPDATE (2024): 5K subs!

Wednesday, January 16, 2019

My CMG IMPACT conference presentation is scheduled for Wednesday, 2/20/19, at 1:30pm - "Catching Anomaly and Normality in Cloud by Neural Net and Entropy Calculation"

See details here: https://cmgimpact.com/timetable/event/catching-anomalies-in-the-cloud/ 

UPDATE: That was a successful presentation; slides and video will be available later for download.
By Igor Trubin, Capital One Bank
Part 1. The Neural Network (NN) is not a new machine learning method. About 12 years ago I was involved as a Capacity Planning resource in a project to build the infrastructure (servers) to run NNs for a fraud detection application. NNs now get much more attention and popularity as a part of AI, mostly because computing power has increased dramatically and, accordingly, more tasks can be done with NNs.
The goal of the presentation is to demystify the technique with some simple terms and examples, to show what it actually is and how it could be used for Capacity and Demand management. That is done by developing R code to recognize typical workload patterns, such as OLTP, in the daily profiles of time-series performance data.
Part 2. Detecting anomalies in short-lived objects, or in objects with very few measurements, is a typical concern. Why? There could be many thousands of such objects, so it is important to single out the exceptional ones with anomalies for further investigation. They could be servers or customers that have just started being monitored, or public cloud objects (EC2s, ASGs) that usually have a very short lifespan. The suggested approach to detecting anomalous behavior in this type of object is to estimate the entropy of each object. If the entropy is low, everything is most likely in order. If not, there is possible disorder, and someone needs to check what is going on with the object. The method is implemented in a cloud-based application written in R that scans all cloud Auto Scaling Groups (ASGs) every hour to detect those that are imbalanced in terms of the number of EC2 instances in the group. That makes it possible to single out a couple of hundred ASGs out of hundreds of thousands.
This entropy-based method is well known and is described in detail in the following www.Trub.in blog post:
“Quantifying Imbalance in Computer Systems”, which is based on a CMG’12 paper.
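The entropy idea from Part 2 can be sketched in a few lines. The abstract does not say which distribution the entropy is taken over, so this sketch assumes one plausible reading: the Shannon entropy of the empirical distribution of an ASG's hourly instance counts, where a group that always runs the same number of instances scores 0 and an erratic group scores high. The function names, the ASG names, the sample data, and the cutoff of 1.0 bit are all hypothetical; this is not the production R application, and the imbalance measure in the referenced post may be defined differently.

```python
import math
from collections import Counter

def shannon_entropy(observations):
    """Shannon entropy (in bits) of the empirical distribution of observations."""
    counts = Counter(observations)
    total = len(observations)
    return sum((c / total) * math.log2(total / c) for c in counts.values())

def flag_disordered(groups, threshold=1.0):
    """Return names of groups whose entropy exceeds the (hypothetical) threshold."""
    return [name for name, obs in groups.items() if shannon_entropy(obs) > threshold]

# Hypothetical hourly EC2 instance counts for two Auto Scaling Groups
asgs = {
    "asg-steady": [4, 4, 4, 4, 4, 4, 4, 4],
    "asg-erratic": [1, 7, 2, 9, 3, 8, 1, 6],
}
for name, obs in asgs.items():
    print(name, shannon_entropy(obs))  # asg-steady 0.0, then asg-erratic 2.75
print(flag_disordered(asgs))  # ['asg-erratic']
```

Because entropy needs only a handful of observations and no long history, a measure of this kind suits short-lived objects; only the flagged groups then need a human look.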


Tuesday, December 11, 2018

CMG IMPACT conference discount #cmgnews

All -

I have a promotional code for you in case you want to attend. Just respond and I'll give you a good discount!

Thank you!

Igor Trubin 


804-4611905

Wednesday, November 21, 2018

Great Customer Feedback about the #CapacityManagement Service I Provide


I have just received some internal feedback that impressed me, so I decided to share it here as a good indication of the difference the Capacity Management service I provide makes for my customers:

..."Igor has helped us on a couple of instances, prevent potential disasters within our platform by the historical trends he’s tracking.  We know this since there was one occasion where our server crashed when we didn't react fast enough to Igor’s email warnings..."

Wednesday, November 14, 2018

AIXCHANGE (Virtual #CMGnews Event about #AI) November 27 @ 10:00 am - 3:00 pm EST

Attend this virtual conference all about artificial intelligence!
https://www.cmg.org/event/aixchange-virtual-event/
Conference rooms and office stand-ups are all buzzing. Artificial intelligence (AI) is emerging to represent one of the largest technological shifts across countless industries in recent history.
From driving manufacturing to supporting marketing to improving customer retention to improving ITops, AI has become the most in-demand technology of IT execs and is quickly rising to the top of the list for IT investment in 2019.
According to recent reports, the machine learning market alone is anticipated to grow from $1.4B in 2017 to $8.8B by 2022.
CMG wants to help you to navigate this technological shift and keep you the smartest team member at the table. On November 27th, join CMG and its partners for AIXCHANGE. The latest of CMG’s virtual conference program will feature live presentations from companies and individuals leading in the AI space.
 
Scheduled Sessions
  • 10:00 AM - The History and Future of AI with Bryan Krouse
  • 11:00 AM - The Machines are Talking - The Future of AI and Chat for the Enterprise with Stephen Mallik
  • 12:00 PM - A Practical Guide for Information Discovery Using Machine Learning and Visualization
  • 1:00 PM - To Be Announced!
  • 2:00 PM - Soon AI will Test Everything with Jason Arbon



Tuesday, November 13, 2018

"Implementation and Interpretation of Control Charts in R" - qcc package


I was pointed to a nice recent online publication: https://datascienceplus.com/implementation-and-interpretation-of-control-charts-in-r/ 

"...Control charts are used during the Control phase of DMAIC methodology. Control charts, also known as Shewhart charts or process-behavior charts, are a statistical process control tool used to determine if a manufacturing or business process is in a state of control. If analysis of the control chart indicates that the process is currently under control, then no corrections or changes to process control parameters are needed. Moreover, data from the method can be used to predict the future performance of the process. If the control chart indicates that the process is not in control, analysis of the chart can help determine the sources of variation, as this will result in degradation of process performance..."
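The basic control-chart logic described in the quote can be sketched as follows: estimate a center line and limits from a baseline that is believed to be in control, then flag new points outside the limits. This is a simplified illustration in standard-library Python, not the qcc package's method; it assumes limits at the mean plus or minus three sample standard deviations, whereas textbook Shewhart charts usually estimate sigma from subgroup ranges or moving ranges. The function names and data are hypothetical.

```python
import statistics

def control_limits(baseline, sigmas=3.0):
    """Return (lower, center, upper) limits from an in-control baseline sample."""
    center = statistics.fmean(baseline)
    spread = statistics.stdev(baseline)
    return center - sigmas * spread, center, center + sigmas * spread

def out_of_control(values, limits):
    """Return (index, value) pairs that fall outside the control limits."""
    lower, _, upper = limits
    return [(i, v) for i, v in enumerate(values) if v < lower or v > upper]

# Hypothetical measurements
baseline = [50, 52, 49, 51, 50, 48, 52, 50, 49, 51]
limits = control_limits(baseline)
new_points = [50, 53, 47, 61, 49]
print(out_of_control(new_points, limits))  # [(3, 61)]
```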




My comment:
When I was developing SEDS (Performance Anomaly Detection System) years ago, I looked at that package (and referenced it in my early CMG papers, BTW), even though I did not know how to write R programs then (now I can!)... They may have improved it since, but at that time I did not find a way to build MASF-type control charts with it. I have even dreamed of building a SETDS charts (IT-Control Charts) package in an open-source way. So my approach is similar but different; please read the details in my paper:
 https://www.researchgate.net/publication/259486289_IT-Control_Chart  

Friday, October 19, 2018

#CMGnews: The training course Performance Anomaly Detection is now free for CMG Members. Visit the Membership Benefits Page for Access.

Be a CMG member and get the following benefits:
 https://www.cmg.org/members/member-benefits/ 
  • Access to a repository of more than 1,000 research papers, presentations, and white papers from industry experts for the information you need for delivering improved results.
  • Free 1 hour webinars featuring industry experts and technology demos and recorded webinar access.
  • Free newsletters that feature the latest informative CMG news.
  • Quarterly issues of the CMG Journal
  • Discounted registration for CMG International-hosted events.
  • Access to CMG imPACt conference presentations and papers.
  • Access to our members-only videos channel with event videos, interviews and more.
  • Discounted registration for CMG’s international annual imPACt conference bringing together the best in the industry for four days of research, training, workshops, networking and practice sharing.
  • Access to our international Member Directory.
  • Access to CMG's Slack channel and LinkedIn to take part in performance and capacity discussions, hot topics, evaluations, information, networking, stay up to date with company news, and more!

Wednesday, October 17, 2018

#CMGnews: Registration is open for #IMPACT2019 conference! Check discounts!

  • IMPACT 2019 will be an action-packed, 3-day conference filled with information and collaboration. Register today for #IMPACT2019 and take advantage of $100 off your conference pass with code FOB2019:  https://cmgimpact.com/ #cmgnews 

  • Act today and take $100 off your registration fee with code “FOB2019”. To find out more about content, sessions, and activities and what makes CMG’s IMPACT the best #technology conference on the planet, click here:  https://cmgimpact.com/ #cmgnews #IMPACT2019

  • Join hundreds of #industryleaders for CMG's 44th International Conference! #IMPACT2019 promises to be an exciting conference with great learning and networking opportunities. Discount code (ASK ME!) available to save $100 off a conference pass. Act today and save!  https://cmgimpact.com/ #cmgnews 

  • IMPACT 2019 #Conference sessions will educate and enlighten, enabling attendees to take a #leadership role in their own companies’ #digitaltransformations. Register today for #IMPACT2019 and take advantage of $100 off your conference pass with “FOB2019”: https://cmgimpact.com/ #cmgnews 

Tuesday, October 16, 2018

www.CMG.org Announces Launch Of Dynamic New Brand And Communications Platform



Computer Measurement Group (www.CMG.org), one of the world's most influential organizations of IT professionals committed to digital transformation initiatives and best practices, is delighted to announce the launch of its new brand and communications platform, a platform built to showcase its measurement and management of computer systems and networks from a performance and capacity ...

Wednesday, October 10, 2018

#Opmantek product has implemented my SEDS method (MASF based #AnomalyDetection) and referenced my work positively

After I published my previous post

#Opmantek product has implemented my Statistical Exception Detection System (SEDS) - MASF based #AnomalyDetection

Opmantek's CTO reached out to me, confirmed they used my methodology, and provided very positive feedback:

"I am glad you contacted Opmantek about this blog, and we have updated it to include your name and a reference to one of the key blog articles.

I have been meaning to reach out to you to let you know about what we had done, technically the product is still in Beta but it seems the marketing team have pushed forward with making it generally available.

We found your work of great value as we looked through various methodologies for trending, in the end we implemented something based on SEDS with a few changes/additions, to be honest I would have to review the code to see what the differences are.

At the moment the product opTrend is working well enough, but we need to make some refinements and enhancements before it will be the first release.

Your research and publications are of great value and highly appreciated."

 
 

Friday, October 5, 2018

"#CapitalOne on #AWS" dedicated landing page

Thursday, October 4, 2018

#Opmantek product has implemented my Statistical Exception Detection System (SEDS) - MASF based #AnomalyDetection

It looks like one of the Opmantek products has implemented my dynamic thresholding method, SEDS. I found the reference to that in the following blog post on their site: 

System Automation Through Integration

"...These solutions were then complimented by the addition of opTrend, which expands on Opmantek’s already expansive thresholding and alerting system by implementing a highly flexible Statistical Exception Detection System (SEDS) that learns what’s normal behavior on the client’s network and adjusts thresholding dynamically based on historical usage for every hour of each day of the week..."

The description is limited, but apparently it is my SEDS method (MASF based Anomaly Detection) published in several white papers and blog posts.
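The behavior described in the quote, thresholds learned "for every hour of each day of the week", can be sketched as below. This is only an illustration of the general MASF-style idea, not the SEDS code or Opmantek's implementation: history is grouped into (weekday, hour) slots and each slot gets its own limits. The three-sigma limits, the function names, and the sample data are assumptions.

```python
import statistics
from collections import defaultdict

def build_baseline(history, sigmas=3.0):
    """Map each (weekday, hour) slot to (lower, upper) limits from past values.

    history: iterable of (weekday, hour, value) tuples (hypothetical layout).
    """
    slots = defaultdict(list)
    for weekday, hour, value in history:
        slots[(weekday, hour)].append(value)
    limits = {}
    for slot, values in slots.items():
        if len(values) < 2:
            continue  # not enough history to estimate the spread
        mean = statistics.fmean(values)
        sd = statistics.stdev(values)
        limits[slot] = (mean - sigmas * sd, mean + sigmas * sd)
    return limits

def is_exception(limits, weekday, hour, value):
    """True if value falls outside the limits learned for that hour of the week."""
    if (weekday, hour) not in limits:
        return False  # no baseline yet for this slot
    lower, upper = limits[(weekday, hour)]
    return value < lower or value > upper

# Hypothetical CPU utilization (%) for Monday 09:00 and Sunday 03:00 over four weeks
history = [(0, 9, v) for v in (70, 72, 68, 71)] + [(6, 3, v) for v in (10, 12, 9, 11)]
limits = build_baseline(history)
print(is_exception(limits, 0, 9, 73))  # False: normal for a Monday morning
print(is_exception(limits, 6, 3, 73))  # True: far above the Sunday-night baseline
```

The same reading of 73% is normal in one slot and an exception in another, which is the point of thresholds that follow the weekly profile instead of a single static limit.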

I am happy except there is no reference to my name, papers or at least this blog. 

Thursday, September 27, 2018

#CMGnews: My talk, "Catching #Anomaly and Normality in Cloud by #NeuralNet and Entropy Calculation", has been selected for #CMGimpact 2019

ABSTRACT:

Part 1. The Neural Network (NN) is not a new machine learning method. About 12 years ago I was involved as a Capacity Planning resource in a project to build the infrastructure (servers) to run NNs for a fraud detection application. NNs now get much more attention and popularity as a part of AI, mostly because computing power has increased dramatically and, accordingly, more tasks can be done with NNs.
The goal of the presentation is to demystify the technique with some simple terms and examples, to show what it actually is and how it could be used for Capacity and Demand management. That is done by developing R code to recognize typical workload patterns, such as OLTP, in the daily profiles of time-series performance data.
Part 2. Detecting anomalies in short-lived objects, or in objects with very few measurements, is a typical concern. Why? There could be many thousands of such objects, so it is important to single out the exceptional ones with anomalies for further investigation. They could be servers or customers that have just started being monitored, or public cloud objects (EC2s, ASGs) that usually have a very short lifespan. The suggested approach to detecting anomalous behavior in this type of object is to estimate the entropy of each object. If the entropy is low, everything is most likely in order. If not, there is possible disorder, and someone needs to check what is going on with the object. The method is implemented in a cloud-based application written in R that scans all cloud Auto Scaling Groups (ASGs) every hour to detect those that are imbalanced in terms of the number of EC2 instances in the group. That makes it possible to single out a couple of hundred ASGs out of hundreds of thousands.
This entropy-based method is well known and is described in detail in the post “Quantifying Imbalance in Computer Systems”, which is based on a CMG’12 paper.

Tuesday, August 28, 2018

Catching Anomaly and Normality in Cloud by Neural Net and Entropy Calculation - #CMGnews


- I have submitted my next CMG (https://cmgimpact.com/) presentation and am waiting for acceptance. BTW, I used some of my past CMG work and posts from this blog as material for the presentation.

E.g., you may check it out using the link below:

  Quantifying Imbalance in Computer Systems

UPDATE: The presentation is accepted. See abstract HERE.

Friday, July 20, 2018

#CMGnews - You can now register for the #imPACt Conference in Seattle, FEB 19-21, 2019