R YOU READY FOR PYTHON? Sunday 19th April, 2015
THIS IS NOT A PYTHON VS R TALK credits - https://meetmrholland.wordpress.com/2013/02/03/creative-5-tips-to-make-all-your-meetings-exactly-the-same/
WHO ARE WE?
Danilo Maurizio (@danailon), Advanced Analytics Division Manager* @Horsa: data must be attended to give them confidence
Gianluca Emireni (@gianlucaemireni), Advanced Analytics Senior Consultant* @Horsa: a life spent between good (statistics) and evil (IT)
"@horsa.it, ".join(["danilo.maurizio","gianluca.emireni",""])[:-2]
WHO ARE WE? :-)
THE MOST COMPLETE REFERENCE :-) http://blog.datacamp.com/how-to-speak-data-science/
Data scientist: Average programmer looking for a job that pays as much as what a top programmer would get. Sometimes also goes by the name "data analyst".
Statistician: Mathematician who can't program.
"Correlation does not imply causation": We looked at the wrong data set and can't draw any conclusions from it. Often represented in a graph to create the illusion of adding value.
Machine Learning: Statistical technique used by the sales and marketing department of big data vendors to secure their yearly bonus. (also see "Our company is big data ready")
Hadoop: Open-source software used for distributed computing. Data scientists seem to have a quota to drop the name every two sentences when talking big data, but most only know the logo is a yellow elephant.
"There is a significant effect but": Sentence-start used by data scientists or statisticians when they've put weeks of work into their analysis, the results look fishy and not as expected, and there is no time to redo the analysis.
spurious correlation - http://www.tylervigen.com/view_correlation?id=1703 ON TORTURING DATA
WE PREFER THIS: DRAWING INFERENCES FROM THE DATA
"Data scientist" is sometimes used as an excuse for ignorance, as in "I don't understand probability and all that stuff, but I don't need to because I'm a data scientist, not a statistician."
"Data science" could be a useful umbrella term for statistics, machine learning, decision theory, etc. Also, the title "data scientist" is rightfully associated with people who have better computational skills than statisticians typically have.
John D. Cook: http://www.johndcook.com/blog/2015/03/30/label-data-science/
WHAT WE DO
We draw inferences from data using:
_R
_Python
_legacy stats suites (SAS, SPSS, SAP predictive stack, STATA)
COMFORTABLE WITH OPEN SOURCE
project    estimated effort (COCOMO model)
python     $ 15 Mio.
pandas     $ 2.5 Mio.
R          $ 12 Mio.
RStudio    $ 6 Mio.
Spark      $ 5.3 Mio.
Source: www.ohloh.net
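The COCOMO model cited above turns code size into an effort estimate. A minimal sketch, assuming the Basic COCOMO variant with the textbook "organic" coefficients (a = 2.4, b = 1.05); the line count and yearly salary are hypothetical inputs, not the values behind the figures on this slide.

```python
def cocomo_effort_person_months(kloc, a=2.4, b=1.05):
    """Basic COCOMO effort for an 'organic' project: a * KLOC ** b."""
    return a * kloc ** b


def cocomo_cost(kloc, yearly_salary):
    """Convert the effort estimate into a cost, given a yearly salary."""
    person_years = cocomo_effort_person_months(kloc) / 12.0
    return person_years * yearly_salary


# Hypothetical inputs: 100 KLOC and 55,000 USD per person-year.
effort = cocomo_effort_person_months(100.0)
cost = cocomo_cost(100.0, yearly_salary=55_000)
print(f"effort: {effort:.0f} person-months, cost: {cost / 1e6:.2f} Mio. USD")
```

Because the exponent is above 1, the estimated effort grows slightly faster than the code base itself.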
More and more often we come back to the central question: how do we balance and mix the best of R and Python?
WHEN PYTHON We gradually moved all of our data management (ETL and data movement) to Python
AND WHEN R while being tied to R for statistical learning
PURSUING THE BEST BALANCE https://www.flickr.com/photos/alexthebarman/9311523633
RANDOM FORESTS THROUGH SCIKIT-LEARN ON REVERSE LOGISTICS
Fashion e-commerce has a huge problem with return rates - most of us have wives and credit cards :-)
Do purchase history and carts hold enough information to train a model? (we know that feature design/selection is the most time-consuming activity)
We tried hard and succeeded :-)
Most important features: fit index, cart entropy, past shopper attitude, transaction value, product quantity, ...
Now deployed in real time as a web service with millisecond response times using: flask, circus, scikit-learn
Used to dynamically set the shipping price
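To make the modeling step concrete, here is a minimal sketch of a scikit-learn random forest that outputs a return probability. Everything in it is an assumption for illustration: the data is synthetic, the three columns only borrow the names of features listed above, and the rule that generates the labels is made up.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 2000

# Synthetic stand-ins for three of the features named on the slide.
fit_index = rng.uniform(0, 1, n)
cart_entropy = rng.uniform(0, 2, n)
transaction_value = rng.gamma(2.0, 50.0, n)
X = np.column_stack([fit_index, cart_entropy, transaction_value])

# Made-up rule: poor fit and varied carts make a return more likely.
logit = -1.0 - 3.0 * fit_index + 1.5 * cart_entropy
y = (rng.uniform(0, 1, n) < 1.0 / (1.0 + np.exp(-logit))).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X_train, y_train)

# Column 1 of predict_proba is the probability of class 1 (returned).
return_probability = model.predict_proba(X_test)[:, 1]
print("held-out accuracy:", model.score(X_test, y_test))
names = ["fit_index", "cart_entropy", "transaction_value"]
for name, importance in zip(names, model.feature_importances_):
    print(name, round(importance, 3))
```

The `feature_importances_` attribute is one way to obtain a ranking like the "most important features" list, although the slide does not say how that ranking was produced.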
WHY PYTHON? We love the caret library, an R counterpart of scikit-learn, but serving this kind of model over the web is much easier on the Python stack. 150 lines of code are enough to deliver a JSON document with a shopping bag's return probability.
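A sketch of what such a JSON-serving endpoint could look like with flask. The route name, the payload fields and the `predict_return_probability` helper are hypothetical; the helper is a hand-written stand-in for a trained model, and the process manager (circus) used in production is not shown.

```python
import math

from flask import Flask, jsonify, request

app = Flask(__name__)


def predict_return_probability(features):
    """Hypothetical stand-in for a trained model's predict_proba call."""
    score = -1.0 - 3.0 * features["fit_index"] + 1.5 * features["cart_entropy"]
    return 1.0 / (1.0 + math.exp(-score))


@app.route("/return-probability", methods=["POST"])
def return_probability():
    # The shopping bag arrives as a JSON document in the request body.
    bag = request.get_json()
    probability = predict_return_probability(bag)
    return jsonify({"return_probability": probability})


# Exercise the endpoint in-process, without starting a server.
client = app.test_client()
response = client.post(
    "/return-probability", json={"fit_index": 0.8, "cart_entropy": 1.2}
)
print(response.get_json())
```

Using the test client keeps the sketch runnable as a plain script; a real deployment would load the fitted model once at startup and run the app behind a proper server.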
WHY R? In some contexts, CRAN (the R package repository) offers very mature packages that cover a whole class of statistical problems. For example, time series forecasting and statistical matching (causal inference) are the kinds of problems where R outperforms Python in completeness, documentation, depth, ... The forecast library (Rob J. Hyndman) is the best-in-class package for time series modeling, not only for its code quality but also for its theoretical and methodological support. The MatchIt library and the cem library (Iacus, King, Porro) offer a full range of techniques to perform statistical matching.
flickr CC - https://www.flickr.com/photos/hanskool/16581407815/ BETTER TOGETHER
WHY TOGETHER? The wide range of problems we face leads us to use both languages together: sometimes Python leads the analysis, other times R does. R and Python glued together: http://www.programmingr.com/content/calling-python-r-rpython/ https://sites.google.com/site/aslugsguidetopython/data-analysis/pandas/calling-r-frompython
flickr CC - https://www.flickr.com/photos/ryanready/5236405051 WE ARE IN GOOD COMPANY
DO YOU KNOW STACKOVERFLOW? How many times, googling for help, have you been led to stackoverflow? Did you notice that R users and pydata users meet in a huge number of threads?
STACKOVERFLOW - #PYDATA
Disclaimer: not done with matplotlib YET ANOTHER TAG CLOUD #1/2
STACKOVERFLOW - #RSTATS
Disclaimer: not done with ggplot2 YET ANOTHER TAG CLOUD #2/2
QUESTIONS How large is the group of stackoverflow users active in both Python and R Q&A? Is there any difference between the behavior of polyglots and that of their purist colleagues? Are they, in the end, any smarter?
DATA CAPTURE Thanks to the StackExchange data explorer* we designed a set of queries to harvest information about questions and answers carrying tags associated with the Python data stack (scipy, numpy, pandas, scikit-learn) and with R (r, rstat, r-faq). On top of this data we built a user registry, labeling each user according to their activity (or inactivity) in these two domains. (*) https://data.stackexchange.com/stackoverflow/query/new
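The labeling step can be sketched as follows. This is an illustration under assumptions, not the original code: the `(user_id, tag)` rows are hypothetical, and the three labels follow the group names used in the rest of the deck.

```python
PYDATA_TAGS = {"scipy", "numpy", "pandas", "scikit-learn"}
R_TAGS = {"r", "rstat", "r-faq"}


def label_users(pydata_users, r_users):
    """Map each user id to 'polyglot', 'Pythonista' or 'useR'."""
    registry = {}
    for user_id in pydata_users | r_users:
        if user_id in pydata_users and user_id in r_users:
            registry[user_id] = "polyglot"
        elif user_id in pydata_users:
            registry[user_id] = "Pythonista"
        else:
            registry[user_id] = "useR"
    return registry


# Hypothetical rows: one (user_id, tag) pair per question or answer.
posts = [(1, "pandas"), (1, "r"), (2, "numpy"), (2, "scipy"), (3, "r"), (3, "r-faq")]

pydata_users = {user for user, tag in posts if tag in PYDATA_TAGS}
r_users = {user for user, tag in posts if tag in R_TAGS}
registry = label_users(pydata_users, r_users)
print(registry)
```

Building the two sets first keeps the labeling rule independent of how the rows were harvested.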
A STEADY GROWTH
GROUPS OVERLAP useRs Pythonistas
POLYGLOTS Stackoverflow users you can find in both groups
AND THEIR ACTIVITY useRs Pythonistas Q&A volumes
AVERAGE INTERACTIONS PER USER
WHO OWNS THE KNOWLEDGE? Pareto point of view for Pythonistas
WHO OWNS THE KNOWLEDGE? Pareto point of view for Pythonistas
_Pythonistas are users active on stackoverflow for #pydata related #tags
_80% of all answers are given by 2,000 (core) users
_more than 10,000 users have never answered a question
WHO OWNS THE KNOWLEDGE? Pareto point of view for useRs
WHO OWNS THE KNOWLEDGE? Pareto point of view for useRs
_useRs are users active on stackoverflow for #rstats related #tags
_80% of all answers are given by fewer than 1,000 (core) users
_more than 20,000 users have never answered a question
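The Pareto figures on these slides boil down to one computation: sort users by number of answers and count how many are needed to reach 80% of the total. A minimal sketch, with hypothetical answer counts and a hypothetical helper name:

```python
def core_users_for_share(answer_counts, share=0.8):
    """Smallest number of top answerers whose answers reach `share` of the total."""
    total = sum(answer_counts)
    if total == 0:
        return 0
    running = 0
    ranked = sorted(answer_counts, reverse=True)
    for rank, count in enumerate(ranked, start=1):
        running += count
        if running >= share * total:
            return rank
    return len(ranked)


# Hypothetical answers per user; zeros are users who never answered.
answer_counts = [60, 25, 8, 4, 2, 1, 0, 0, 0]
core = core_users_for_share(answer_counts)
silent = sum(1 for count in answer_counts if count == 0)
print(f"{core} core users give 80% of the answers; {silent} users never answered")
```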
ARE POLYGLOTS PROBLEM SOLVERS?
We tried to explain the probability of a question being successfully closed with an accepted answer, using these regressors:
_year and month of the question
_view counts, score, favorites and comments summed by post
_number of answers given by polyglots and number of answers given by Pythonistas/useRs
_plus some tags used as dummies
The main question is: are polyglots smarter than R and Python purists?
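The slide does not name the model; a logistic regression is one natural choice for a yes/no outcome such as "closed with an accepted answer". The sketch below assumes that choice and fits it with plain Newton-Raphson in NumPy. The data is synthetic, only two of the regressors are included, and the "true" effects are made up to generate the toy data.

```python
import numpy as np


def fit_logit(X, y, n_iter=25):
    """Fit a logistic regression by Newton-Raphson; return coefficients and z-scores."""
    X = np.column_stack([np.ones(len(X)), X])  # prepend an intercept column
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-X @ beta))
        w = p * (1.0 - p)
        hessian = X.T @ (X * w[:, None])
        gradient = X.T @ (y - p)
        beta = beta + np.linalg.solve(hessian, gradient)
    # Standard errors come from the inverse of the observed information matrix.
    p = 1.0 / (1.0 + np.exp(-X @ beta))
    w = p * (1.0 - p)
    covariance = np.linalg.inv(X.T @ (X * w[:, None]))
    std_err = np.sqrt(np.diag(covariance))
    return beta, beta / std_err


rng = np.random.default_rng(42)
n = 5000
# Hypothetical regressors: answers by polyglots and by purists, per question.
polyglot_answers = rng.poisson(0.5, n)
purist_answers = rng.poisson(1.5, n)
# Made-up effects, used only to generate the toy outcome.
true_logit = -0.2 + 0.8 * polyglot_answers - 0.3 * purist_answers
accepted = (rng.uniform(size=n) < 1.0 / (1.0 + np.exp(-true_logit))).astype(float)

X = np.column_stack([polyglot_answers, purist_answers])
beta, z = fit_logit(X, accepted)
for name, b, zval in zip(["intercept", "polyglot_answers", "purist_answers"], beta, z):
    print(f"{name:18s} coef={b:+.3f} z={zval:+.2f}")
```

A coefficient whose z-score is well above roughly 2 in absolute value is what "statistically significantly different from zero" means in this setting.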
POLYGLOTS VS PYTHONISTAS While answers from Pythonistas tend to reduce the probability of a post being closed, the effect of polyglots' contributions is positive and statistically significantly greater than zero.
POLYGLOTS VS useRs R-based posts also benefit from polyglots' contributions, which show a positive effect on the probability of a post being successfully closed.
TOWARDS A LESS TAUTOLOGICAL QUESTION flickr cc -https://www.flickr.com/photos/bensonk42/5227502114/
HOW LONG DOES IT TAKE TO ANSWER A QUESTION? How does the presence of Pythonistas, useRs and polyglots affect a question's lifetime?
YOUR #PYDATA QUESTIONS WILL BE ANSWERED IN 200* MINUTES OR NEVER Questions are mostly solved right away: 50% are closed within 40 minutes. *3rd quartile
PYTHON QUESTIONS LIFETIME The presence of polyglots helps reduce question resolution time. Kaplan-Meier survival estimates
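For readers new to Kaplan-Meier estimates: the curve gives the share of questions still open after t minutes, and questions that never got an answer are treated as right-censored instead of being dropped. A minimal pure-Python sketch with hypothetical lifetimes (the function name and data are illustrative):

```python
def kaplan_meier(durations, observed):
    """Return [(time, survival)] at each time where at least one event occurs.

    durations: minutes until the question was answered or observation ended
    observed:  True if answered, False if still open (right-censored)
    """
    at_risk = len(durations)
    survival = 1.0
    curve = []
    for t in sorted(set(durations)):
        events = sum(1 for d, o in zip(durations, observed) if d == t and o)
        leaving = sum(1 for d in durations if d == t)
        if events:
            survival *= 1.0 - events / at_risk
            curve.append((t, survival))
        # Answered and censored questions both leave the risk set after t.
        at_risk -= leaving
    return curve


# Hypothetical question lifetimes in minutes; False marks unanswered ones.
durations = [5, 12, 12, 40, 90, 200, 200, 600]
observed = [True, True, True, True, False, True, False, False]

curve = kaplan_meier(durations, observed)
for t, s in curve:
    print(f"{t:4d} min  S(t) = {s:.3f}")
```

Comparing two such curves, for instance questions with and without a polyglot among the answerers, is how a plot like this one shows which group is resolved faster.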
YOUR #RSTATS QUESTIONS WILL BE ANSWERED IN 136* MINUTES OR NEVER Questions are mostly solved right away: 50% are closed within 32 minutes. *3rd quartile
R QUESTIONS LIFETIME The presence of useRs and polyglots ensures the lowest response time. Kaplan-Meier survival estimates
PYTAGS QUESTIONS LIFETIME 1/3
PYTAGS QUESTIONS LIFETIME 2/3
PYTAGS QUESTIONS LIFETIME 3/3
from greeting import thankyou