R YOU READY FOR PYTHON? Sunday 19th April, 2015

THIS IS NOT A PYTHON VS R TALK credits - https://meetmrholland.wordpress.com/2013/02/03/creative-5-tips-to-make-all-your-meetings-exactly-the-same/

WHO ARE WE?
Danilo Maurizio (@danailon), Advanced Analytics Division Manager* @Horsa
data must be attended to give them confidence
Gianluca Emireni (@gianlucaemireni), Advanced Analytics Senior Consultant* @Horsa
a life spent between good (statistics) and evil (IT)
"@horsa.it, ".join(["danilo.maurizio","gianluca.emireni",""])[:-2]

WHO ARE WE? :-)

THE MOST COMPLETE REFERENCE :-) http://blog.datacamp.com/how-to-speak-data-science/
Data scientist: Average programmer looking for a job that pays as much as what a top programmer would get. Sometimes also goes by the name "data analyst".
Statistician: Mathematician who can't program.
"Correlation does not imply causation": We looked at the wrong data set and can't draw any conclusions from it. Often represented in a graph to create the illusion of adding value.
Machine Learning: Statistical technique used by the sales and marketing department of big data vendors to secure their yearly bonus. (Also see "Our company is big data ready".)
Hadoop: Open-source software used for distributed computing. Data scientists seem to have a quota to drop the name every two sentences when talking big data, but most only know the logo is a yellow elephant.
"There is a significant effect but...": Sentence-start used by data scientists or statisticians when they've put weeks of work into their analysis, the results look fishy and not as expected, and there is no time to redo the analysis.

spurious correlation - http://www.tylervigen.com/view_correlation?id=1703 ON TORTURING DATA

WE PREFER THIS: DRAWING INFERENCES FROM THE DATA
"Data scientist" is sometimes used as an excuse for ignorance, as in "I don't understand probability and all that stuff, but I don't need to because I'm a data scientist, not a statistician."
"Data science" could be a useful umbrella term for statistics, machine learning, decision theory, etc. Also, the title "data scientist" is rightfully associated with people who have better computational skills than statisticians typically have.
John D. Cook: http://www.johndcook.com/blog/2015/03/30/label-data-science/

WHAT WE DO
We draw inferences from data using R, Python and legacy stats suites (SAS, SPSS, SAP predictive stack, STATA).

COMFORTABLE WITH OPEN SOURCE
project    estimated effort (COCOMO model)
python     $ 15 Mio.
pandas     $ 2.5 Mio.
R          $ 12 Mio.
RStudio    $ 6 Mio.
Spark      $ 5.3 Mio.
Source: www.ohloh.net

More and more often we come back to the central question: how do we balance and mix the best of R and Python?

WHEN PYTHON
We have gradually moved all of our data management (ETL and data movement) to Python...

AND WHEN R
...while remaining tied to R for statistical learning.

PURSUING THE BEST BALANCE https://www.flickr.com/photos/alexthebarman/9311523633

RANDOM FORESTS THROUGH SCIKIT-LEARN ON REVERSE LOGISTICS
Fashion e-commerce has a huge problem with return rates - most of us have wives and credit cards :-)
Do purchase history and carts hold enough information to train a model? (We know that feature design/selection is the most time-consuming activity.)
We tried hard and succeeded :-)
Most important features: fit index, cart entropy, past shoppers' attitude, transaction value, product quantity, ...
Now deployed in real time as a web service with millisecond response times, using flask, circus and scikit-learn.
Used to set the shipping price dynamically.
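The slides do not define "cart entropy". One plausible reading, assumed here purely for illustration, is the Shannon entropy of the labels (sizes, categories, ...) of the items in a cart; the function name `cart_entropy` and the sample carts are hypothetical.

```python
import math
from collections import Counter


def cart_entropy(item_labels):
    """Shannon entropy (in bits) of the label distribution inside one cart.

    ASSUMPTION: this is only one possible definition of the "cart entropy"
    feature named in the slides, not the authors' confirmed formula.
    """
    counts = Counter(item_labels)
    total = sum(counts.values())
    if total == 0:
        return 0.0  # an empty cart carries no information
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


# A cart with the same label twice has zero entropy;
# a cart with four different labels has the maximum for four items.
same_label = cart_entropy(["M", "M"])
all_different = cart_entropy(["S", "M", "L", "XL"])
```

Under this reading, a cart holding several sizes of the same garment scores high, which is at least consistent with the intuition that such carts are likely to be partly returned.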

WHY PYTHON?
We love the caret library, an R counterpart of scikit-learn, but serving this kind of model over the web is much easier on the Python stack. 150 lines of code are enough to deliver a JSON document with a shopping bag's return probability.
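To give a feel for how little code such a service needs, here is a minimal sketch, not the authors' implementation. The feature names in `FEATURES`, the route `/return-probability` and the synthetic training data are all invented for illustration; process management with circus is left out.

```python
import numpy as np
from flask import Flask, jsonify, request
from sklearn.ensemble import RandomForestClassifier

# ASSUMPTION: hypothetical feature names, loosely inspired by the slide.
FEATURES = ["fit_index", "cart_entropy", "transaction_value", "product_quantity"]

# Synthetic training data standing in for historical carts (1 = returned).
rng = np.random.default_rng(0)
X_train = rng.random((200, len(FEATURES)))
y_train = (X_train[:, 1] + 0.1 * rng.standard_normal(200) > 0.5).astype(int)

model = RandomForestClassifier(n_estimators=50, random_state=0)
model.fit(X_train, y_train)

app = Flask(__name__)


@app.route("/return-probability", methods=["POST"])
def return_probability():
    """Read one cart as JSON and answer with its estimated return probability."""
    cart = request.get_json()
    row = [[float(cart[name]) for name in FEATURES]]
    # predict_proba has one column per class; column 1 is the "returned" class.
    proba = float(model.predict_proba(row)[0, 1])
    return jsonify({"return_probability": proba})
```

In a real deployment the model would be trained offline and loaded at startup, so that each request only pays for a single prediction.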

WHY R?
In some contexts, CRAN (the R package repository) offers very mature packages that cover a whole class of statistical problems.
For example, time series forecasting and statistical matching (causal inference) are problems where R outperforms Python in completeness, documentation, depth, ...
The forecast library (Rob J. Hyndman) is the best-in-class package for time series modeling, not only for its code quality but also for its theoretical and methodological support.
The MatchIt (Ho, Imai, King, Stuart) and cem (Iacus, King, Porro) libraries offer a full range of techniques for statistical matching.

flickr CC - https://www.flickr.com/photos/hanskool/16581407815/ BETTER TOGETHER

WHY TOGETHER?
The wide range of problems we are asked to tackle lets us use both languages together: sometimes Python leads the analysis, other times R does.
R and Python glued together:
http://www.programmingr.com/content/calling-python-r-rpython/
https://sites.google.com/site/aslugsguidetopython/data-analysis/pandas/calling-r-frompython
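A low-tech way to glue the two languages, offered only as a sketch and different from the bridges described in the links above, is to shell out from Python to the `Rscript` command-line tool that ships with R. The helper names `rscript_command` and `run_r` are hypothetical.

```python
import shutil
import subprocess


def rscript_command(expression):
    """Build the argument list asking Rscript to evaluate one R expression."""
    return ["Rscript", "-e", expression]


def run_r(expression):
    """Evaluate an R expression in a separate R process and return its stdout."""
    if shutil.which("Rscript") is None:
        raise RuntimeError("Rscript not found on PATH; is R installed?")
    result = subprocess.run(
        rscript_command(expression),
        capture_output=True,
        text=True,
        check=True,  # raise if R exits with an error
    )
    return result.stdout
```

With R installed, `run_r("cat(mean(c(1, 2, 3)))")` returns whatever R prints for that expression. Exchanging larger data sets this way usually means writing intermediate files, which is where in-process bridges become more convenient.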

flickr CC - https://www.flickr.com/photos/ryanready/5236405051 WE ARE IN GOOD COMPANY

DO YOU KNOW STACKOVERFLOW?
How many times has googling for help led you to Stack Overflow? Did you notice that R users and PyData users meet in a huge number of threads?

STACKOVERFLOW - #PYDATA

Disclaimer: not done with matplotlib YET ANOTHER TAG CLOUD #1/2

STACKOVERFLOW - #RSTATS

Disclaimer: not done with ggplot2 YET ANOTHER TAG CLOUD #2/2

QUESTIONS
How large is the group of Stack Overflow users active in both Python and R Q&A?
Is there any difference between the behavior of polyglots and that of their purist colleagues?
Are they, in the end, any smarter?

DATA CAPTURE
Thanks to the StackExchange Data Explorer (*) we wrote a set of queries to harvest information about questions and answers carrying tags associated with the Python data stack (scipy, numpy, pandas, scikit-learn) and with R (r, rstat, r-faq). On top of this data we built a user registry, labeling each user according to their joint (in)activity in these two domains.
(*) https://data.stackexchange.com/stackoverflow/query/new
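The labeling step can be sketched as follows. This is an illustration under assumptions, not the authors' code: the input format (pairs of user id and tag list) and the function name `label_users` are hypothetical, while the tag sets are the ones listed on the slide.

```python
# Tag sets as listed on the slide.
PYDATA_TAGS = {"scipy", "numpy", "pandas", "scikit-learn"}
R_TAGS = {"r", "rstat", "r-faq"}


def label_users(posts):
    """Label each user by the tag families they were active in.

    posts: iterable of (user_id, tags) pairs, one per question or answer
           (ASSUMPTION: a simplified stand-in for the harvested query results).
    Users active in neither family are left out of the registry.
    """
    py_users, r_users = set(), set()
    for user_id, tags in posts:
        tags = set(tags)
        if tags & PYDATA_TAGS:
            py_users.add(user_id)
        if tags & R_TAGS:
            r_users.add(user_id)

    labels = {}
    for user_id in py_users | r_users:
        if user_id in py_users and user_id in r_users:
            labels[user_id] = "polyglot"
        elif user_id in py_users:
            labels[user_id] = "Pythonista"
        else:
            labels[user_id] = "useR"
    return labels


# Made-up sample activity.
sample_posts = [
    (1, ["pandas", "python"]),
    (1, ["r", "ggplot2"]),
    (2, ["numpy"]),
    (3, ["r-faq"]),
    (4, ["java"]),
]
registry = label_users(sample_posts)
```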

A STEADY GROWTH

GROUPS OVERLAP
(figure: overlap between the useRs and Pythonistas groups)

POLYGLOTS Stackoverflow users you can find in both groups

AND THEIR ACTIVITY
(figure: Q&A volumes for useRs and Pythonistas)

AVERAGE INTERACTIONS PER USER

WHO OWNS THE KNOWLEDGE?
Pareto point of view for Pythonistas
_Pythonistas are users active on Stack Overflow for #pydata related tags
_80% of all answers are given by 2,000 (core) users
_more than 10,000 users have never answered a question

WHO OWNS THE KNOWLEDGE?
Pareto point of view for useRs
_useRs are users active on Stack Overflow for #rstats related tags
_80% of all answers are given by fewer than 1,000 (core) users
_more than 20,000 users have never answered a question
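Figures like the ones in the two Pareto views can be derived from a per-user answer count. A minimal sketch, assuming a hypothetical dict that maps user id to number of answers; the counts below are made up and the function name is illustrative.

```python
def users_covering_share(answers_per_user, share=0.8):
    """Smallest number of top answerers whose answers reach `share` of the total."""
    counts = sorted(answers_per_user.values(), reverse=True)
    total = sum(counts)
    if total == 0:
        return 0
    running = 0
    for n_users, count in enumerate(counts, start=1):
        running += count
        if running >= share * total:
            return n_users
    return len(counts)


# Made-up answer counts for five users.
answers = {"a": 60, "b": 25, "c": 10, "d": 5, "e": 0}
core_users = users_covering_share(answers)               # users needed for 80%
never_answered = sum(1 for c in answers.values() if c == 0)
```

Sorting in descending order first is what makes this a Pareto view: the heaviest contributors are counted before anyone else.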

POLYGLOTS ARE PROBLEM SOLVERS?
We tried to explain the probability of a question being successfully closed with an accepted answer, using some regressors:
_year and month of the question
_view counts, score, favorites and comments summed by post
_number of answers given by polyglots and number of answers given by Pythonistas/useRs
_plus some tags used as dummies
The main question is: are polyglots smarter than R and Python purists?
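The slides do not name the model. A logistic regression is a common choice for a binary outcome of this kind and is assumed here only to make the setup concrete; the symbols are illustrative, with z_i collecting the remaining regressors (year, month, view counts, score, favorites, comments, tag dummies).

```latex
\operatorname{logit} P(\text{accepted}_i = 1)
  = \beta_0
  + \beta_1 \,\text{polyglot\_answers}_i
  + \beta_2 \,\text{purist\_answers}_i
  + \boldsymbol{\gamma}^{\top} \mathbf{z}_i
```

Under this assumed specification, "polyglots are problem solvers" translates into the hypothesis that beta_1 is greater than zero.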

POLYGLOTS VS PYTHONISTAS
While answers from Pythonistas tend to reduce the probability of a post being closed, the effect of polyglots' contributions is positive and statistically significantly greater than zero.

POLYGLOTS VS USERS
R-based posts also benefit from polyglots' interventions, which show a positive effect on the probability of a post being successfully closed.

TOWARDS A LESS TAUTOLOGICAL QUESTION flickr CC - https://www.flickr.com/photos/bensonk42/5227502114/

HOW LONG DOES IT TAKE TO ANSWER A QUESTION?
How does the presence of Pythonistas, useRs and polyglots affect a question's lifetime?

YOUR #PYDATA QUESTIONS WILL BE ANSWERED IN 200* MINUTES OR NEVER
Most questions are solved almost immediately: 50% are closed within 40 minutes.
(*) 3rd quartile

PYTHON QUESTIONS LIFETIME
The presence of polyglots helps reduce question resolution time.
(figure: Kaplan-Meier survival estimates)
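For readers unfamiliar with the Kaplan-Meier estimator behind these curves, here is a minimal pure-Python sketch. It assumes each question is described by a duration in minutes and a flag saying whether an accepted answer was observed (questions never resolved are treated as censored); the sample data is made up.

```python
from collections import Counter


def kaplan_meier(durations, observed):
    """Kaplan-Meier estimate of the share of questions still unresolved.

    durations: minutes until the accepted answer, or until observation ended
    observed:  True if an accepted answer arrived, False if censored
    Returns a list of (time, survival) pairs, one per distinct event time.
    """
    events = Counter(t for t, seen in zip(durations, observed) if seen)
    leaving = Counter(durations)  # everyone who exits the risk set at time t
    at_risk = len(durations)
    survival = 1.0
    curve = []
    for t in sorted(leaving):
        resolved = events.get(t, 0)
        if resolved:
            survival *= 1.0 - resolved / at_risk
            curve.append((t, survival))
        at_risk -= leaving[t]
    return curve


# Made-up sample: five questions, two of them never resolved (censored).
curve = kaplan_meier([10, 20, 20, 40, 60], [True, True, False, True, False])
```

Censored questions still count in the risk set until they drop out, which is why the estimator fits the "answered in N minutes or never" framing better than a plain average of answer times.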

YOUR #RSTATS QUESTIONS WILL BE ANSWERED IN 136* MINUTES OR NEVER
Most questions are solved almost immediately: 50% are closed within 32 minutes.
(*) 3rd quartile

R QUESTIONS LIFETIME
The presence of both useRs and polyglots ensures the lowest response time.
(figure: Kaplan-Meier survival estimates)

PYTAGS QUESTIONS LIFETIME 1/3

PYTAGS QUESTIONS LIFETIME 2/3

PYTAGS QUESTIONS LIFETIME 3/3

from greeting import thankyou