Journée Thématique Big Data 13/03/2015

Shannon Hensley
8 years ago

1 Journée Thématique Big Data, 13/03/2015

2 Agenda
- About Flaminem
- What Do We Want To Predict?
- What Is The Machine Learning Theory Behind It?
- How Does It Work In Practice?
- What Is Happening When Data Gets Big? And How Big is Big?
- MapReduce And Spark Explained With Card Games


3 We Collect Browsing Data Way Beyond Our Clients' Properties And Analyze Online Behaviors to Predict Intent
Some figures in France:
- We track the online journeys of 150 million cookies (half on mobile, the other half on tablet/PC) across 1 million websites in France (via HTTP cookies with opt-out).
- We detect, for each cookie, about 5 to 30 visits per active day.
- We analyze the context of these visits with our automated classification engine (based on semantic analysis of page content).
[Diagram labels: Our clients' properties, Internet]


4 We Find the Best Moments to Engage a Conversation With Their Customers and Prospective Customers
- What we predict: complex and rare decisions that take months to consider, evaluate and decide. Examples: car or real-estate purchase, subscriptions to insurance or credit.
- Not in our focus: impulse purchases. Examples: e-commerce, travel.
[Diagram: level of intent along the customer decision journey, with the stages Consideration, Intention and Purchase Decision]


5 Our Solutions Are Based On Modelling And Predicting Online Behaviors
- Data for Client Acquisition: upper-funnel and mid-funnel targeting segments for RTB; paid search optimization data
- Leads Scoring For Call-Centers: lead scoring for use in call-centers to optimize lead conversions
- Clients Scoring: churn scoring; cross-sell and up-sell scoring
- Price Sensitivity Scoring: scores to assess prospects' / clients' price sensitivity to improve product pricing
- R&D


6 What Do We Want To Predict?
- We predict when a user fills in an online form: client conversion.
- We represent the conversion with a binary variable $y \in \{0, 1\}$. In the general case we call $y$ the outputs (or responses).
- To predict conversion, we learn on features (e.g., browsing history).
- We represent these input variables (called features or predictors) with a d-dimensional vector $x_i \in \mathbb{R}^d$.
- For each example, the goal is to use the inputs to predict the value of the outputs. We call this exercise Supervised Learning.

7 What Is The Machine Learning Theory Behind It?
We consider the outputs $y_i = f(x_i)$.
We want to minimize the risk $\mathbb{E}_{P_{X,Y}}\left[ l(y, f(x)) \right]$, where $l$ is the loss function.
In practice, we minimize the empirical risk (the expectation is approximated by averaging over the sample data):
$$\frac{1}{n} \sum_{i=1}^{n} l(y_i, f(x_i))$$
After adding the regularization term, the goal is to solve the problem:
$$\min_f \left[ \underbrace{\frac{1}{n} \sum_{i=1}^{n} l(y_i, f(x_i))}_{\text{loss function}} + \underbrace{\lambda \, \Omega(f)}_{\text{regularization}} \right]$$
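To make the regularized objective concrete, here is a minimal sketch in plain Python. It assumes a logistic loss for $l$ and a squared L2 norm for $\Omega$, choices the slide itself does not fix; the function names and the toy data are hypothetical.

```python
import math

def log_loss(y, p):
    """Logistic loss l(y, f(x)) for a label y in {0, 1} and a predicted probability p."""
    eps = 1e-12  # keep p away from 0 and 1 so that log() stays finite
    p = min(max(p, eps), 1.0 - eps)
    return -(y * math.log(p) + (1 - y) * math.log(1.0 - p))

def predict(w, x):
    """f(x): sigmoid of the dot product between the weights w and the features x."""
    z = sum(wj * xj for wj, xj in zip(w, x))
    return 1.0 / (1.0 + math.exp(-z))

def regularized_risk(w, xs, ys, lam):
    """Empirical risk (mean loss over the sample) plus lam * Omega(f).

    Assumption: Omega(f) is the squared L2 norm of the weights.
    """
    n = len(xs)
    empirical = sum(log_loss(y, predict(w, x)) for x, y in zip(xs, ys)) / n
    penalty = lam * sum(wj * wj for wj in w)
    return empirical + penalty

# Toy sample: three examples with two features each (made-up values).
xs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
ys = [1, 0, 1]

zero_risk = regularized_risk([0.0, 0.0], xs, ys, lam=0.01)
better_risk = regularized_risk([2.0, -2.0], xs, ys, lam=0.01)
```

With all weights at zero, every prediction is 0.5, so the objective equals ln 2; weights that separate the first two examples lower it despite the penalty.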


8 How Does It Work In Practice?
- Data Ingestion: batch update; quality control; jobs automation and monitoring
- Data Processing (features engineering): events-to-features building; features aggregation; features enrichment with temporal dimension; pattern analysis; data formatting
- Machine Learning: model fitting; hybrid optimization (1st epoch with SGD, then high-precision convergence with L-BFGS)
What tools are available when dealing with small data?

9 What Is Happening When Data Gets Big? ... By The Way, How Big is Big?
- Data Ingestion: 10+ GB ingested daily
- Data Processing: queries (join, sort, ...) on very large tables, 300+ GB
- Machine Learning: 10M-line sub-sampled dataset; 30,000 features; very sparse features (e.g., 60 active features per line)
How do traditional tools behave with such constraints?
- Needs distributed storage
- Too large to fit in RAM
- Too slow
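With 60 active features out of 30,000, only about 0.2% of each line is non-zero, so it pays to store just the active entries. The sketch below illustrates the idea in plain Python; the dictionary layout, the function name and the values are assumptions for illustration, not the format used in the talk.

```python
def sparse_dot(weights, row):
    """Dot product between a dense weight list and a sparse row {feature_index: value}.

    Only the active features are visited, so the cost depends on the number
    of non-zero entries (about 60 per line here), not on the 30,000 columns.
    """
    return sum(weights[j] * v for j, v in row.items())

n_features = 30_000
weights = [0.0] * n_features
weights[7] = 0.5
weights[29_999] = -1.0

# One line of the dataset with three active features (made-up values).
row = {7: 2.0, 1234: 1.0, 29_999: 3.0}

score = sparse_dot(weights, row)          # 0.5*2.0 + 0.0*1.0 + (-1.0)*3.0
density = len(row) / n_features           # fraction of non-zero features
```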

10 Our Tools
- Data Ingestion
- Data Processing
- Machine Learning: Machine Learning at Scale + Interactive Analytics; Exploration
[Tool logos not recoverable]

11 Let's play a game!

12 Let's play a game! Method 1: Use the RAM

13 Let's play a game!

14 Let's play a game! Method 2: Sort the deck

15 Let's play a game! Method 3: Use MapReduce
Steps: Split, Map, Partition, Shuffle, Sort, Reduce
[Figure: playing cards (values 2 to 10) moving through the six steps]
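The MapReduce steps can be mimicked on a single machine in plain Python. The rules of the card game are not recoverable from the slide, so this sketch assumes a simple task, counting the cards of each suit; the function names are hypothetical, and the partition, shuffle and sort steps are folded into one function for brevity.

```python
from collections import defaultdict
from itertools import product

def split(deck, n_chunks):
    """Split: deal the deck into n_chunks piles, one per (imaginary) worker."""
    return [deck[i::n_chunks] for i in range(n_chunks)]

def map_phase(chunk):
    """Map: each worker emits a (key, value) pair per card, here (suit, 1)."""
    return [(suit, 1) for (_rank, suit) in chunk]

def shuffle_and_sort(mapped_chunks):
    """Partition + shuffle + sort: gather all values of a key together, ordered by key."""
    groups = defaultdict(list)
    for pairs in mapped_chunks:
        for key, value in pairs:
            groups[key].append(value)
    return sorted(groups.items())

def reduce_phase(grouped):
    """Reduce: collapse the list of values of each key into a single result."""
    return {key: sum(values) for key, values in grouped}

suits = ["clubs", "diamonds", "hearts", "spades"]
ranks = ["A", "2", "3", "4", "5", "6", "7", "8", "9", "10", "J", "Q", "K"]
deck = [(rank, suit) for rank, suit in product(ranks, suits)]

chunks = split(deck, 4)
mapped = [map_phase(chunk) for chunk in chunks]
counts = reduce_phase(shuffle_and_sort(mapped))
```

In a real cluster each pile would be mapped on a different machine and the shuffle would move pairs over the network; here everything runs in one process to keep the data flow visible.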

16 Let's play a game! Method 4: Use Spark

17 Wrapping up
Used for: Machine Learning (Exploration)
Pros:
- Simple to use
- Wide ML library
Cons:
- You shall not scale
- Code can get messy

18 Wrapping up
Used for: Data storage + Batch Processing + SQL
Pros:
- Scalable to 1000+ machines
- Failure resilience
- Mature
- Very powerful data structures
Cons:
- Heavy to handle
- MapReduce jobs are hard to write
- SQL can't do everything

19 Wrapping up: FLAMSPARK
Used for: Machine Learning at Scale + Interactive Analytics
Pros:
- Very versatile
- Faster than MapReduce
- Easy to write jobs
- Works (not only) with Hadoop
- Easy to interface with SQL
- Strong community, fast growth
- Many connectors (Python, R)
Cons:
- Hard to master
- Young
- Less stable
- Less feature-complete

20 (A brief, non-exhaustive) History of Hadoop (and co.)
- Proprietary: Google publishes papers on Google File System, MapReduce and BigTable
- Open Source: five projects shown as logos (not recoverable)
[Timeline figure: year labels recoverable from the slide, in order: 2006, 2005, 2006, 2007, 2008]

21 Questions?

22 Our Machine Learning Algorithms
Our Models:
- Logistic Regression
- L2 regularization
- Distributed algorithms in Spark
Our Optimizers:
- Hybrid models: 1st epoch with SGD, then high-precision convergence with L-BFGS (OWL-QN for non-smooth functions)
Current explorations:
- Evaluation of Factorization Machines and multinomial logistic regression
- Finalization of decision process modelling through a Hidden Markov Model (HMM)
Features Engineering:
- Timestamps added to our features
- Features selection:
  - Features frequencies and strengths
  - Noisy features removed
  - L1 regularization to select top features
  - Stepwise subset selection
  - User groups classification
  - Sites/pages classification (semi-supervised)
  - User groups crossed with site groups
  - Feature hierarchizing and crossing
- Gradient Boosted Decision Trees used to generate more complex features
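As an illustration of the first phase of the hybrid scheme, the sketch below runs one SGD epoch for an L2-regularized logistic regression on sparse rows. It is a single-machine toy under stated assumptions: the learning rate, the data and all function names are hypothetical, the penalty is taken as lam times the squared norm of the weights, and the L-BFGS phase that would follow is not shown.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def score(w, row):
    """Linear score of a sparse row {feature_index: value}."""
    return sum(w[j] * v for j, v in row.items())

def risk(w, rows, ys, lam):
    """Mean logistic loss over the sample plus lam * ||w||^2."""
    total = 0.0
    for row, y in zip(rows, ys):
        z = score(w, row)
        # log(1 + exp(-z)) for y = 1, log(1 + exp(z)) for y = 0
        total += math.log1p(math.exp(-z if y == 1 else z))
    return total / len(rows) + lam * sum(wj * wj for wj in w)

def sgd_epoch(w, rows, ys, lam, lr):
    """One pass over the data, one gradient step per example.

    Per-example gradient: (sigmoid(w.x) - y) * x + 2 * lam * w.
    """
    w = list(w)
    shrink = 1.0 - 2.0 * lr * lam          # effect of the L2 term on every weight
    for row, y in zip(rows, ys):
        error = sigmoid(score(w, row)) - y  # computed before the weights move
        w = [wj * shrink for wj in w]
        for j, v in row.items():            # loss term touches active features only
            w[j] -= lr * error * v
    return w

# Toy data: feature 0 signals a conversion, feature 1 the opposite, feature 2 is noise.
rows = [{0: 1.0}, {1: 1.0}, {0: 1.0, 2: 1.0}, {1: 1.0, 2: 1.0}]
ys = [1, 0, 1, 0]

w0 = [0.0, 0.0, 0.0]
w1 = sgd_epoch(w0, rows, ys, lam=0.01, lr=0.5)
risk_before = risk(w0, rows, ys, lam=0.01)
risk_after = risk(w1, rows, ys, lam=0.01)
```

A cheap first pass like this moves the weights into a reasonable region; a quasi-Newton method such as L-BFGS can then take over from `w1` to reach a precise optimum.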

23 Meet FlamSpark, Our Machine Learning Framework
What is FlamSpark?
- An ML framework sitting on top of Apache Spark and exploiting existing mathematical libraries (MLlib, Breeze)
Why Spark?
- Our models in Python couldn't scale above 5 million observations and 20,000 features.
- Distributed computing was our preferred option to scale, and Spark the best framework for this.
Why FlamSpark?
- Simplify manipulation of raw data
- Rapid ML model prototyping in Spark
- Support custom mathematical features beyond existing libraries (e.g., L-BFGS and OWL-QN optimizers, ElasticNet and adaptive regularization, Factorization Machine algorithm)
- ROC + Precision / Recall + segments generation for use in RTB
How does it scale?
- Runs Logistic Regression with 100 million observations and 500,000 features (on a 10-node cluster)
