Learning at scale on Hadoop

Similar documents

Tuning WebSphere Application Server ND 7.0. Royal Cyber Inc.

Hadoop Ecosystem B Y R A H I M A.

Journée Thématique Big Data 13/03/2015

BENCHMARKING CLOUD DATABASES CASE STUDY on HBASE, HADOOP and CASSANDRA USING YCSB

A Novel Cloud Based Elastic Framework for Big Data Preprocessing

Case Study : 3 different hadoop cluster deployments

Graylog2 Lennart Koopmann, OSDC /

Spark in Action. Fast Big Data Analytics using Scala. Matei Zaharia. project.org. University of California, Berkeley UC BERKELEY

A Performance Analysis of Distributed Indexing using Terrier

Performance Comparison of SQL based Big Data Analytics with Lustre and HDFS file systems

OBSERVEIT DEPLOYMENT SIZING GUIDE

Taking the First Steps in. Web Load Testing. Telerik

Lecture 10 - Functional programming: Hadoop and MapReduce

Benchmarking Hadoop & HBase on Violin

Big Data Rethink Algos and Architecture. Scott Marsh Manager R&D Personal Lines Auto Pricing

Four Orders of Magnitude: Running Large Scale Accumulo Clusters. Aaron Cordova Accumulo Summit, June 2014

Detection of Distributed Denial of Service Attack with Hadoop on Live Network

ISSN: CONTEXTUAL ADVERTISEMENT MINING BASED ON BIG DATA ANALYTICS

Using distributed technologies to analyze Big Data

How Companies are! Using Spark

Introduction to Big Data! with Apache Spark" UC#BERKELEY#

Copyright 2012, Oracle and/or its affiliates. All rights reserved.

HiBench Introduction. Carson Wang Software & Services Group

Data Pipeline with Kafka

Zing Vision. Answering your toughest production Java performance questions

In-memory data pipeline and warehouse at scale using Spark, Spark SQL, Tachyon and Parquet

Spark: Cluster Computing with Working Sets

Big Data at Spotify. Anders Arpteg, Ph D Analytics Machine Learning, Spotify

Big-data Analytics: Challenges and Opportunities

Applying Apache Hadoop to NASA s Big Climate Data!

Big Data Infrastructure at Spotify

Lambda Architecture. Near Real-Time Big Data Analytics Using Hadoop. January Website:

STREAM PROCESSING AT LINKEDIN: APACHE KAFKA & APACHE SAMZA. Processing billions of events every day

Delivering Quality in Software Performance and Scalability Testing

Spark. Fast, Interactive, Language- Integrated Cluster Computing

Unified Big Data Processing with Apache Spark. Matei

Fast Analytics on Big Data with H20

Experiences with Lustre* and Hadoop*

CS555: Distributed Systems [Fall 2015] Dept. Of Computer Science, Colorado State University

Rakam: Distributed Analytics API

HADOOP MOCK TEST HADOOP MOCK TEST II

Microsoft Dynamics NAV 2013 R2 Sizing Guidelines for On-Premises Single Tenant Deployments

Jun Liu, Senior Software Engineer Bianny Bian, Engineering Manager SSG/STO/PAC

Outline. High Performance Computing (HPC) Big Data meets HPC. Case Studies: Some facts about Big Data Technologies HPC and Big Data converging

Scalable Machine Learning - or what to do with all that Big Data infrastructure

Tableau Server Scalability Explained

Building your Big Data Architecture on Amazon Web Services

The Power of Pentaho and Hadoop in Action. Demonstrating MapReduce Performance at Scale

Efficient Network Marketing - Fabien Hermenier A.M.a.a.a.C.

Hortonworks & SAS. Analytics everywhere. Page 1. Hortonworks Inc All Rights Reserved

The Flink Big Data Analytics Platform. Marton Balassi, Gyula Fora" {mbalassi,

BigMemory and Hadoop: Powering the Real-time Intelligent Enterprise

Oracle Database - Engineered for Innovation. Sedat Zencirci Teknoloji Satış Danışmanlığı Direktörü Türkiye ve Orta Asya

Monitis Project Proposals for AUA. September 2014, Yerevan, Armenia

Big Data on AWS. Services Overview. Bernie Nallamotu Principle Solutions Architect

Architectural patterns for building real time applications with Apache HBase. Andrew Purtell Committer and PMC, Apache HBase

WEBLOGIC ADMINISTRATION

The Greenplum Analytics Workbench

RevoScaleR Speed and Scalability

Apache Spark : Fast and Easy Data Processing Sujee Maniyam Elephant Scale LLC sujee@elephantscale.com

Improve performance and availability of Banking Portal with HADOOP

Petabyte Scale Data at Facebook. Dhruba Borthakur, Engineer at Facebook, SIGMOD, New York, June 2013

Storage Sizing Issue of VDI System

Optimization and analysis of large scale data sorting algorithm based on Hadoop

Understanding Hadoop Performance on Lustre

Massive Cloud Auditing using Data Mining on Hadoop

NoSQL for SQL Professionals William McKnight

KNIME & Avira, or how I ve learned to love Big Data

Big Data for Big Intel

The Impact of Big Data on Classic Machine Learning Algorithms. Thomas Jensen, Senior Business Expedia

Jérôme Lepage - CFCamp Mondadori France

HADOOP AT NOKIA JOSH DEVINS, NOKIA HADOOP MEETUP, JANUARY 2011 BERLIN

Hadoop Ecosystem Overview. CMSC 491 Hadoop-Based Distributed Computing Spring 2015 Adam Shook

Building Scalable Big Data Infrastructure Using Open Source Software. Sam William

GraySort on Apache Spark by Databricks

Open source Google-style large scale data analysis with Hadoop

Hypertable Goes Realtime at Baidu. Yang Dong Sherlock Yang(

Cloudera Enterprise Reference Architecture for Google Cloud Platform Deployments

CSE-E5430 Scalable Cloud Computing Lecture 2

Hadoop Architecture. Part 1

L3: Statistical Modeling with Hadoop

SDFS Overview. By Sam Silverberg

System Requirements Table of contents

Architecting for the next generation of Big Data Hortonworks HDP 2.0 on Red Hat Enterprise Linux 6 with OpenJDK 7

Analytics on Spark &

Spark: Making Big Data Interactive & Real-Time

the missing log collector Treasure Data, Inc. Muga Nishizawa

Bringing Big Data Modelling into the Hands of Domain Experts

Tutorial: Load Testing with CLIF

H2O on Hadoop. September 30,

Ground up Introduction to In-Memory Data (Grids)

Application of Predictive Analytics for Better Alignment of Business and IT

Characterizing Task Usage Shapes in Google s Compute Clusters

Performance Monitor. Intellicus Web-based Reporting Suite Version 4.5. Enterprise Professional Smart Developer Smart Viewer

Scalable Data Analysis in R. Lee E. Edlefsen Chief Scientist UserR! 2011

Well packaged sets of preinstalled, integrated, and optimized software on select hardware in the form of engineered systems and appliances

Benchmark Study on Distributed XML Filtering Using Hadoop Distribution Environment. Sanjay Kulhari, Jian Wen UC Riverside

Can t We All Just Get Along? Spark and Resource Management on Hadoop

PEPPERDATA IN MULTI-TENANT ENVIRONMENTS

Performance Comparison of Intel Enterprise Edition for Lustre* software and HDFS for MapReduce Applications

Transcription:

Learning at scale on Hadoop Olivier Toromanoff, Software engineer Berlin Buzzwords, 2015-06-01 Copyright 2014 Criteo

Prediction @ Criteo Learning on Hadoop Limits are lower than the sky From Experimentation to Production: a model lifecycle Copyright 2014 Criteo

Criteo: Performance advertising «The right ad at the right time to the right user» Background Layout Opt out link Slogans WARM MEETS LIGHT ALL ORIGINALS #REPRESENT SWEET NOTHING ADDIDAS IS ALL IN Call to actions all original #represent JOIN NOW JOIN NOW JOIN NOW SHOP NOW SHOP NOW SHOP NOW JOIN NOW Butons SEE MORE SEE MORE CLICK HERE CLICK HERE 6ms Colours

Criteo: Performance advertising Client Publisher CPC CP M * CPC

Criteo: Some numbers +12 K 6 DATA CENTERS MORE THAN 12 000 SERVERS PEAK TRAFFIC (PER SECOND) 800K HTTP REQUESTS CDH4 CLUSTER OF 1200 NODES (36TB, 96GB RAM, 24 cores)

The display cycle Learning Displ ay At real time, we predict for: - Campaign selection - Bidding - Product recommendation - Layout customization Logs generation User event : Click / Sales Tracking

Click prediction modelling = (logistic) = (geometric)

The nice sides of Logistic Regression Convex Optimisation Solvable with iterative Gradient Descent Algorithms (L-BFGS) Fast Prediction at runtime

Prediction @ Criteo Learning on Hadoop Limits are lower than the sky From Experimentation to Production: a model lifecycle Copyright 2014 Criteo

Need for Scale Learning a model : 7 days of data 30 000 000 clicks / 570 000 000 displays ~ 7 TB compressed 190 000 000 lines ~ 90 features / lines and we have ~200 of them (click/sales/ x DCs x ABTests).. and we want to refresh as much as possible..

But a strong «legacy» Existing in-house Machine Learning library in C# Front-end code in C# Do we really want to: Re-implement missing functionalities in open source library? Rewrite all the learning code base from scratch? Take the risk to introduce and maintain a cross language duplication? hmmm.. let s try to run C# on Hadoop!

Let s meet «Recor drea der Recor drea der in» Streamin g stdi n mono MyMapper.exe stdou t Shuf e stdi n stdi n mono MyMapper.exe stdou t mono MyReducer.exe stdou t HDFS

Taming the beast Limitation on arguments length Fitting Mono VM + JVM into container memory Additional mini sub-processes in Java to interact with HDFS Stumbling upon Mono NotImplementedException Still (rare) issues with Mono: Crashes when Garbage Collector under heavy load Hanging threads

But iterative? You said iterative? Data + Solution n MAP Converge in 10 to 50 iterations MAP REDUC E Solution n+1 MAP

Let s «all-reduce» Mappe r Mappe r Mappe r Reduce r Cache on Reduce r Cache on disk Reduce r Cache on disk Reduce r Cache on disk Reduce r Cache on disk disk ZooKeep er

Distributing ain t easy No resilience to reducer crashes Keeping reducers for up to 3h Processing will start only when all reducers are provisioned

One (production) year later ~ 1300 models/day Ingesting 596 TB/day Consuming 6310 CPU day/day Learning time: [10min; 3h] Refresh rate: [3h; 6h] Last two weeks success rate: 97.4%

Prediction @ Criteo Learning on Hadoop Limits are lower than the sky From Experimentation to Production: a model lifecycle Copyright 2014 Criteo

Big cluster, small gateway Jobs are all launched from a single machine Asynchronous jobs in Yarn containers

Wasting resources is easy ( and painful) We started to see jobs waiting containers for hours Loooots of small mappers CombineFileInputFormat

What about further scaling? 98% success rate is good for now What about 2x more models, more data, more reducers,

Prediction @ Criteo Learning on Hadoop Limits are lower than the sky From Experimentation to Production: a model lifecycle Copyright 2014 Criteo

Hello «TestFramework»! An internal tool that replays Prod traffic from logs Used to: Train models Exercise models Compute metrics

Prediction analytics Observations: Basic (clicks, ) Prediction (Observed Ctr, Logged PCtr, Simulated PCtr) Metrics: Basic (count, ) ML (MSE, LLH, ) ML+Business (MSE_weightedBy*, ) Business (Advertiser added value, Criteo gross, )

Prediction analytics: MR job Mapper s Parsing/Filtering/Transformation Parsing/Filtering/Transformation dim1=mod1; dim1=mod1; dimn=modn dimn=modn Historic al Prod Logs Prediction Prediction dim1=mod1; dim1=mod1; dimn=modn; dimn=modn; Pred1=p1 Pred1=p1 PreAggregation PreAggregation SELECT SELECT metrics(observations) metrics(observations) GROUP GROUP BY BY dimensions dimensions PreAggrega ted Result Reduc er Final Aggregation Resul t

Offline ABTesting My model improved this fancy MSE_weightedBy430 yeah!.. weeks of productification work latter, IRL it under Counterfactual analysis: performed :( Online exploration allows us to do ofine evaluation of how the tested model would have performed

From Offline Test to ABTest

ABTest monitoring Positive

In metrics we trust My ABTest: +0.3% on the first 2 hours yeah!.. but 2 days after: -1% was it just noise? Computing confidence interval using bootstrapping: By computing metrics from several instances of the dataset generated with random sampling, we get accuracy measures

Wrapping-up Prediction at the core of Criteo s platform Thanks to Hadoop, we could greatly distribute our learnings: 1300 models learnt from 600TB daily Even with somewhat unorthodox implementation: 97.4% success rate mono + Hadoop Streaming all-reduce Some resources optimization needed to scale An integrated testbed from Ofine ABTest to monitoring metrics