Matchmaking in the Cloud: Amazon EC2 and Apache Hadoop at eharmony

Similar documents

Hadoop: A Framework for Data- Intensive Distributed Computing. CS561-Spring 2012 WPI, Mohamed Y. Eltabakh

Open source Google-style large scale data analysis with Hadoop

Introduction to Big Data! with Apache Spark" UC#BERKELEY#

Map Reduce & Hadoop Recommended Text:

Jeffrey D. Ullman slides. MapReduce for data intensive computing

Apache Hadoop. Alexandru Costan

Large scale processing using Hadoop. Ján Vaňo

Take An Internal Look at Hadoop. Hairong Kuang Grid Team, Yahoo! Inc

Introduction to Hadoop. New York Oracle User Group Vikas Sawhney

Hadoop implementation of MapReduce computational model. Ján Vaňo

CSE 590: Special Topics Course ( Supercomputing ) Lecture 10 ( MapReduce& Hadoop)

A very short Intro to Hadoop

Open source large scale distributed data management with Google s MapReduce and Bigtable

Hadoop & its Usage at Facebook

Hadoop IST 734 SS CHUNG

MapReduce, Hadoop and Amazon AWS

Hadoop & Spark Using Amazon EMR

Big Business, Big Data, Industrialized Workload

Introduction to Hadoop HDFS and Ecosystems. Slides credits: Cloudera Academic Partners Program & Prof. De Liu, MSBA 6330 Harvesting Big Data

Big Data With Hadoop

Hadoop 只支援用 Java 開發嘛? Is Hadoop only support Java? 總不能全部都重新設計吧? 如何與舊系統相容? Can Hadoop work with existing software?

BIG DATA TRENDS AND TECHNOLOGIES

Drupal in the Cloud. Scaling with Drupal and Amazon Web Services. Northern Virginia Drupal Meetup

Big Data Storage Options for Hadoop Sam Fineberg, HP Storage

Yahoo! Grid Services Where Grid Computing at Yahoo! is Today

Open Cloud System. (Integration of Eucalyptus, Hadoop and AppScale into deployment of University Private Cloud)

t] open source Hadoop Beginner's Guide ij$ data avalanche Garry Turkington Learn how to crunch big data to extract meaning from

Open source software framework designed for storage and processing of large scale data on clusters of commodity hardware

HDFS Cluster Installation Automation for TupleWare

The Inside Scoop on Hadoop

What We Can Do in the Cloud (2) -Tutorial for Cloud Computing Course- Mikael Fernandus Simalango WISE Research Lab Ajou University, South Korea

Hadoop and its Usage at Facebook. Dhruba Borthakur June 22 rd, 2009

Scaling Out With Apache Spark. DTL Meeting Slides based on

Prepared By : Manoj Kumar Joshi & Vikas Sawhney

How To Scale Out Of A Nosql Database

Xiaoming Gao Hui Li Thilina Gunarathne

Hadoop & its Usage at Facebook

Analytics in the Cloud. Peter Sirota, GM Elastic MapReduce

Scalable Application. Mikalai Alimenkou

Big Data on Microsoft Platform

Amazon Elastic Compute Cloud Getting Started Guide. My experience

From Wikipedia, the free encyclopedia

Chapter 11 Map-Reduce, Hadoop, HDFS, Hbase, MongoDB, Apache HIVE, and Related

Scalable Architecture on Amazon AWS Cloud

Hadoop Distributed File System. Dhruba Borthakur Apache Hadoop Project Management Committee

W H I T E P A P E R. Deriving Intelligence from Large Data Using Hadoop and Applying Analytics. Abstract

Apache Hadoop new way for the company to store and analyze big data

LARGE-SCALE DATA PROCESSING USING MAPREDUCE IN CLOUD COMPUTING ENVIRONMENT

PLATFORM AND SOFTWARE AS A SERVICE THE MAPREDUCE PROGRAMMING MODEL AND IMPLEMENTATIONS

How To Use Hadoop

Big data blue print for cloud architecture

Big Data Infrastructure at Spotify

Big Fast Data Hadoop acceleration with Flash. June 2013

Implement Hadoop jobs to extract business value from large and varied data sets

Hadoop: Distributed Data Processing. Amr Awadallah Founder/CTO, Cloudera, Inc. ACM Data Mining SIG Thursday, January 25 th, 2010

CSE 344 Introduction to Data Management. Section 9: AWS, Hadoop, Pig Latin TA: Yi-Shu Wei

Lecture 32 Big Data. 1. Big Data problem 2. Why the excitement about big data 3. What is MapReduce 4. What is Hadoop 5. Get started with Hadoop

Building your Big Data Architecture on Amazon Web Services

Cloud computing - Architecting in the cloud

Introduction to MapReduce and Hadoop

Introduction to Big data. Why Big data? Case Studies. Introduction to Hadoop. Understanding Features of Hadoop. Hadoop Architecture.

Using Cloud Services for Test Environments A case study of the use of Amazon EC2

Department of Computer Science University of Cyprus EPL646 Advanced Topics in Databases. Lecture 14

Hadoop. Apache Hadoop is an open-source software framework for storage and large scale processing of data-sets on clusters of commodity hardware.

A programming model in Cloud: MapReduce

Lambda Architecture for Batch and Real- Time Processing on AWS with Spark Streaming and Spark SQL. May 2015

Hadoop Introduction. Olivier Renault Solution Engineer - Hortonworks

Weekly Report. Hadoop Introduction. submitted By Anurag Sharma. Department of Computer Science and Engineering. Indian Institute of Technology Bombay

Chapter 7. Using Hadoop Cluster and MapReduce

Introduction to Hadoop

ENABLING GLOBAL HADOOP WITH EMC ELASTIC CLOUD STORAGE

Hadoop Distributed File System. T Seminar On Multimedia Eero Kurkela

How To Create A Data Visualization With Apache Spark And Zeppelin

DATA MINING WITH HADOOP AND HIVE Introduction to Architecture

Lecture 2 (08/31, 09/02, 09/09): Hadoop. Decisions, Operations & Information Technologies Robert H. Smith School of Business Fall, 2015

CSE-E5430 Scalable Cloud Computing Lecture 2

Hadoop Parallel Data Processing

Case Study : 3 different hadoop cluster deployments

Getting Started with Hadoop with Amazon s Elastic MapReduce

Oracle s Big Data solutions. Roger Wullschleger. <Insert Picture Here>

Using Hadoop for Webscale Computing. Ajay Anand Yahoo! Usenix 2008

Hadoop and Map-Reduce. Swati Gore

Cloud Computing using MapReduce, Hadoop, Spark

CONFIGURING ECLIPSE FOR AWS EMR DEVELOPMENT

So What s the Big Deal?

Hadoop Submitted in partial fulfillment of the requirement for the award of degree of Bachelor of Technology in Computer Science

Lecture 10 - Functional programming: Hadoop and MapReduce

Hadoop. MPDL-Frühstück 9. Dezember 2013 MPDL INTERN

GeoCloud Project Report USGS/EROS Spatial Data Warehouse Project

A Cost-Evaluation of MapReduce Applications in the Cloud

Hadoop Distributed File System. Dhruba Borthakur Apache Hadoop Project Management Committee June 3 rd, 2008

ESS event: Big Data in Official Statistics. Antonino Virgillito, Istat

Role of Cloud Computing in Big Data Analytics Using MapReduce Component of Hadoop

Cloud Platforms, Challenges & Hadoop. Aditee Rele Karpagam Venkataraman Janani Ravi

NoSQL and Hadoop Technologies On Oracle Cloud

Pro Apache Hadoop. Second Edition. Sameer Wadkar. Madhu Siddalingaiah

HDFS. Hadoop Distributed File System

Big Data and Market Surveillance. April 28, 2014

International Journal of Advancements in Research & Technology, Volume 3, Issue 2, February ISSN

Duke University

Transcription:

Matchmaking in the Cloud: Amazon EC2 and Apache Hadoop at eharmony Speaker logo centered below image Steve Kuo, Software Architect Joshua Tuberville, Software Architect

Goal > Leverage EC2 and Hadoop to scale data intensive problem that are exceeding the limits of our data center and data warehouse environment. 2

Agenda > Background and Motivation > Hadoop Overview > Hadoop at eharmony > Architecture > Tools and Performance > Roadblocks > Future Directions > Summary > Questions 3

About eharmony > Online subscription-based matchmaking service > Launched in 2000 using compatibility models > Available in United States, Canada, Australia and United Kingdom. > On average, 236 members in US marry every day. > More than 20 million registered users. 4

About eharmony > Models are based on decades of research and clinical experience in psychology > Variety of user attributes Demographic Psychographic Behavioral 5

Model Creation Research Models Matching Process User Attributes Matches Scorer Scores 6

Matching Research Models Matching Process User Attributes Matches Scorer Scores 7

Predicative Model Scores Research Models Matching Process User Attributes Matches Scorer Scores 8

Match Outcomes Feedback Research Models Matching Process User Attributes Matches Scorer Scores 9

Model Validation and Creation Research Models Matching Process User Attributes Matches Scorer Scores 10

Scorer Requirements > Tens of GB of matches, scores and constantly changing user features are archived daily > TB of data currently archived and growing > 10x our current user base > All possible matches = O(n 2 ) problem > Support a growing set of models that may be arbitrarily complex computationally and I/O expensive. 11

Scaling Challenges > Current architecture is multi-tiered with a relational back-end > Scoring is DB join intensive > Data need constant archiving Matches, match scores, user attributes at time of match creation Model validation is done at a later time across many days > Need a non-db solution 12

> Open Source Java implementation of Google s MapReduce paper Created by Doug Cutting > Top-level Apache project Yahoo is a major sponsor 13

> Distributes work across vast amounts of data 14

> Distributes work across vast amounts of data > Hadoop Distributed File System (HDFS) provides reliability through replication 15

> Distributes work across vast amounts of data > Hadoop Distributed File System (HDFS) provides reliability through replication > Automatic re-execution on failure/distribution 16

> Distributes work across vast amounts of data > Hadoop Distributed File System (HDFS) provides reliability through replication > Automatic re-execution on failure/distribution > Scale horizontally on commodity hardware 17

> Simple Storage Service (S3) provides cheap unlimited storage Storage $0.15/GB First 50 TB/Month Transfer $0.17/GB First 10 TB/Month 18

> Simple Storage Service (S3) provides cheap unlimited storage. Storage $0.15/GB First 50 TB/Month Transfer $0.17/GB First 10 TB/Month > Elastic Cloud Computing (EC2) enables horizontal scaling by adding servers on demand. C1-medium $0.2/hour 2 cores 1.7 GB Memory C1-xlarge $0.8/hour 8 cores 7 GB Memory 32-bit OS 64-bit OS 19

> Elastic MapReduce is a hosted Hadoop framework on top EC2 and S3. > It s in beta and US only. > Pricing is in addition to EC2 and S3. Instance Type EC2 Elastic MapReduce C1-medium $0.2/hour $0.03/hour C1-xlarge $0.8/hour $0.12/hour 20

Agenda > Background and Motivation > Hadoop Overview > Hadoop at eharmony > Architecture > Tools and Performance > Roadblocks > Future Directions > Summary > Questions 21

How MapReduce Works > Applications are modeled as a series of maps and reductions > Example - Word Count Counts the frequency of words Modeled as one Map and one Reduce Data as key -> values 22

Word Count - Input I will not eat them here nor there I will not eat them anywhere

Word Count Pass to Mappers I will not eat them here nor there I will not eat them anywhere Map1 Map2

Word Count Perform Map I will not eat them here nor there I will not eat them anywhere Map1 I -> 1 will -> 1 not -> 1 eat -> 1 them -> 1 Map2 I -> 1 will -> 1 not -> 1 eat -> 1 green -> 1

Word Count Shuffle and Sort I -> 1 will -> 1 not -> 1 eat -> 1 them -> 1 am -> 1 and -> 1 anywhere -> 1 eat -> 1 eat -> 1 I -> 1 will -> 1 not -> 1 eat -> 1 green -> 1 nor -> 1 not -> 1 not -> 1 not -> 1 not -> 1 Sam -> 1

Word Count Pass to Reducers am -> 1 and -> 1 anywhere -> 1 eat -> 1 eat -> 1 Reduce1 nor -> 1 not -> 1 not -> 1 not -> 1 not -> 1 Sam -> 1 Reduce2

Word Count Perform Reduce am -> 1 and -> 1 anywhere -> 1 eat -> 1 eat -> 1 Reduce1 am,1 and,1 anywhere,1 eat,4 eggs,1 nor -> 1 not -> 1 not -> 1 not -> 1 not -> 1 Sam -> 1 Reduce2 nor,1 not,4 Sam,1 them,3 there,1

Hadoop Benefits > Mapper and Reducer are written by you > Hadoop provides Parallelization Shuffle and sort 29

Agenda > Background and Motivation > Hadoop Overview > Hadoop at eharmony > Architecture > Tools and Performance > Roadblocks > Future Directions > Summary > Questions 30

Real World MapReduce > Complex applications use a series of MapReduces > Values from one step can become keys for another > Match Scoring Example Join match data and user A attributes into one line Join above with user B attributes and calculate the match score Group by match scores by user 31

Join: Map Match Adam Betty Carlos Debbie Adam June Map1 Adam -> Adam Betty Carlos -> Carlos Debbie Adam -> Adam June User Attribute Map2 Adam -> 1 Betty -> 2 Carlos -> 3 Debbie -> 4 Fred -> 5 June -> 6

Join: Shuffle and Sort Adam -> Adam Betty Carlos -> Carlos Debbie Adam -> Adam June Adam -> Adam Betty Adam -> 1 Adam -> Adam June.. Adam -> 1 Betty -> 2 Carlos -> 3 Debbie -> 4 Fred -> 5 June -> 6 Carlos -> Carlos Debbie Carlos -> 3

Join: Reduce Adam -> Adam Betty Adam -> 1 Adam -> Adam June Reduce1 Adam -> Adam Betty 1 Adam -> Adam June 1 Carlos -> Carlos Debbie Carlos -> 3 Reduce2 Carlos -> Carlos Debbie 3

Join and Score: Map Adam Betty 1 Carlos Debbie 3 Adam June 1 Map1 Betty -> Adam Betty 1 Debbie -> Carlos Debbie 3 June -> Adam June 1 Map2

Join and Score: Shuffle and Sort Betty -> Adam Betty 1 Debbie -> Carlos Debbie 3 June -> Adam June 1 Betty -> Adam Betty 1 Betty -> 2 Debbie -> Carlos Debbie 3 Debbie -> 4 June -> Adam June 1 June -> 6

Join and Score: Reduce Betty -> Adam Betty 1 Betty -> 2 June -> Adam June 1 June -> 6 Reduce1 Adam Betty -> 1 2 Adam June -> 1 6 Debbie -> Carlos Debbie 3 Debbie -> 4 Reduce2 Carlos Debbie -> 3 4

Join and Score: Reduce (Optimized) Betty -> Adam Betty 1 Betty -> 2 June -> Adam June 1 June -> 6 Reduce1 Adam Betty -> 3 Adam June -> 7 Debbie -> Carlos Debbie 3 Debbie -> 4 Reduce2 Carlos Debbie -> 7

Combine: Map Adam Betty 3 Adam June 7 Map1 Adam -> Adam Betty 3 Adam -> Adam June 7 Carlos Debbie 7 Map2 Carlos -> Carlos Debbie 7

Combine: Shuffle and Sort Adam -> Adam Betty 3 Adam -> Adam June 7 Adam -> Adam Betty 3 Adam -> Adam June 7 Carlos -> Carlos Debbie 7 Carlos -> Carlos Debbie 7

Combine: Reduce Adam -> Adam Betty 3 Adam -> Adam June 7 Reduce1 Adam:Betty=3 June=7 Carlos -> Carlos Debbie 7 Reduce2 Carlos:Debbie=7

Agenda > Background and Motivation > Hadoop Overview > Hadoop at eharmony > Architecture > Tools and Performance > Roadblocks > Future Directions > Summary > Questions 42

Overview 43

Hadoop Run Data Warehouse Unload 44

Hadoop Run Upload to S3 45

Hadoop Run Start EC2 Cluster 46

Hadoop Run MapReduce Jobs 47

Hadoop Run Stop EC2 Cluster 48

Hadoop Run Download from S3 49

Hadoop Run Store Result 50

AWS Elastic MapReduce > Only need to think in terms of Elastic MapReduce job flow > EC2 cluster is managed for you behind the scenes > Each job flow has one or more steps > Each step is a Hadoop MapReduce process > Each step can read and write data directly from and to S3 > Based on Hadoop 0.18.3 51

Elastic MapReduce for eharmony > Vastly simplify our scripts No need to explicitly allocate, start and shutdown EC2 instances Individual jobs were managed by a remote script running on master node (no longer required) Jobs are arranged into a job flow, created with a single command Status of a job flow and all its steps are accessible by a REST service 52

Simplified Job Control > Before Elastic Map Reduce, we had to explicitly: Allocate cluster Verify cluster Push application to cluster Run a control script on the master Kick off each job step on the master Create and detect a job completion token Shut the cluster down > Over 150 lines of script just for this 53

Simplified Job Control > With Elastic MapReduce we can do all this with a single local command > Uses jar and conf files stored on S3 #{ELASTIC_MR_UTIL} --create --name #{JOB_NAME} --num-instances #{NUM_INSTANCES} --instance-type #{INSTANCE_TYPE} --key_pair #{KEY_PAIR_NAME} --log-uri #{SCORER_LOG_BUCKET_URL} --jar #{SCORER_JAR_S3_PATH} --main-class #{MR_JAVA_PACKAGE}.join.JoinJob --arg - xconf --arg #{MASTER_CONF_DIR}/join-config.xml --jar #{SCORER_JAR_S3_PATH} --main-class #{MR_JAVA_PACKAGE}.scorer.ScorerJob -- arg -xconf --arg #{MASTER_CONF_DIR}/scorer-config.xml --jar #{SCORER_JAR_S3_PATH} --main-class #{MR_JAVA_PACKAGE}.combiner.CombinerJob --arg -xconf --arg #{MASTER_CONF_DIR}/ combiner-config-#{target_env}.xml 54

Agenda > Background and Motivation > Hadoop Overview > Hadoop at eharmony > Architecture > Tools and Performance > Roadblocks > Future Directions > Summary > Questions 55

Development Environment > Cheap to set up on Amazon > Quick setup Number of servers is controlled by a config variable > Identical to production > Separate development account recommended 56

Testing Demo > Test Hadoop jobs in IDE > A Hadoop cluster is not required 57

Unit Test Demo 58

Unit Test Demo 59

EC2, S3 and Hadoop Monitoring Tools > ElasticFox for EC2 > Hadoop provides web pages for job and disk monitoring. > Tim Kay s AWS command line tool for S3 > Plus many more 60

ElasticFox Manage EC2 Cluster 61

Hadoop JobTracker Monitor Jobs 62

Hadoop DFS Monitor Disk Usage 63

AWS Management Console > Useful for Elastic MapReduce Terminate job flow Track execution > Also monitor EC2 64

AWS Management Console Elastic MapReduce BETA 65

AWS Management Console EC2 66

Hadoop on EC2 Performance Minutes 350 331 300 250 212 200 150 144 100 100 50 0 c1-medium, 24 nodes c1-xlarge, 24 nodes c1-medium, 49 nodes c1-xlarge, 49 nodes Hadoop 0.18.0 67

Total Execution Time 68

Our Cost > Average EC2 and S3 Cost Each run is 2 to 3 hours $1200/month for EC2 $100/month for S3 69

Our Cost > Average EC2 and S3 Cost Each run is 2 to 3 hours $1200/month for EC2 $100/month for S3 > Projected in-house cost $5000/month for a local cluster of 50 nodes running 24/7 A new company needs to add data center and operation personnel expense 70

Agenda > Background and Motivation > Hadoop Overview > Hadoop at eharmony > Architecture > Tools and Performance > Roadblocks > Future Directions > Summary > Questions 71

Process Control is Non-Trivial > Scripts galore > One bash script for each stage > Moving to Ruby Good exception handling and control structures More productive 72

Design for Failure > The overall process depends on the success of each stage > Every stage is unreliable > Hadoop master node is a single point of failure; slave nodes are fault tolerant > Need to build retry logic to handle failures 73

Design for Failure S3 Web Service > S3 web service can time out > Add logic to validate file is correctly uploaded/ downloaded from S3 > We retry once on failure 74

Design for Failure EC2 Web Service > On very rare occasions, EC2 would allocate less slave nodes than requested usually one less. > On even rarer occasions, no node is allocated. > Right now script fails and sends an alert email. > Consider changing the script to have it continue if the number of slave nodes missed is small > Or reallocate the missing nodes 75

Design for Failure Elastic MapReduce BETA > Provisioning of the servers is not yet stable Frequently failed in the first two weeks of the beta program It blocks until provisioning is complete Only billed for job execution time > It s handy to terminate a hanged job with Amazon Web Services Management Console > Amazon recognizes this shortcoming and is working on a solution 76

Data Shuffle > Spend same amount of time getting data out of DW, upload to and download from S3, creating local store as running Hadoop > Hadoop and EC2 scale nicely > New scaling challenge is to reduce the data shuffle time and error recovery. 77

Agenda > Background and Motivation > Hadoop Overview > Hadoop at eharmony > Architecture > Tools and Performance > Roadblocks > Future Directions > Summary > Questions 78

Handle More Modeling Data > Data warehouse unloads additional dimensions of user data into separate files for performance reason. > Currently user and match data is joined into one row which is not scalable. > Each model does not use all data. Partial joins Partial scores Combine partial scores to final score Load the delta of user data between runs 79

Data Analysis on the Cloud > Daily reporting: use Hadoop instead of depending on data warehouse. Median/Mean score per user Median/Mean number of matches per user, country, zipcode 80

Data Analysis on the Cloud with Pig > Our success opens Hadoop to other teams > Provides expertise on Hadoop > Data analysis with Pig High-level language on top of Hadoop: think SQL for RDBMS Hadoop subproject Still not non-engineer friendly for troubleshooting Slower than hand-rolled MapReduce > Yet to evaluate Hive, Cascade, others 81

Data Analysis with R > Data analysis with R Excellent environment for statistical computing and graphic presentation Open source implementation of S Supports a plugin framework New packages are continued being developed by an active community Limitation is that all data must be in memory Start to investigate RHIPE which aims to integrate R and Hadoop 82

Agenda > Background and Motivation > Hadoop Overview > Hadoop at eharmony > Architecture > Tools and Performance > Roadblocks > Future Directions > Summary > Questions 83

Lesson Learned > Reliability is the biggest challenge > Getting data out of DW is difficult and time consuming > Controlling process with shell scripts is a hassle > Dev tools really easy to work with and just work right out of the box > Standard Hadoop AMI worked great > Easy to write unit tests for MapReduce > Hadoop community support is great. > EC2/S3/EMR are cost effective 84

Acknowledgement > 236 members marry a day is based on survey conducted by Harris Interactive Research. > Hadoop is an Apache project > S3 and EC2 are part of Amazon Web Service. > R s URL is http://www.r-project.org/ > RHIPE s URL is http://ml.stat.purdue.edu/rhipe 85

Agenda > Background and Motivation > Hadoop Overview > Hadoop at eharmony > Architecture > Tools and Performance > Roadblocks > Future Directions > Summary > Questions 86

Steve Kuo, Software Architect Joshua Tuberville, Software Architect eharmony 87