Analyzing Big Data with AWS Peter Sirota, General Manager, Amazon Elastic MapReduce @petersirota
What is Big Data?
Computer-generated data: application server logs (web sites, games); sensor data (weather, water, smart grids); images/videos (traffic, security cameras)
Human-generated data: Twitter Firehose (50M tweets/day, 1,400% growth per year); blogs/reviews/emails/pictures; social graphs (Facebook, LinkedIn, contacts)
Big Data is full of valuable, unanswered questions!
Why is Big Data Hard (and Getting Harder)?
Why is Big Data Hard (and Getting Harder)? Data Volume: unconstrained growth; current systems don't scale
Why is Big Data Hard (and Getting Harder)? Data Structure: need to consolidate data from multiple data sources, in multiple formats, across multiple businesses
Why is Big Data Hard (and Getting Harder)? Changing Data Requirements: faster response times on fresher data; sampling is not good enough; increasing complexity of analytics; users demand inexpensive experimentation
We need tools built specifically for Big Data!
Innovation #1: Apache Hadoop The MapReduce computational paradigm Open source, scalable, fault tolerant, distributed system Hadoop lowers the cost of developing a distributed system for data processing
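The MapReduce paradigm mentioned above can be illustrated with the classic word-count example. This is a minimal sketch in the style of a Hadoop Streaming mapper/reducer pair, not code from the talk; in a real Hadoop job the framework handles the sorting and shuffling between the two phases.

```python
from itertools import groupby

def mapper(lines):
    """Map phase: emit a (word, 1) pair for every word of input."""
    for line in lines:
        for word in line.split():
            yield (word.lower(), 1)

def reducer(pairs):
    """Reduce phase: sum the counts for each word. Hadoop sorts the
    mapper output by key before the reduce phase, so groupby sees all
    of a word's pairs together; sorted() simulates that shuffle here."""
    for word, group in groupby(sorted(pairs), key=lambda kv: kv[0]):
        yield (word, sum(n for _, n in group))

counts = dict(reducer(mapper(["the quick brown fox", "the lazy dog"])))
# counts["the"] == 2; every other word appears once
```

Because map and reduce are independent per key, Hadoop can spread both phases across many machines and rerun failed tasks, which is what makes the system fault tolerant.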
Innovation #2: Amazon Elastic Compute Cloud (EC2) provides resizable compute capacity in the cloud. Amazon EC2 lowers the cost of operating a distributed system for data processing
Amazon Elastic MapReduce = Amazon EC2 + Hadoop
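To make the "EC2 + Hadoop" combination concrete, here is a sketch of the request an EMR job-flow launch takes, using boto3 (the current AWS SDK for Python, which postdates this talk). The bucket name, key names, and instance types are placeholders for illustration.

```python
def make_job_flow(name, num_nodes):
    """Build a run_job_flow request for a Hadoop cluster on EMR.
    All resource names here are hypothetical placeholders."""
    return {
        "Name": name,
        "LogUri": "s3://my-bucket/emr-logs/",  # placeholder bucket
        "Instances": {
            "MasterInstanceType": "m1.large",  # placeholder type
            "SlaveInstanceType": "m1.large",
            "InstanceCount": num_nodes,
            # Shut the cluster down when the steps finish, so you
            # only pay for the hours the job actually ran.
            "KeepJobFlowAliveWhenNoSteps": False,
        },
    }

config = make_job_flow("clickstream-analysis", 100)
# import boto3
# boto3.client("emr").run_job_flow(**config)  # would actually launch
```

The point of the dictionary-sized launch request is the talk's theme: resizing the cluster is a parameter change, not a hardware purchase.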
Elastic MapReduce applications Targeted advertising / Clickstream analysis Security: anti-virus, fraud detection, image recognition Pattern matching / Recommendations Data warehousing / BI Bio-informatics (Genome analysis) Financial simulation (Monte Carlo simulation) File processing (resize jpegs, video encoding) Web indexing
Clickstream Analysis: Big Box Retailer came to Razorfish with 3.5 billion records, 71 million unique cookies, and 1.7 million targeted ads required per day. Problem: improve Return on Ad Spend (ROAS)
Clickstream Analysis Example: a user who recently purchased a sports movie and is searching for video games receives a targeted ad (1.7 million served per day)
Clickstream Analysis: lots of experimentation, but the final design was a 100-node on-demand Elastic MapReduce cluster running Hadoop
Clickstream Analysis: processing time dropped from 2+ days to 8 hours (even with lots more data)
Clickstream Analysis Increased Return On Ad Spend by 500%
Etsy: world's largest handmade marketplace. 8.9 million items, 1 billion page views per month, $320MM 2010 GMS
Data pipeline: web event logs and production DB snapshots feed jobs through ETL Step 1 and ETL Step 2. Easy to backfill and run experiments: just boot up a cluster with 100, 500, or 1,000 nodes
Recommendations The Taste Test http://www.etsy.com/tastetest
Recommendations Gift Ideas for Facebook Friends etsy.com/gifts
Yelp's Business Generates a Lot of Data: 400 GB of logs per day, ~12 terabytes per month
They Frequently Analyze this Data to Power Key Features of their Site
Autocomplete Search
Recommendations
Automatic spelling corrections
Automatic spelling corrections. Let's take a look at how this works
1) Load log file data for six months of user search history into Amazon S3

Search ID   Search Text   Final Selection
12423451    westen        Westin
14235235    wisten        Westin
54332232    westenn       Westin
2) Spin up a 200-node Hadoop cluster of virtual servers in the cloud (Amazon EMR)
3) The 200 nodes simultaneously analyze this data looking for common misspellings; this takes a few hours
4) New common misspellings and suggestions are loaded back into Amazon S3
5) When the job is done, the cluster is shut down; Yelp only pays for the time they used
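The core of the analysis in steps 3 and 4 can be sketched as follows. This is an illustrative reconstruction, not Yelp's actual code: it groups searches by the business the user finally selected, and treats frequent search texts that differ from the selection as misspelling-to-suggestion pairs.

```python
from collections import Counter, defaultdict

def suggest_corrections(search_log, min_count=2):
    """search_log: (search_text, final_selection) pairs, as in the
    table above. Returns {misspelling: suggestion} for search texts
    seen at least min_count times. The threshold is a made-up
    parameter to filter out one-off typos."""
    by_selection = defaultdict(Counter)
    for search_text, final_selection in search_log:
        if search_text.lower() != final_selection.lower():
            by_selection[final_selection][search_text.lower()] += 1
    return {
        typo: selection
        for selection, typos in by_selection.items()
        for typo, n in typos.items()
        if n >= min_count
    }

log = [("westen", "Westin"), ("wisten", "Westin"),
       ("westen", "Westin"), ("westenn", "Westin")]
# suggest_corrections(log) -> {"westen": "Westin"}
```

In the MapReduce version, the map phase would emit (final_selection, search_text) pairs and the reduce phase would do the counting, which is what lets 200 nodes work on six months of logs at once.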
Each of their 80 developers can do this whenever they have a big data problem to analyze: ~250 clusters spun up and down every week
Data size and global reach: native apps for almost every smartphone, plus SMS, web, and mobile web; 10M+ users, 15M+ venues, ~1B check-ins; terabytes of log data
Data Stack: the application stack (Scala/Liftweb API machines, WWW machines, and Scala batch jobs) writes to Mongo/Postgres/flat-file databases and emits logs. Database dumps (via mongoexport and postgres dump) and log files (via Flume) flow into Amazon S3, where Hadoop on Elastic MapReduce runs MapReduce jobs (Hive/Ruby/Mahout) feeding an analytics dashboard
Computing venue-to-venue similarity: spin up a 40-node cluster; submit a Ruby streaming job; invert the user x venue matrix; grab co-occurrences; compute similarity; spin down the cluster; load data to the app server
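The invert/co-occur/similarity steps above can be sketched in a few lines. This is an illustrative single-machine version (the talk's job used Ruby streaming on Hadoop), using cosine similarity over co-check-in counts as one reasonable choice of similarity measure.

```python
import math
from collections import defaultdict
from itertools import combinations

def venue_similarity(checkins):
    """checkins: (user, venue) pairs. Invert to per-user venue sets,
    count venue co-occurrences, then normalize each co-occurrence
    count to a cosine similarity."""
    venues_by_user = defaultdict(set)
    for user, venue in checkins:
        venues_by_user[user].add(venue)   # invert user x venue matrix

    cooc = defaultdict(int)   # (venue_a, venue_b) -> co-check-in count
    count = defaultdict(int)  # venue -> distinct users who checked in
    for venues in venues_by_user.values():
        for v in venues:
            count[v] += 1
        for a, b in combinations(sorted(venues), 2):
            cooc[(a, b)] += 1             # grab co-occurrences

    return {                              # compute similarity
        pair: c / math.sqrt(count[pair[0]] * count[pair[1]])
        for pair, c in cooc.items()
    }

sims = venue_similarity([("u1", "A"), ("u1", "B"), ("u2", "A"),
                         ("u2", "B"), ("u3", "A")])
```

Each stage maps naturally onto a MapReduce pass (invert, co-occur, normalize), which is why the production version runs as a streaming job over terabytes of check-in logs.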
Who is checking in? [Charts: check-in share by gender (female vs. male) and by age (0-80)]
What are people doing?
Where are our users?
When do people go to a place? [Chart: check-ins by day, Thursday through Sunday, for Gorilla Coffee, Gray's Papaya, and Amorino]
Why are people checking in? Explore their city, discover new places Find friends, meet up Save with local deals Get insider tips on venues Personal analytics, diary Follow brands and celebrities Earn points, badges, gamification of life The list grows
Thousands of customers are using EMR
RDBMS vs. MapReduce/Hadoop

RDBMS: predefined schema; strategic data placement for query tuning; exploits indexes for fast retrieval; SQL only; doesn't scale linearly.

MapReduce/Hadoop: no schema required; random data placement; fast scan of the entire dataset; uniform query performance; scales linearly for reads and writes; supports many languages, including SQL.

Complementary technologies
AWS Data Warehousing Architecture
Elastic Data Warehouse: customize cluster size to support varying resource needs (e.g., query support during the day versus batch processing overnight); reduce costs by increasing server utilization; improve performance during high-usage periods. Example: the data warehouse runs at a steady state, expands to 25 instances for batch processing, then shrinks back to 9 instances
Reducing Costs with Spot Instances: mix Spot and On-Demand instances to reduce cost and accelerate computation while protecting against interruption.

Scenario #1 (On-Demand only), job flow duration 14 hours: 4 instances x 14 hrs x $0.50 = $28.
Scenario #2 (adding Spot), job flow duration 7 hours: 4 On-Demand instances x 7 hrs x $0.50 = $14, plus 5 Spot instances x 7 hrs x $0.25 = $8.75; total = $22.75.
Time savings: 50%. Cost savings: ~19%.

Other EMR + Spot use cases: run the entire cluster on Spot for the biggest cost savings; reduce the cost of application testing
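The cost arithmetic is easy to check in a few lines; note that 4 instances x 7 hrs x $0.50 is $14, so the mixed scenario totals $22.75 against $28 On-Demand, roughly 19% cheaper while finishing in half the time. A small sketch (prices are the slide's example rates, not current AWS pricing):

```python
def cluster_cost(hours, on_demand=(0, 0.0), spot=(0, 0.0)):
    """Cost of a job flow given (instance_count, hourly_rate) pairs
    for On-Demand and Spot capacity, billed for `hours` of runtime."""
    n_od, rate_od = on_demand
    n_spot, rate_spot = spot
    return hours * (n_od * rate_od + n_spot * rate_spot)

# Scenario #1: On-Demand only, 14-hour job flow.
scenario_1 = cluster_cost(14, on_demand=(4, 0.50))            # $28.00
# Scenario #2: add 5 Spot instances; the job finishes in 7 hours.
scenario_2 = cluster_cost(7, on_demand=(4, 0.50), spot=(5, 0.25))
savings = 1 - scenario_2 / scenario_1                          # ~0.19
```

The asymmetry is the point: Spot capacity can disappear, but the On-Demand core keeps the job flow alive, so you trade a bounded interruption risk for both speed and cost.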
Big Data Ecosystem and Tools: we have a rapidly growing ecosystem. Business intelligence: MicroStrategy, Pentaho. Analytics: Datameer, Karmasphere, Quest. Open source: Ganglia, SQuirreL SQL
Thank You!! http://aws.amazon.com/elasticmapreduce/