Analyzing Big Data with AWS

Analyzing Big Data with AWS Peter Sirota, General Manager, Amazon Elastic MapReduce @petersirota

What is Big Data?

Computer-generated data
- Application server logs (web sites, games)
- Sensor data (weather, water, smart grids)
- Images/videos (traffic, security cameras)

Human-generated data
- Twitter Firehose (50 million tweets/day, 1,400% growth per year)
- Blogs, reviews, emails, pictures
- Social graphs: Facebook, LinkedIn, contacts

Big Data is full of valuable, unanswered questions!

Why is Big Data Hard (and Getting Harder)?

Data volume
- Unconstrained growth
- Current systems don't scale

Data structure
- Need to consolidate data from multiple data sources, in multiple formats, across multiple businesses

Changing data requirements
- Faster response time on fresher data
- Sampling is not good enough
- Increasing complexity of analytics
- Users demand inexpensive experimentation

We need tools built specifically for Big Data!

Innovation #1: Apache Hadoop
- The MapReduce computational paradigm
- Open source, scalable, fault-tolerant, distributed system
- Hadoop lowers the cost of developing a distributed system for data processing

Innovation #2: Amazon Elastic Compute Cloud (EC2)
- Provides resizable compute capacity in the cloud
- Amazon EC2 lowers the cost of operating a distributed system for data processing
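As a rough illustration of the MapReduce paradigm named above, here is a minimal single-process sketch in Python. It is not Hadoop code: the function names (`map_phase`, `shuffle`, `reduce_phase`, `run_job`) and the sample log lines are assumptions made for this example. It only shows the map, group-by-key, and reduce steps that Hadoop distributes across many machines.

```python
from collections import defaultdict
from itertools import chain


def map_phase(record):
    """Map: emit a (key, 1) pair for every token in one input record."""
    for token in record.lower().split():
        yield token, 1


def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine all values for one key into a single result."""
    return key, sum(values)


def run_job(records):
    """Run map, shuffle, and reduce in sequence on an in-memory list."""
    pairs = chain.from_iterable(map_phase(r) for r in records)
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


if __name__ == "__main__":
    # Hypothetical log lines, used only to exercise the sketch.
    print(run_job(["GET /home", "GET /cart", "POST /cart"]))
```

In a real cluster each map call and each reduce call can run on a different node, which is why the paradigm scales out by adding machines.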

Amazon Elastic MapReduce = Amazon EC2 + Hadoop

Elastic MapReduce applications
- Targeted advertising / clickstream analysis
- Security: anti-virus, fraud detection, image recognition
- Pattern matching / recommendations
- Data warehousing / BI
- Bioinformatics (genome analysis)
- Financial simulation (Monte Carlo simulation)
- File processing (resizing JPEGs, video encoding)
- Web indexing

Clickstream Analysis
- A big-box retailer came to Razorfish
- 3.5 billion records
- 71 million unique cookies
- 1.7 million targeted ads required per day
- Problem: improve Return on Ad Spend (ROAS)
- Example: a user recently purchased a sports movie and is searching for video games, so they are shown a targeted ad (1.7 million per day)
- Lots of experimentation, but the final design was a 100-node on-demand Elastic MapReduce cluster running Hadoop
- Processing time dropped from 2+ days to 8 hours (with lots more data)
- Return on Ad Spend increased by 500%

Etsy: world's largest handmade marketplace
- 8.9 million items
- 1 billion page views per month
- $320 million in gross merchandise sales (GMS) in 2010

[Diagram: web event logs, production DB snapshots, ETL Step 1, ETL Step 2, jobs]
Easy to backfill and run experiments: just boot up a cluster with 100, 500, or 1000 nodes

Recommendations The Taste Test http://www.etsy.com/tastetest

Recommendations Gift Ideas for Facebook Friends etsy.com/gifts

Yelp's Business Generates a Lot of Data
- 400 GB of logs per day
- ~12 terabytes per month

They Frequently Analyze this Data to Power Key Features of their Site

Autocomplete Search

Recommendations

Automatic spelling corrections

Let's take a look at how this works

1) Load log file data for six months of user search history into Amazon S3

Search ID   Search Text   Final Selection
12423451    westen        Westin
14235235    wisten        Westin
54332232    westenn       Westin

2) Spin up a 200-node Hadoop cluster of virtual servers in the cloud with Amazon EMR

3) The 200 nodes simultaneously analyze this data, looking for common misspellings; this takes a few hours

4) New common misspellings and suggestions are loaded back into Amazon S3

5) When the job is done, the cluster is shut down. Yelp pays only for the time it used.
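The analysis in step 3 could look roughly like the sketch below. This is not Yelp's actual job; it is a minimal in-memory illustration that assumes each log record is a `(search_id, search_text, final_selection)` tuple, as in the sample table. The function names and the `min_count` threshold are hypothetical. The map step emits a count for each search whose typed text differs from the final selection, and the reduce step sums those counts and keeps the most frequent selection per misspelling.

```python
from collections import Counter


def map_record(record):
    """Map: emit ((typed_text, final_selection), 1) when the two differ."""
    _search_id, search_text, final_selection = record
    typed = search_text.strip().lower()
    if typed != final_selection.strip().lower():
        yield (typed, final_selection), 1


def reduce_counts(pairs):
    """Reduce: sum the counts for each (typed_text, final_selection) key."""
    counts = Counter()
    for key, value in pairs:
        counts[key] += value
    return counts


def suggestions(records, min_count=1):
    """Return {misspelling: most frequent final selection}."""
    pairs = (pair for rec in records for pair in map_record(rec))
    best = {}
    for (typed, selection), n in reduce_counts(pairs).items():
        # Keep a candidate only if it is frequent enough and beats the
        # current best selection for the same misspelling.
        if n >= min_count and n > best.get(typed, (None, 0))[1]:
            best[typed] = (selection, n)
    return {typed: selection for typed, (selection, _n) in best.items()}


if __name__ == "__main__":
    rows = [
        (12423451, "westen", "Westin"),
        (14235235, "wisten", "Westin"),
        (54332232, "westenn", "Westin"),
    ]
    print(suggestions(rows))
```

On a cluster, the map step would run in parallel over log files read from Amazon S3, and the resulting suggestions would be written back to S3 as in step 4.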

Each of their 80 developers can do this whenever they have a big data problem to analyze. 250 clusters are spun up and down every week.

Data size
- Global reach
- Native app for almost every smartphone, SMS, web, mobile web
- 10M+ users, 15M+ venues, ~1B check-ins
- Terabytes of log data

Data Stack
[Diagram]
- Application stack: Scala/Liftweb; API machines, WWW machines, batch jobs; Scala application code
- Databases and logs: Mongo/Postgres/flat files
- Export: mongoexport, postgres dump, Flume
- Amazon S3: database dumps, log files
- Hadoop on Elastic MapReduce: Hive/Ruby/Mahout
- Outputs: analytics dashboard, MapReduce jobs

Computing venue-to-venue similarity
- Spin up a 40-node cluster
- Submit Ruby streaming job
- Invert User x Venue matrix
- Grab co-occurrences
- Compute similarity
- Spin down cluster
- Load data to app server
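The middle three steps (invert the matrix, grab co-occurrences, compute similarity) can be sketched in a few lines of Python. The original job was a Ruby streaming job on a cluster; this single-process version is only an illustration. The slide does not say which similarity measure was used, so cosine similarity over binary check-in vectors is an assumption here, as are the function name `venue_similarity` and the input format of `(user, venue)` pairs.

```python
from collections import Counter, defaultdict
from itertools import combinations
from math import sqrt


def venue_similarity(checkins):
    """Return {(venue_a, venue_b): similarity} for venues sharing users."""
    venues_by_user = defaultdict(set)   # user -> venues (User x Venue rows)
    users_by_venue = defaultdict(set)   # venue -> users (inverted view)
    for user, venue in checkins:
        venues_by_user[user].add(venue)
        users_by_venue[venue].add(user)

    # Co-occurrences: for each user, count every pair of venues visited.
    cooccur = Counter()
    for venues in venues_by_user.values():
        for a, b in combinations(sorted(venues), 2):
            cooccur[(a, b)] += 1

    # Assumed measure: cosine similarity on binary visit vectors.
    return {
        (a, b): shared / sqrt(len(users_by_venue[a]) * len(users_by_venue[b]))
        for (a, b), shared in cooccur.items()
    }


if __name__ == "__main__":
    # Hypothetical check-ins, used only to exercise the sketch.
    sample = [
        ("u1", "cafe"), ("u1", "bar"),
        ("u2", "cafe"), ("u2", "bar"),
        ("u3", "cafe"), ("u3", "park"),
    ]
    for pair, score in sorted(venue_similarity(sample).items()):
        print(pair, round(score, 3))
```

Pairs of venues with no shared users never appear in the output, which keeps the result sparse.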

Who is checking in?
[Charts: distribution of check-ins by gender (female, male) and by age (0 to 80)]

What are people doing?

Where are our users?

When do people go to a place?
[Chart: check-ins from Thursday through Sunday for Gorilla Coffee, Gray's Papaya, and Amorino]

Why are people checking in?
- Explore their city, discover new places
- Find friends, meet up
- Save with local deals
- Get insider tips on venues
- Personal analytics, diary
- Follow brands and celebrities
- Earn points, badges, gamification of life
- The list grows

Thousands of customers using EMR

RDBMS vs. MapReduce/Hadoop

RDBMS
- Predefined schema
- Strategic data placement for query tuning
- Exploits indexes for fast retrieval
- SQL only
- Doesn't scale linearly

MapReduce/Hadoop
- No schema is required
- Random data placement
- Fast scan of the entire dataset
- Uniform query performance
- Scales linearly for reads and writes
- Supports many languages, including SQL

Complementary technologies

AWS Data Warehousing Architecture

Elastic Data Warehouse
- Customize cluster size to support varying resource needs (e.g. query support during the day versus batch processing overnight)
- Reduce costs by increasing server utilization
- Improve performance during high-usage periods
[Diagram: Data Warehouse (Steady State), expand to 25 instances for Data Warehouse (Batch Processing), shrink to 9 instances for Data Warehouse (Steady State)]

Reducing Costs with Spot Instances
Mix Spot and On-Demand instances to reduce cost and accelerate computation while protecting against interruption.

Scenario #1 (without Spot): job flow duration 14 hours
  4 instances * 14 hrs * $0.50 = $28
Scenario #2 (with Spot): job flow duration 7 hours
  4 instances * 7 hrs * $0.50 = $14
  + 5 instances * 7 hrs * $0.25 = $8.75
  Total = $22.75
Time savings: 50%
Cost savings: ~19%

Other EMR + Spot use cases
- Run entire cluster on Spot for biggest cost savings
- Reduce the cost of application testing

Big Data Ecosystem and Tools
We have a rapidly growing ecosystem:
- Business intelligence: MicroStrategy, Pentaho
- Analytics: Datameer, Karmasphere, Quest
- Open source: Ganglia, Squirrel SQL
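The two scenarios can be recomputed from the stated instance counts, durations, and hourly rates. The helper name `job_cost` is an assumption made for this sketch, and the rates are the illustrative ones from the slide, not current AWS prices. With these inputs the mixed scenario comes to $22.75, a cost saving of about 19% alongside the 50% time saving.

```python
def job_cost(fleets, hours):
    """Total cost in USD for a job flow.

    fleets: list of (instance_count, hourly_rate_usd) tuples.
    hours:  job flow duration in hours.
    """
    return sum(count * rate * hours for count, rate in fleets)


# Scenario #1: 4 On-Demand instances for 14 hours.
on_demand_only = job_cost([(4, 0.50)], hours=14)

# Scenario #2: 4 On-Demand plus 5 Spot instances for 7 hours.
mixed = job_cost([(4, 0.50), (5, 0.25)], hours=7)

time_savings = 1 - 7 / 14
cost_savings = 1 - mixed / on_demand_only

if __name__ == "__main__":
    print(f"Without Spot: ${on_demand_only:.2f}")
    print(f"With Spot:    ${mixed:.2f}")
    print(f"Time savings: {time_savings:.0%}")
    print(f"Cost savings: {cost_savings:.1%}")
```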

Thank You!! http://aws.amazon.com/elasticmapreduce/