Analytics in the Cloud. Peter Sirota, GM Elastic MapReduce




Data-Driven Decision Making
Data is the new raw material for any business, on par with capital, people, and labor.

What is Big Data?
- Terabytes of semi-structured log data in which businesses want to:
  - find correlations and perform pattern matching
  - generate recommendations
  - calculate advanced statistics (e.g., TP99)
- Twitter Firehose
  - 50 million tweets per day
  - 1,400% growth per year
  - How can advertisers drink from it?
- Social graphs
  - Value increases with exponential growth in data connections
Big Data is full of valuable, unanswered questions!
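As an illustration of the "advanced statistics" item, TP99 is commonly read as the 99th percentile of a measurement such as request latency. The sketch below uses the nearest-rank convention, which is only one of several percentile definitions; the function name and the sample data are hypothetical and do not come from the presentation.

```python
import math


def percentile_nearest_rank(values, pct):
    """Return the pct-th percentile (integer 1..100) using the nearest-rank method."""
    if not values:
        raise ValueError("values must not be empty")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in the range 1..100")
    ordered = sorted(values)
    # 1-based rank of the smallest value that covers pct percent of the data.
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[rank - 1]


# Hypothetical sample: 200 request latencies of 1 ms, 2 ms, ..., 200 ms.
latencies_ms = list(range(1, 201))
tp99 = percentile_nearest_rank(latencies_ms, 99)
print(f"TP99 latency: {tp99} ms")
```

On terabytes of logs the sort would not fit on one machine, which is the kind of workload the rest of the deck hands to Hadoop.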

Why is Big Data Hard (and Getting Harder)?
- Today's data warehouses
  - Need to consolidate from multiple data sources in multiple formats across multiple businesses
  - Unconstrained growth of this business-critical information
- Today's users
  - Expect faster response time of fresher data
  - Sampling is not good enough and history is important
  - Demand inexpensive experimentation with new data
  - Become increasingly sophisticated Data Scientists
- Current systems don't scale (and weren't meant to)
  - Long time to provision more infrastructure
  - Specialized DB expertise required
  - Expensive and inelastic solutions
We need tools built specifically for Big Data!

What is this thing called Hadoop?
- Dealing with Big Data requires two things:
  - Distributed, scalable storage
  - Inexpensive, flexible analytics
- Apache Hadoop is an open source software platform that addresses both of these needs
  - Includes a fault-tolerant, distributed storage system (HDFS) developed for commodity servers
  - Uses a technique called MapReduce to carry out exhaustive analysis over huge distributed data sets
- Key benefits
  - Affordable: cost per TB is a fraction of traditional options
  - Proven at scale: numerous petabyte implementations in production; linear scalability
  - Flexible: data can be stored with or without schema
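The MapReduce technique mentioned above can be sketched in a few lines. This is a single-process illustration of the programming model only (map, group by key, reduce), not the Hadoop API; every function name and the sample log lines are hypothetical.

```python
from collections import defaultdict


def map_phase(record):
    """Emit a (key, 1) pair for every whitespace-separated token in a record."""
    for token in record.lower().split():
        yield token, 1


def shuffle(pairs):
    """Group emitted values by key, as the framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Collapse all values for one key into a single count."""
    return key, sum(values)


def run_job(records):
    """Run map, shuffle, and reduce over an in-memory list of records."""
    pairs = (pair for record in records for pair in map_phase(record))
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())


# Hypothetical log lines standing in for a distributed data set.
logs = ["GET /index", "GET /about", "POST /index"]
print(run_job(logs))
```

In Hadoop the map and reduce calls run in parallel on the nodes that hold the data, so the same two functions scale from this toy list to a full scan of a large data set.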

RDBMS vs. MapReduce/Hadoop
- RDBMS
  - Predefined schema
  - Strategic data placement for query tuning
  - Exploits indexes for fast retrieval
  - SQL only
  - Doesn't scale linearly
- MapReduce/Hadoop
  - No schema is required
  - Random data placement
  - Fast scan of the entire dataset
  - Uniform query performance
  - Linearly scales for reads and writes
  - Supports many languages including SQL
Complementary technologies

Why Amazon Elastic MapReduce?
- Managed Apache Hadoop web service
  - Monitor thousands of clusters per day
  - Use cases span from university students to Fortune 50
- Reduces complexity of Hadoop management
  - Handles node provisioning, customization, and shutdown
  - Tunes Hadoop to your hardware and network
  - Provides tools to debug and monitor your Hadoop clusters
- Provides tight integration with AWS services
  - Improved performance working with S3
  - Automatic re-provisioning on node failure
  - Dynamic expanding/shrinking of cluster size
  - Spot integration

Elastic MapReduce Key Features
- Simplified Cluster Configuration/Management
  - Resize running job flows
  - Support for EIP/IAM/Tagging
  - Workload-specific configurations
  - Bootstrap Actions
- Enhanced Monitoring/Debugging
  - Free CloudWatch Metrics / Alarms
  - Hadoop Metrics in Console
  - Ganglia Support
- Improved Performance
  - S3 Multipart Upload
  - Cluster Compute Instances

Analytics Use Cases
- Targeted advertising / clickstream analysis
- Data warehousing applications
- Bioinformatics (genome analysis)
- Financial simulation (Monte Carlo simulation)
- File processing (resize JPEGs)
- Web indexing
- Data mining and BI

Apache Hive: Data Warehouse for Hadoop
- Open source project started at Facebook
- Turns data on Hadoop into a virtually limitless data warehouse
- Provides data summarization, ad hoc querying, and analysis
- Enables SQL-like queries on structured and unstructured data
  - E.g., arbitrary field separators are possible, such as "," in CSV file formats
- Inherits linear scalability of Hadoop

AWS Data Warehousing Architecture

Elastic Data Warehouse
- Customize cluster size to support varying resource needs (e.g., query support during the day versus batch processing overnight)
- Reduce costs by increasing server utilization
- Improve performance during high usage periods
[Diagram: Data Warehouse (Steady State) -> expand to 25 instances -> Data Warehouse (Batch Processing) -> shrink to 9 instances -> Data Warehouse (Steady State)]
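The utilization argument can be made concrete with a rough instance-hour comparison. The 25- and 9-instance cluster sizes come from the slide; the mapping of 9 instances to steady state and 25 to batch processing, and the 8-hour overnight batch window, are assumptions chosen only for illustration.

```python
def instance_hours(schedule):
    """Sum instance-hours over (instance_count, hours) phases of one day."""
    return sum(count * hours for count, hours in schedule)


# Assumption (not from the slide): batch processing runs 8 hours overnight.
BATCH_HOURS = 8
STEADY_HOURS = 24 - BATCH_HOURS

# A fixed cluster sized for the batch peak runs 25 instances all day.
fixed = instance_hours([(25, 24)])

# An elastic cluster runs 9 instances in steady state and 25 during batch.
elastic = instance_hours([(9, STEADY_HOURS), (25, BATCH_HOURS)])

reduction = (fixed - elastic) / fixed
print(f"fixed: {fixed} instance-hours, elastic: {elastic} instance-hours")
print(f"reduction: {reduction:.0%}")
```

A different batch window changes the size of the reduction, but the fixed cluster always pays for peak capacity during the hours it is not needed.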

Reducing Costs with Spot Instances
Mix Spot and On-Demand instances to reduce cost and accelerate computation while protecting against interruption
- Scenario #1 (without Spot): job flow duration 14 hours
  - Cost: 4 instances * 14 hrs * $0.50 = $28
- Scenario #2 (with Spot): job flow duration 7 hours
  - Cost: 4 instances * 7 hrs * $0.50 = $14, plus 5 instances * 7 hrs * $0.25 = $8.75
  - Total = $22.75
- Time savings: 50%
- Cost savings: ~19%
- Other EMR + Spot use cases
  - Run entire cluster on Spot for biggest cost savings
  - Reduce the cost of application testing
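The scenario arithmetic can be reproduced with a short calculation. The sketch uses only the instance counts, durations, and hourly prices quoted in the slide; the function name is hypothetical.

```python
def job_flow_cost(groups):
    """Sum the cost of (instance_count, hours, hourly_price) instance groups."""
    return sum(count * hours * price for count, hours, price in groups)


# Scenario #1: 4 On-Demand instances for 14 hours at $0.50 per hour.
without_spot = job_flow_cost([(4, 14, 0.50)])

# Scenario #2: the same 4 On-Demand instances plus 5 Spot instances
# at $0.25 per hour, finishing in 7 hours.
with_spot = job_flow_cost([(4, 7, 0.50), (5, 7, 0.25)])

time_savings = (14 - 7) / 14
cost_savings = (without_spot - with_spot) / without_spot

print(f"without Spot: ${without_spot:.2f}")
print(f"with Spot:    ${with_spot:.2f}")
print(f"time savings: {time_savings:.0%}")
print(f"cost savings: {cost_savings:.2%}")
```

The On-Demand instances keep the job flow running if the Spot capacity is interrupted, which is why the mixed cluster is the baseline case rather than an all-Spot cluster.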

Monitoring Clusters with CloudWatch
- Free CloudWatch Metrics and Alarms
- Track Hadoop job progress
- Alarm on degradations in cluster health
- Monitor aggregate Elastic MapReduce usage

Big Data Ecosystem and Tools
We have a rapidly growing ecosystem and will continue to integrate with a wide range of partners. Some examples:
- Business intelligence: MicroStrategy, Pentaho
- Analytics: Datameer, Karmasphere, Quest
- Open source: Ganglia, SQuirreL SQL

Resources
- Amazon Elastic MapReduce: aws.amazon.com/elasticmapreduce
- aws.amazon.com/articles/elastic-mapreduce
- forums.aws.amazon.com/forum.jspa?forumid=52