Hadoop & Spark Using Amazon EMR

Similar documents

Lambda Architecture for Batch and Real- Time Processing on AWS with Spark Streaming and Spark SQL. May 2015

Real Time Big Data Processing

SAS BIG DATA SOLUTIONS ON AWS SAS FORUM ESPAÑA, OCTOBER 16 TH, 2014 IAN MEYERS SOLUTIONS ARCHITECT / AMAZON WEB SERVICES

Big Data on AWS. Services Overview. Bernie Nallamotu Principle Solutions Architect

CAPTURING & PROCESSING REAL-TIME DATA ON AWS

Hadoop Ecosystem B Y R A H I M A.

Cloudera Enterprise Reference Architecture for Google Cloud Platform Deployments

Background on Elastic Compute Cloud (EC2) AMI s to choose from including servers hosted on different Linux distros

Unified Big Data Processing with Apache Spark. Matei

HADOOP BIG DATA DEVELOPER TRAINING AGENDA

Professional Hadoop Solutions

How to Leverage Cloud to Quickly Build Scalable Applications

Native Connectivity to Big Data Sources in MSTR 10

Big Data Management and Security

Building your Big Data Architecture on Amazon Web Services

Big Data With Hadoop

Upcoming Announcements

Ali Ghodsi Head of PM and Engineering Databricks

Analytics in the Cloud. Peter Sirota, GM Elastic MapReduce

Implement Hadoop jobs to extract business value from large and varied data sets

Infomatics. Big-Data and Hadoop Developer Training with Oracle WDP

Programming Hadoop 5-day, instructor-led BD-106. MapReduce Overview. Hadoop Overview

AWS Key Management Service. Developer Guide

Big Data and Market Surveillance. April 28, 2014

Introduction to Hadoop. New York Oracle User Group Vikas Sawhney

Spark in Action. Fast Big Data Analytics using Scala. Matei Zaharia. project.org. University of California, Berkeley UC BERKELEY

Hadoop Ecosystem Overview. CMSC 491 Hadoop-Based Distributed Computing Spring 2015 Adam Shook

Moving From Hadoop to Spark

CSE 590: Special Topics Course ( Supercomputing ) Lecture 10 ( MapReduce& Hadoop)

COURSE CONTENT Big Data and Hadoop Training

Encrypting Data at Rest

Financial Services Grid Computing on Amazon Web Services January 2013 Ian Meyers

SOLVING REAL AND BIG (DATA) PROBLEMS USING HADOOP. Eva Andreasson Cloudera

Capitalize on Big Data for Competitive Advantage with Bedrock TM, an integrated Management Platform for Hadoop Data Lakes

Scalability in the Cloud HPC Convergence with Big Data in Design, Engineering, Manufacturing

Amazon EC2 Product Details Page 1 of 5

Oracle Big Data SQL Technical Update

How To Use Aws.Com

Copyright 2012, Oracle and/or its affiliates. All rights reserved.

Cloudera Enterprise Reference Architecture for Google Cloud Platform Deployments

Chase Wu New Jersey Ins0tute of Technology

Apache Hadoop: Past, Present, and Future

Lambda Architecture. Near Real-Time Big Data Analytics Using Hadoop. January Website:

The Flink Big Data Analytics Platform. Marton Balassi, Gyula Fora" {mbalassi,

Hadoop IST 734 SS CHUNG

Introduction to AWS in Higher Ed

A Comparison of Clouds: Amazon Web Services, Windows Azure, Google Cloud Platform, VMWare and Others (Fall 2012)

The Top 10 7 Hadoop Patterns and Anti-patterns. Alex

Luncheon Webinar Series May 13, 2013

Big Data and Industrial Internet

Dell In-Memory Appliance for Cloudera Enterprise

Lecture 32 Big Data. 1. Big Data problem 2. Why the excitement about big data 3. What is MapReduce 4. What is Hadoop 5. Get started with Hadoop

Deploying Hadoop with Manager

Unified Big Data Analytics Pipeline. 连城

Hadoop Evolution In Organizations. Mark Vervuurt Cluster Data Science & Analytics

Big Data Approaches. Making Sense of Big Data. Ian Crosland. Jan 2016

Getting Started with Hadoop. Raanan Dagan Paul Tibaldi

Prepared By : Manoj Kumar Joshi & Vikas Sawhney

Data processing goes big

Introduction to Hadoop HDFS and Ecosystems. Slides credits: Cloudera Academic Partners Program & Prof. De Liu, MSBA 6330 Harvesting Big Data

Architectural patterns for building real time applications with Apache HBase. Andrew Purtell Committer and PMC, Apache HBase

Apache Hadoop. Alexandru Costan

HADOOP ADMINISTATION AND DEVELOPMENT TRAINING CURRICULUM

Savanna Hadoop on. OpenStack. Savanna Technical Lead

Big Data Primer. 1 Why Big Data? Alex Sverdlov alex@theparticle.com

Alfresco Enterprise on AWS: Reference Architecture

Amazon Web Services Annual ALGIM Conference. Tim Dacombe-Bird Regional Sales Manager Amazon Web Services New Zealand

Datenverwaltung im Wandel - Building an Enterprise Data Hub with

Virtualizing Apache Hadoop. June, 2012

Cloud Models and Platforms

Developing Scalable Smart Grid Infrastructure to Enable Secure Transmission System Control

Chukwa, Hadoop subproject, 37, 131 Cloud enabled big data, 4 Codd s 12 rules, 1 Column-oriented databases, 18, 52 Compression pattern, 83 84

Extending Hadoop beyond MapReduce

Oracle Big Data Fundamentals Ed 1 NEW

The Inside Scoop on Hadoop

Building 1000 node cluster on EMR Manjeet Chayel

Introduction to Big Data! with Apache Spark" UC#BERKELEY#

Intel HPC Distribution for Apache Hadoop* Software including Intel Enterprise Edition for Lustre* Software. SC13, November, 2013

The Future of Data Management

CitusDB Architecture for Real-Time Big Data

Map Reduce & Hadoop Recommended Text:

How Companies are! Using Spark

A REVIEW PAPER ON THE HADOOP DISTRIBUTED FILE SYSTEM

How To Create A Data Visualization With Apache Spark And Zeppelin

Scaling Out With Apache Spark. DTL Meeting Slides based on

Constructing a Data Lake: Hadoop and Oracle Database United!

Hadoop: Embracing future hardware

In-memory data pipeline and warehouse at scale using Spark, Spark SQL, Tachyon and Parquet

Amazon Cloud Storage Options

Overview. Big Data in Apache Hadoop. - HDFS - MapReduce in Hadoop - YARN. Big Data Management and Analytics

Why Big Data in the Cloud?

The Internet of Things and Big Data: Intro

Amazon Elastic Beanstalk

PEPPERDATA OVERVIEW AND DIFFERENTIATORS

t] open source Hadoop Beginner's Guide ij$ data avalanche Garry Turkington Learn how to crunch big data to extract meaning from

Collaborative Big Data Analytics. Copyright 2012 EMC Corporation. All rights reserved.

Transcription:

Hadoop & Spark Using Amazon EMR Michael Hanisch, AWS Solutions Architecture 2015, Amazon Web Services, Inc. or its Affiliates. All rights reserved.

Agenda Why did we build Amazon EMR? What is Amazon EMR? How do I get started using Amazon EMR? Q&A

Why did we build Amazon EMR?

Ingest Store Process Visualize Amazon RDS Amazon Kinesis Amazon EMR Amazon Machine Learning Amazon Mobile Analytics Amazon DynamoDB Amazon CloudSearch Amazon Redshift AWS Data Pipeline AWS Import/ Export Amazon S3 Amazon Glacier Amazon Lambda Amazon EC2

Hadoop: The Framework for Big Data Processing Can process huge amounts of data Scalable and fault tolerant Respects multiple data formats Supports an ecosystem of tools with a robust community Executes batch and real-time analytics

Common Customer Challenges Must purchase, provision, deploy and manage the infrastructure for Hadoop Deploying, configuring and administering the Hadoop clusters can be difficult, expensive and time-consuming

What is Amazon Elastic MapReduce?

Amazon EMR Benefits Simplifies Big Data Processing on a Managed Hadoop Framework Low Cost Pay an hourly rate for every instance hour you use Reliable Spend less time tuning and monitoring your cluster Elastic Easily add or remove capacity Secure Managed firewall settings Easy to use Launch a cluster in minutes Flexible You control every cluster

Amazon EMR is Elastic Dynamically change number of compute resources Provision one or thousands of instances to Process Data at any scale Use CloudWatch to alert for scaling needs

Easy to Use Launch a cluster in minutes through console EMR sets up your cluster, provisions nodes, and configures Hadoop so you can focus on your analysis

Low Cost Low Hourly Pricing for On-demand Instance Hours Name your Price with Amazon EC2 Spot Instance Integration Use EC2 Reserved Instances to reserve capacity and further lower your costs

Reliable Monitors cluster health Retries failed tasks Automatically replaces under-performing instances

Secure Automatically configures firewall settings to control network access Managed security groups for instances or specify your own Launch in the Amazon Virtual Private Cloud (VPC) Support for server-side & client-side encryption

Flexible Completely control your cluster including customizations Use data sources from Amazon S3, HDFS, Redshift, RDS, Kinesis and DynamoDB Supports Apache & MapR Hadoop distributions Runs popular Hadoop tools and frameworks (Hive, Hbase, Pig, Mahout, Spark, Presto)

How do I get started using Amazon EMR?

Easy to start AWS Management Console Command Line Or use the Amazon EMR API with your favorite SDK.

How to use Amazon EMR 1. Upload your application and data to S3 App & Data Amazon S3 2. Configure your cluster: Choose Hadoop distribution, number and type of nodes, applications (Hive/ Pig/Hbase) 3. Launch your cluster using the console, CLI, SDK, or APIs Amazon EMR 4. Retrieve your output results from S3

Configuration & Customization

Choose your instance types Try different configurations to find your optimal architecture. General m1 family m3 family CPU c3 family cc1.4xlarge cc2.8xlarge Memory m2 family r3 family Disk/IO d2 family i2 family Batch Machine Spark and Large process learning interactive HDFS

Resizable clusters Easy to add and remove compute capacity on your cluster. Match compute demands with cluster sizing.

Core Nodes Run TaskTrackers (Compute) Run DataNode (HDFS) Amazon EMR cluster Master instance group HDFS HDFS Core instance group

Core Nodes Can add core nodes More HDFS space More CPU/mem Amazon EMR cluster Master instance group HDFS HDFS HDFS Core instance group

Amazon EMR Task Nodes Run TaskTrackers No HDFS Reads from core node HDFS Amazon EMR cluster Master instance group HDFS HDFS Core instance group Task instance group

Amazon EMR Task Nodes Can Add Task Nodes Amazon EMR cluster Master instance group HDFS HDFS Core instance group Task instance group

Amazon EMR Task Nodes More CPU power More memory More network throughput Amazon EMR cluster Master instance group HDFS HDFS Core instance group Task instance group

Amazon EMR Task Nodes You can remove Task Nodes Amazon EMR cluster Master instance group HDFS HDFS Core instance group Task instance group

Amazon EMR Task Nodes Save Cost when Processing power is not required Amazon EMR cluster Master instance group HDFS HDFS Core instance group Task instance group

Easy to use Spot Instances Meet SLA at predictable cost Exceed SLA at lower cost On-demand for core nodes Standard Amazon EC2 pricing for on-demand capacity Spot Instances for task nodes Up to 86% lower on average off on-demand pricing

Serial Processing COST: 4h x $2.1 = $8.4 RENDERING TIME: 4h

1 Machine for 10 Hours 10 Machines for 1 Hour

Parallel Processing COST: 2 x 2h x $2.1 = $8.4 RENDERING TIME:

Embarrassingly Parallel Processing COST: 4 x 1h x $2.1 = $8.4 RENDERING TIME:

Pick the right version

Pick your Hadoop Version & Applications EMR < 4.x: - AMI version had fixed Hadoop version, Hive version etc. - Additional applications via Bootstrap Actions Lots of different ones available from EMR & on github - Configuration via Bootstrap Actions https://github.com/awslabs/emr-bootstrap-actions

Pick your Hadoop Version & Applications Starting with EMR 4.x: - Release tags are independent of AMIs - Pick from supported applications directly Based on Apache BigTop - Configuration objects unified JSON syntax, mapped to configuration files Recommendation: Use EMR 4.x by default!

Transient or persistent cluster?

Transient or persistent cluster? Storing data on Amazon S3 give you more flexibility to run & shutdown Hadoop clusters when you need to.

Amazon S3 as your persistent data store Amazon S3 Designed for 99.999999999% durability Separate compute and storage Resize and shut down Amazon EMR clusters with no data loss Point multiple Amazon EMR clusters at same data in Amazon S3

Amazon EMR example #1: Batch processing GBs of logs pushed to Amazon S3 hourly Input and output stored in Amazon S3 Daily Amazon EMR cluster using Hive to process data 250 Amazon EMR jobs per day, processing 30 TB of data http://aws.amazon.com/solutions/case-studies/yelp/

Amazon EMR example #2: Long-running cluster Data pushed to Amazon S3 Daily Amazon EMR cluster Extract, Transform, and Load (ETL) data into database 24/7 Amazon EMR cluster running HBase holds last 2 years worth of data Front-end service uses HBase cluster to power dashboard with high concurrency

Amazon EMR example #3: Interactive query TBs of logs sent daily Logs stored in Amazon S3 Amazon EMR cluster using Presto for ad hoc analysis of entire log set Interactive query using Presto on multipetabyte warehouse http://techblog.netflix.com/2014/10/using-presto-in-our-bigdata-platform.html

Steps & Clusters

Applications

Vibrant Ecosystem

MR job MapReduce HDFS Amazon S3 Local filesystem

MR job MapReduce v2: YARN HDFS Amazon S3 Local filesystem

MR job MapReduce v2: YARN HDFS Amazon S3 Local filesystem

Cloudera Impala MR job MapReduce v2: YARN HDFS Amazon S3 Local filesystem

Presto MR job MapReduce v2: YARN HDFS Amazon S3 Local filesystem

M Drill MR job HDFS? Amazon S3 Lo

Amazon S3 and HDFS

Amazon S3 has near linear scalability 34 secs per terabyte S3 Streaming Performance 100 VMs; 9.6GB/s; $26/hr 350 VMs; 28.7GB/s; $90/hr Reader Connections GB/Second

EMRFS makes it easier to leverage Amazon S3 Better performance and error handling options Transparent to applications just read/write to s3:// Consistent view For consistent list and read-after-write for new puts Support for Amazon S3 server-side and client-side encryption Faster listing using EMRFS metadata

EMRFS support for Amazon S3 client-side encryption Amazon S3 encryption clients (client-side encrypted objects) Amazon S3 EMRFS enabled for Amazon S3 client-side encryption Key vendor (AWS KMS or your custom key vendor)

Fast listing of Amazon S3 objects using EMRFS metadata List and read-after-write consistency Faster list operations Number of objects 1,000,00 0 Without Consistent Views With Consistent Views 147.72 29.70 100,000 12.70 3.69 *Tested using a single node cluster with a m3.xlarge instance. Amazon S3 EMRFS metadata in Amazon DynamoDB

Optimize to leverage HDFS Iterative workloads If you re processing the same dataset more than once Disk I/O intensive workloads Persist data on Amazon S3 and use S3DistCp to copy to HDFS for processing.

Running Spark on Amazon EMR

Launch the latest Spark version July 15 Spark 1.4.1 GA release July 24 Spark 1.4.1 available on Amazon EMR September 9 Spark 1.5.0 GA release September 30 Spark 1.5.0 available on Amazon EMR < 3 week cadence with latest open source release

Amazon EMR runs Spark on YARN Applications Pig, Hive, Cascading, Spark Streaming, Spark SQL Batch MapReduce YARN Cluster Resource Management Storage S3, HDFS In Memory Spark Dynamically share and centrally configure the same pool of cluster resources across engines Schedulers for categorizing, isolating, and prioritizing workloads Choose the number of executors to use, or allow YARN to choose (dynamic allocation) Kerberos authentication

Create a fully configured cluster in minutes AWS Management Console AWS Command Line Interface (CLI) Or use an AWS SDK directly with the Amazon EMR API

Or easily change your settings

Many storage layers to choose from Amazon DynamoDB EMR-DynamoDB connector Amazon Redshift Copy From HDFS Amazon Redshift Amazon RDS JDBC Data Source w/ Spark SQL Elasticsearch connector Amazon EMR Streaming data connectors Amazon Kinesis EMR File System (EMRFS) Amazon S3

Decouple compute and storage by using S3 as your data layer EC2 Instance Memory Amazon EMR HDFS( Amazon EMR S3 is designed for 11 9 s of durability and is massively scalable Amazon EMR Amazon S3

Easy to run your Spark workloads Submit a Spark application Amazon EMR Step API Amazon EMR SSH to master node (Spark Shell)

Spark on YARN Fair Scheduler HDFS( YARN Capacity Scheduler Amazon EMR YARN is Hadoop 2 s Cluster Manager Spark Standalone Cluster Manager is FIFO, 100% Core Allocation Not ideal for multi-tenancy YARN Default is Capacity Scheduler Focus on Throughput Configure Queues Fair Scheduler also available Focus on even distribution

Why run Spark on YARN? YARN allows you to dynamically share and centrally configure the same pool of cluster resources between all frameworks that run on YARN. You can throw your entire cluster at a MapReduce job, then use some of it on an Impala query and the rest on Spark application, without any changes in configuration. You can take advantage of all the features of YARN schedulers for categorizing, isolating, and prioritizing workloads. Spark standalone mode requires each application to run an executor on every node in the cluster, whereas with YARN you choose the number of executors to use. YARN is the only cluster manager for Spark that supports security. With YARN, Spark can run against Kerberized Hadoop clusters and uses secure

Task Scheduler Supports general task graphs Pipelines functions where possible Cache-aware data reuse & locality Partitioning-aware to avoid shuffles A: B: Stage 1 groupby C: D: E: Stage 2 map filter = RDD F: join Stage 3 = cached partition

Secure Spark clusters encryption at rest On-Cluster HDFS transparent encryption (AES 256) [new on release emr-4.1.0] Local disk encryption for temporary files using LUKS encryption via bootstrap action Amazon S3 Amazon S3 EMRFS support for Amazon S3 client-side and server-side encryption (AES 256)

Secure Spark clusters encryption in flight Internode communication on-cluster Blocks are encrypted in-transit in HDFS when using transparent encryption Spark s Broadcast and FileServer services can use SSL. BlockTransferService (for shuffle) can t use SSL (SPARK-5682). S3 to Amazon EMR cluster Secure communication with SSL Amazon S3 Objects encrypted over the wire if using client-side encryption

Secure Spark clusters additional features Permissions: Cluster level: IAM roles for the Amazon EMR service and the cluster Application level: Kerberos (Spark on YARN only) Amazon EMR service level: IAM users Access: VPC, security groups Auditing: AWS CloudTrail

Thank you!