Pepper: An Elastic Web Server Farm for Cloud based on Hadoop. Subramaniam Krishnan, Jean Christophe Counio Yahoo! Inc. MAPRED 1 st December 2010

Similar documents

GraySort and MinuteSort at Yahoo on Hadoop 0.23

BookKeeper. Flavio Junqueira Yahoo! Research, Barcelona. Hadoop in China 2011

Department of Computer Science University of Cyprus EPL646 Advanced Topics in Databases. Lecture 15

Extending Hadoop beyond MapReduce

Distributed File System. MCSN N. Tonellotto Complements of Distributed Enabling Platforms

Architecting for the next generation of Big Data Hortonworks HDP 2.0 on Red Hat Enterprise Linux 6 with OpenJDK 7

DESIGN ARCHITECTURE-BASED ON WEB SERVER AND APPLICATION CLUSTER IN CLOUD ENVIRONMENT

Hadoop & Spark Using Amazon EMR

Mobile Cloud Computing for Data-Intensive Applications

Introduction to Hadoop

A Performance Analysis of Distributed Indexing using Terrier

Research Article Hadoop-Based Distributed Sensor Node Management System

Chapter 7. Using Hadoop Cluster and MapReduce

CSE 590: Special Topics Course ( Supercomputing ) Lecture 10 ( MapReduce& Hadoop)

Open Cloud System. (Integration of Eucalyptus, Hadoop and AppScale into deployment of University Private Cloud)

Leveraging BlobSeer to boost up the deployment and execution of Hadoop applications in Nimbus cloud environments on Grid 5000

Comparative analysis of mapreduce job by keeping data constant and varying cluster size technique

Efficient Data Replication Scheme based on Hadoop Distributed File System

Evaluating HDFS I/O Performance on Virtualized Systems

Energy-Saving Cloud Computing Platform Based On Micro-Embedded System

Introduction to Hadoop

HDFS Space Consolidation

Cisco Unified Data Center Solutions for MapR: Deliver Automated, High-Performance Hadoop Workloads

Apache Ignite TM (Incubating) - In- Memory Data Fabric Fast Data Meets Open Source

CSE-E5430 Scalable Cloud Computing Lecture 2

Hadoop on OpenStack Cloud. Dmitry Mescheryakov Software

Introduction to Hadoop HDFS and Ecosystems. Slides credits: Cloudera Academic Partners Program & Prof. De Liu, MSBA 6330 Harvesting Big Data

How Cisco IT Built Big Data Platform to Transform Data Management

Developing Scalable Smart Grid Infrastructure to Enable Secure Transmission System Control

Hadoop Architecture. Part 1

In Memory Accelerator for MongoDB

CDH AND BUSINESS CONTINUITY:

Benchmarking Hadoop & HBase on Violin

Cost-Effective Business Intelligence with Red Hat and Open Source

Viswanath Nandigam Sriram Krishnan Chaitan Baru

How In-Memory Data Grids Can Analyze Fast-Changing Data in Real Time

Fast Data in the Era of Big Data: Tiwtter s Real-Time Related Query Suggestion Architecture

Snapshots in Hadoop Distributed File System

Beyond Sampling: Fast, Whole- Dataset Analytics for Big Data on Hadoop

ESPRESSO: An Encryption as a Service for Cloud Storage Systems

Hadoop: Embracing future hardware

DESIGN AN EFFICIENT BIG DATA ANALYTIC ARCHITECTURE FOR RETRIEVAL OF DATA BASED ON

Hadoop Distributed File System. T Seminar On Multimedia Eero Kurkela

Oracle Database Scalability in VMware ESX VMware ESX 3.5

Accelerating Hadoop MapReduce Using an In-Memory Data Grid

The Performance Characteristics of MapReduce Applications on Scalable Clusters

THE HADOOP DISTRIBUTED FILE SYSTEM

What is Big Data? Concepts, Ideas and Principles. Hitesh Dharamdasani

CS2510 Computer Operating Systems

CS2510 Computer Operating Systems

Virtual Machine Based Resource Allocation For Cloud Computing Environment

Department of Computer Science University of Cyprus EPL646 Advanced Topics in Databases. Lecture 14

Data Pipeline with Kafka

Big Data - Infrastructure Considerations

A Cost-Evaluation of MapReduce Applications in the Cloud

HDP Hadoop From concept to deployment.

Elasticsearch on Cisco Unified Computing System: Optimizing your UCS infrastructure for Elasticsearch s analytics software stack

Cloudera Enterprise Reference Architecture for Google Cloud Platform Deployments

Ground up Introduction to In-Memory Data (Grids)

Journal of science STUDY ON REPLICA MANAGEMENT AND HIGH AVAILABILITY IN HADOOP DISTRIBUTED FILE SYSTEM (HDFS)

Lecture 32 Big Data. 1. Big Data problem 2. Why the excitement about big data 3. What is MapReduce 4. What is Hadoop 5. Get started with Hadoop

How To Run Apa Hadoop 1.0 On Vsphere Tmt On A Hyperconverged Network On A Virtualized Cluster On A Vspplace Tmter (Vmware) Vspheon Tm (

Scaling Hadoop for Multi-Core and Highly Threaded Systems

Designing a Cloud Storage System

Lambda Architecture. Near Real-Time Big Data Analytics Using Hadoop. January Website:

Maximizing Hadoop Performance and Storage Capacity with AltraHD TM

Deploying Hadoop with Manager

Big Data Use Case. How Rackspace is using Private Cloud for Big Data. Bryan Thompson. May 8th, 2013

HDFS Federation. Sanjay Radia Founder and Hortonworks. Page 1

Enhancing Dataset Processing in Hadoop YARN Performance for Big Data Applications

OSG Hadoop is packaged into rpms for SL4, SL5 by Caltech BeStMan, gridftp backend

Ignify ecommerce. Item Requirements Notes

Distributed Storage Systems

Design and Evolution of the Apache Hadoop File System(HDFS)

Processing Large Amounts of Images on Hadoop with OpenCV

Entering the Zettabyte Age Jeffrey Krone

EWeb: Highly Scalable Client Transparent Fault Tolerant System for Cloud based Web Applications

Cloud Computing and Advanced Relationship Analytics

Liferay Portal Performance. Benchmark Study of Liferay Portal Enterprise Edition

Hadoop Ecosystem Overview. CMSC 491 Hadoop-Based Distributed Computing Spring 2015 Adam Shook

BIG DATA TRENDS AND TECHNOLOGIES

6. How MapReduce Works. Jari-Pekka Voutilainen

CLOUDDMSS: CLOUD-BASED DISTRIBUTED MULTIMEDIA STREAMING SERVICE SYSTEM FOR HETEROGENEOUS DEVICES

Michael Thomas, Dorian Kcira California Institute of Technology. CMS Offline & Computing Week

ENABLING GLOBAL HADOOP WITH EMC ELASTIC CLOUD STORAGE

Parallel Computing. Benson Muite. benson.

IncidentMonitor Server Specification Datasheet

BENCHMARKING CLOUD DATABASES CASE STUDY on HBASE, HADOOP and CASSANDRA USING YCSB

Apache Ignite TM (Incubating) - In- Memory Data Fabric Fast Data Meets Open Source

Network-Aware Scheduling of MapReduce Framework on Distributed Clusters over High Speed Networks

Fault Tolerance in Hadoop for Work Migration

Big Data and Natural Language: Extracting Insight From Text

Improving Grid Processing Efficiency through Compute-Data Confluence

Implement Hadoop jobs to extract business value from large and varied data sets

PLATFORA INTERACTIVE, IN-MEMORY BUSINESS INTELLIGENCE FOR HADOOP

Purpose-Built Load Balancing The Advantages of Coyote Point Equalizer over Software-based Solutions

Lambda Architecture for Batch and Real- Time Processing on AWS with Spark Streaming and Spark SQL. May 2015

Oracle s Big Data solutions. Roger Wullschleger. <Insert Picture Here>

Maximum performance, minimal risk for data warehousing

Transcription:

Pepper: An Elastic Web Server Farm for Cloud based on Hadoop Subramaniam Krishnan, Jean Christophe Counio. MAPRED 1 st December 2010

Agenda Motivation Design Features Applications Evaluation Conclusion Future Work 1

Motivation 2

What s in a Name Wave 1: Grid-ification Crunch 10-100s of GBs of data in hours Large data like wikipedia Hosted, multitenant platform Grid workflow management system (PacMan) Wave 2: Content Freshness Process 100s of feeds/sec, size in KBs in seconds Web feeds like breaking news, tweets, finance quotes Scalable, high throughput & low latency platform Pepper elastic web server farm on grid Motivation 3

Requirements Elastic: handle intra/inter application load variance Multi-tenant: provide process/memory isolation Sub-second platform overhead Simple API Execute user code in platform context Reliability: transparent fault tolerance Motivation 4

Design 5

Deployment Flow Web application deployed as WAR onto HDFS Job Manager Embedded Jetty server runs in Map task, registers with ZooKeeper 1 Hadoop job = 1 Map task = 1 Web Server = 1 Web application Design 6

Processing Flow Proxy Router receives incoming requests, looks up ZooKeeper & redirects to appropriate Web Server Design 7

ZooKeeper Hierarchy Design 8

Features 9

Features Scalability: Web application can scale by configuring more instances (Elasticity), system can scale with addition of Hadoop nodes Performance: High throughput by ensuring that all the heavy lifting is done during deployment High Availability/Self-healing: Redundant server instances. Health check piggybacked on TaskTracker heartbeat Isolation: Hadoop map provides process isolation Ease of Development: Standard Servlet API & WAR packaging Reuse of Grid Infrastructure: The system runs on a Grid that can be shared across several applications Features 10

Applications 11

Applications Web Feeds Processing: Configure workflow orchestration engine to run in-memory, 1 workflow = 1 web-application. Benefits: Scalability Isolation Avoids Hadoop job bootstrap latency and HDFS small files bottleneck. Online Clustering: Extracts features and assigns incoming feeds to clusters predetermined by offline clustering. Performed online for Yahoo! News to identify hot news clusters during ingestion of articles. Applications 12

Evaluation 13

Setup Hardware: Intel Xeon L5420 2.50GHz with 8GB DDR2 RAM Software: 64-bit SUN JDK 1.6 update 18 on RHEL AS 4 U8, Linux 2.6.9-89.ELsmp x86_64 Configuration: 8 map slots/node with 512MB heap, 25 threads/jetty server Number of Computing Hadoop nodes: 3 Evaluation 14

Linear Scaling for Predefined Capacity Throughput: number of requests handled successfully per second for a specified number of tasks Evaluation 15

Elastic Scaling for Dynamic Capacity Rejection: failure to execute within predefined timeout Load is increased and additional map task allocated at points A and B based on predefined schedule Failure rate of < 0.001% observed in Production Evaluation 16

Pepper Performance Numbers System Burst Rate (request/mi n) Throughput (requests/da y) Platform Latency (Avg.) Response Time (Avg.) Pepper 2,000 3 million 75 ms 4s PacMan 50 10,000 90s 120s Dataset is Yahoo! News feeds with sizes < 1MB Processing is typically computation intensive like processing and enriching web feeds that involves validation, normalization, geo tagging, persistence in service stores, etc Evaluation 17

Conclusion 18

Conclusion Pepper marries the benefits of traditional server farms i.e. low latency and high throughput with those of cloud i.e. elasticity and isolation In production within Yahoo! from December 2009 Current Y! properties - Newspaper Consortium, Finance & News. Sports & Entertainment are in pipeline System scales linearly with addition of more Hadoop computing nodes Conclusion 19

Future Work 20

Future Work On demand allocation of servers Experimenting with async NIO between Proxy Router & Map Web Engine to increase scalability Improving distribution of requests across web servers Integrate into Hadoop (?) Future Work 21

References Hadoop, Web Page http://hadoop.apache.org/ J. Dean and S. Ghemawat, MapReduce: Simplified Data Processing on Large Cluster, 6th Symposium on Operating Systems Design and Implementation (OSDI 04), San Francisco, CA, December 2004, pp. 137 150 P. Hunt, M. Konar, F.P. Junqueira, and B. Reed, ZooKeeper: Wait-free coordination for Internet-scale systems, Proceedings of the 2010 USENIX Conference on USENIX Annual Technical Conference, Boston, MA, June 2010, pp. 11-11 Oozie (successor to PacMan), Web Page http://yahoo.github.com/oozie/, http://www.cloudera.com/blog/2010/07/whats-new-incdh3-b2- oozie/ 22

Questions? 23