HiBench Introduction. Carson Wang (carson.wang@intel.com), Software & Services Group




Agenda
- Background
- Workloads
- Configurations
- Benchmark Report
- Tuning Guide

Background
- WHY: Why do we need big data benchmarking systems?
- WHAT: What is HiBench?
- HOW: How do we use HiBench?

The big data ecosystem is complex:
- Frameworks: Hadoop (MR1, MR2), Spark, streaming engines (Spark Streaming, Storm, Storm-Trident, Samza, Kafka)
- Deployment: Standalone, YARN, Zookeeper
- Languages: Scala, Java, Python
- Applications: SQL, Machine Learning, GraphX

Frequently Asked Questions from Customers
- Which framework is better?
  - Hadoop or Spark?
  - Spark in Scala, Java, or Python?
  - Spark Streaming, Storm, or Samza?
  - Standalone or YARN?
- What is the performance difference between two deployments?
- How many resources are needed (CPU cores, memory, network bandwidth)?
- Is the cluster configured properly (e.g., Spark executor number, partition number tuning)?

HiBench Roadmap
- HiBench 1.0 (2012.6): initial release
- HiBench 2.0 (2013.9): CDH, hadoop2 support
- HiBench 3.0 (2014.10): YARN support, SparkBench
- HiBench 4.0 (2015.3): workload abstraction framework
- HiBench 5.0 (2015.10): StreamingBench

Workload Abstraction
- Micro Benchmark
- Machine Learning
- Web Search
- SQL
- Streaming
- HDFS Benchmark

Micro Benchmarks
- Sort: sorts its text input data, which is generated by RandomTextWriter.
- WordCount: counts the occurrence of each word in the input data, which is generated using RandomTextWriter.
- TeraSort: a standard benchmark created by Jim Gray; its input data is generated by the Hadoop TeraGen example program.
- Sleep: sleeps a few seconds in each task to test the framework scheduler.
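For illustration, this is what the WordCount computation looks like in PySpark; the sketch is ours, not HiBench's implementation, and the HDFS paths are placeholders:

    from pyspark import SparkContext

    sc = SparkContext(appName="wordcount-sketch")
    # Count the occurrence of each word in the input, as the WordCount workload does.
    counts = (sc.textFile("hdfs:///hibench/wordcount/input")   # placeholder path
                .flatMap(lambda line: line.split())
                .map(lambda word: (word, 1))
                .reduceByKey(lambda a, b: a + b))
    counts.saveAsTextFile("hdfs:///hibench/wordcount/output")  # placeholder path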

Machine Learning Benchmarks
- Bayes: Naive Bayesian classification, as implemented in the Spark MLlib/Mahout examples; a popular classification algorithm for knowledge discovery and data mining.
- K-Means: tests K-means clustering (a well-known clustering algorithm for knowledge discovery and data mining) in Spark MLlib/Mahout.
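For illustration, a minimal PySpark sketch of K-means clustering with the RDD-based MLlib API; the input path, k, and iteration count are made-up values, not HiBench's:

    from pyspark import SparkContext
    from pyspark.mllib.clustering import KMeans

    sc = SparkContext(appName="kmeans-sketch")
    # Each input line is assumed to be a whitespace-separated feature vector.
    data = (sc.textFile("hdfs:///hibench/kmeans/input")  # placeholder path
              .map(lambda line: [float(x) for x in line.split()]))
    # Train a K-means model; k and maxIterations are illustrative values.
    model = KMeans.train(data, k=10, maxIterations=10)
    print(model.clusterCenters)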

Web Search Benchmarks
- PageRank: an implementation of Google's web page ranking algorithm.
- Nutch Indexing: tests the indexing sub-system of Nutch, a popular open-source (Apache) search engine.
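For illustration, a toy PySpark sketch of the iterative rank computation PageRank performs; the three-page link graph is made up, and the damping-factor formulation follows the textbook version rather than HiBench's code:

    from pyspark import SparkContext

    sc = SparkContext(appName="pagerank-sketch")
    # Toy link graph: page -> list of outgoing links (illustrative data).
    links = sc.parallelize([("a", ["b", "c"]), ("b", ["a"]), ("c", ["a", "b"])]).cache()
    ranks = links.mapValues(lambda _: 1.0)
    for _ in range(10):
        # Each page contributes its rank evenly to the pages it links to.
        contribs = links.join(ranks).flatMap(
            lambda kv: [(dst, kv[1][1] / len(kv[1][0])) for dst in kv[1][0]])
        # Combine contributions with the standard 0.85 damping factor.
        ranks = contribs.reduceByKey(lambda a, b: a + b) \
                        .mapValues(lambda s: 0.15 + 0.85 * s)
    print(ranks.collect())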

SQL Benchmarks
- Scan, Join, Aggregation: adapted from Pavlo's benchmark "A Comparison of Approaches to Large-Scale Data Analysis" and HIVE-396. These workloads run the typical OLAP queries described in the paper.

HDFS Benchmark
- Enhanced DFSIO: tests the HDFS throughput of the Hadoop cluster by generating a large number of tasks that perform writes and reads simultaneously. It measures the average I/O rate of each map task, the average throughput of each map task, and the aggregated throughput of the HDFS cluster.

Streaming Benchmarks
- Identity, Sample, Project, Grep, Wordcount, Distinctcount, Statistics

Quick Start
1. Build HiBench with Maven:
   mvn clean package -Dspark1.5 -DMR2
2. Add basic configurations in conf/99-user_defined_properties.conf:
   hibench.hadoop.home    The Hadoop installation location
   hibench.spark.home     The Spark installation location
   hibench.hdfs.master    hdfs://sr555:8020
   hibench.spark.master   yarn-client

Quick Start (cont.)
3. Generate the input data and run the workload:
   workloads/wordcount/prepare/prepare.sh
   workloads/wordcount/mapreduce/bin/run.sh
   workloads/wordcount/spark/scala/bin/run.sh
   workloads/wordcount/spark/java/bin/run.sh
   workloads/wordcount/spark/python/bin/run.sh
4. View report/hibench.report to see the workload input data size, execution duration, throughput (bytes/s), and throughput per node.

Configuration Mechanism
Hierarchical configuration; more specific levels override more global ones:
- Global level: conf/
- Workload level: workloads/<name>/conf/
- Framework level: workloads/<name>/mapreduce/hadoop.conf
- Language level: workloads/<name>/spark/scala/scala.conf
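The override order can be pictured as a simple ordered merge; the sketch below is illustrative only (merge_conf and the property values are made up, not HiBench's loader code):

    # Illustrative sketch of hierarchical override: later (more specific)
    # levels win over earlier (more global) ones.
    def merge_conf(*levels):
        merged = {}
        for level in levels:  # global -> workload -> framework -> language
            merged.update(level)
        return merged

    global_conf = {"hibench.default.map.parallelism": "80"}     # made-up value
    workload_conf = {"hibench.default.map.parallelism": "160"}  # overrides global
    print(merge_conf(global_conf, workload_conf))  # parallelism resolves to '160'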

Configuration - Parallelism
Parallelism of workload:
- hibench.default.map.parallelism: mapper number in MR, default parallelism in Spark
- hibench.default.shuffle.parallelism: reducer number in MR, shuffle partition number in Spark
Executor numbers & cores:
- hibench.yarn.executors.num: number of executors in YARN mode
- hibench.yarn.executors.cores: number of executor cores in YARN mode

Configuration - Memory
- spark.executors.memory: executor memory, standalone or YARN mode
- spark.driver.memory: driver memory, standalone or YARN mode

Configuration - Data Scale
Data scale profile definition: conf/10-data-scale-profile.conf
   # Wordcount
   hibench.wordcount.tiny.datasize        32000
   hibench.wordcount.small.datasize       320000000
   hibench.wordcount.large.datasize       3200000000
   hibench.wordcount.huge.datasize        32000000000
   hibench.wordcount.gigantic.datasize    320000000000
   hibench.wordcount.bigdata.datasize     1600000000000
Data scale profile selection: hibench.scale.profile large

Benchmark Report
Each report/<workload>/<framework>/<language> has a dedicated folder which contains:
- Console output log
- System utilization log
- Configuration files:
  - Bash environment conf file used by workload scripts: <workload_name>.conf
  - All-in-one property-based conf generated from all confs: sparkbench.conf
  - Spark/Samza/... related conf extracted from all confs: spark.conf, samza.conf, ...

Report Example
./report/hibench.report
  Summarized workload report, including workload name, execution duration, data size, throughput per cluster, and throughput per node.
./report/<workload>/<language_api>/bench.log
  Driver log when running this workload/API.
./report/<workload>/<language_api>/monitor.html
  System utilization monitor chart for CPU, disk, network, memory, and proc load.
./report/<workload>/<language_api>/conf/<workload>.conf
  Environment variable configuration generated for this workload.
./report/<workload>/<language_api>/conf/sparkbench/sparkbench.conf
  Properties configuration generated for this workload.
./report/<workload>/<language_api>/conf/sparkbench/spark.conf
  Spark properties configuration generated for this workload.

Spark Workloads Tuning Guide
- Determine executor cores and numbers
- Determine parallelism
- Determine partition size
- Monitor system utilization

Determine Executor Cores and Numbers
- Let N = number of virtual CPU cores in one node.
- Let M = number of worker nodes in the cluster.
- Set executor cores to a number around 5: Executor_cores = 5.
- Calculate the executor number running in each node: Executor_number_in_node = (N - 1) / Executor_cores.
- Calculate the total executor number: Executor_number = Executor_number_in_node * M.

Determine Parallelism
- Calculate parallelism: P = Executor_cores * Executor_number.
- Calculate memory for each executor: Executor_memory <= Memory_node / Executor_number_in_node * 0.9.
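These formulas are easy to script; here is a small Python sketch of the rules above (the function name and the integer rounding are our choices, not part of HiBench):

    # A sketch of the sizing formulas above.
    def size_spark_executors(N, M, memory_node_gb, executor_cores=5):
        executor_number_in_node = (N - 1) // executor_cores  # one vcore held back, per the formula
        executor_number = executor_number_in_node * M
        parallelism = executor_cores * executor_number       # P
        executor_memory_gb = memory_node_gb / executor_number_in_node * 0.9  # upper bound
        return executor_number_in_node, executor_number, parallelism, executor_memory_gb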

Determine Partition Size
- Partition size should be >= P (parallelism), but the right partition size varies with the real case.
- Low parallelism: a partition size less than P means not enough tasks for all vcores.
- Partition size too large: tasks finish too fast, and the task scheduler can't feed tasks quickly enough.
- Out of memory: partition size too small.
- Guideline: increase partition size starting from P until performance begins to drop; the best parallelism lies between P and the boundary of enough memory.

Tuning Example
Hardware specs:
- CPU: E5-2699 v3 (1 socket, 18 cores, 2 threads per core)
- Memory capacity: 256 GB
- Disk: 4 x 400 GB SSD
- NIC: 10 Gb
- Cluster size: 3 slaves
Calculation:
- N = 36, M = 3
- Executor cores = 5
- Executor number in one node = (36 - 1) / 5 = 7
- Total executor number = 7 * 3 = 21
- Executor memory <= 256 / 7 * 0.9 = 32 GB
- Parallelism P = 5 * 21 = 105
- Partition size = 105, 210, 315, etc.
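As a sanity check, the same arithmetic in a few lines of Python, reproducing the numbers above:

    # Reproduce the tuning example: 18 cores x 2 threads = 36 vcores, 3 slaves.
    N, M, executor_cores = 36, 3, 5
    executor_number_in_node = (N - 1) // executor_cores            # 7
    executor_number = executor_number_in_node * M                  # 21
    parallelism = executor_cores * executor_number                 # 105
    executor_memory_gb = int(256 / executor_number_in_node * 0.9)  # 32 GB upper bound
    print(executor_number_in_node, executor_number, parallelism, executor_memory_gb)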

Spark Tuning Example - TeraSort

Spark Tuning Example - KMeans
Low CPU usage: too little input data! RDDs produced by textFile (sequenceFile, etc.) have their partition count determined by the input data size and the HDFS block size.
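A common remedy is to request more partitions when the RDD is created; a hedged PySpark sketch, where the path is a placeholder and the target of 105 partitions is taken from the earlier example:

    from pyspark import SparkContext

    sc = SparkContext(appName="partition-fix-sketch")
    # Request at least P partitions so a small input still produces
    # enough tasks for every vcore.
    rdd = sc.textFile("hdfs:///hibench/kmeans/input", minPartitions=105)  # placeholder path
    print(rdd.getNumPartitions())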

What's Next: HiBench 6.0
- More real-world workloads
- More flexible build and configuration
- A better resource monitoring service

Q & A Thanks