A Brief Introduction of Existing Big Data Tools. Bingjing Zhang





Outline
- The world map of big data tools
- Layered architecture
- Big data tools for HPC and supercomputing
  - MPI
- Big data tools on clouds
  - MapReduce model
  - Iterative MapReduce model
  - DAG model
  - Graph model
  - Collective model
- Machine learning on big data
- Query on big data
- Stream data processing

The World of Big Data Tools
[Figure: map of big data tools grouped by programming model (MapReduce, DAG, Graph, BSP/Collective) and by purpose (iterations/learning, query, streaming). Tools shown: Hadoop, Dryad/DryadLINQ, Drill, Spark, Pig/PigLatin, Hive, Tez, Shark, HaLoop, Twister, Harp, Stratosphere, REEF, MRQL, MPI, Giraph, Hama, GraphLab, GraphX. For streaming: S4, Samza, Storm, Spark Streaming.]

Layered Architecture (Upper)
NA = non-Apache projects. Green layers are Apache/commercial cloud (light) to HPC (darker) integration layers.
- Orchestration & Workflow: Oozie, ODE, Airavata and OODT (Tools); NA: Pegasus, Kepler, Swift, Taverna, Trident, ActiveBPEL, BioKepler, Galaxy
- Cross-cutting capabilities
  - Monitoring: Ambari, Ganglia, Nagios, Inca (NA)
  - Security & Privacy
  - Distributed Coordination: ZooKeeper, JGroups
  - Message Protocols: Thrift, Protobuf (NA)
- Data Analytics Libraries
  - Machine Learning: Mahout, MLlib, MLbase, CompLearn (NA)
  - Statistics, Bioinformatics: R, Bioconductor (NA)
  - Imagery: ImageJ (NA)
  - Linear Algebra: ScaLAPACK, PETSc (NA)
- High Level (Integrated) Systems for Data Processing
  - Hive (SQL on Hadoop), HCatalog (Interfaces), Pig (Procedural Language), Shark (SQL on Spark, NA), MRQL (SQL on Hadoop, Hama, Spark), Impala (NA, Cloudera) (SQL on HBase), Sawzall (Log Files, Google, NA)
- Parallel Horizontally Scalable Data Processing
  - Batch: Hadoop (MapReduce), Spark (Iterative MR), Twister (NA, Iterative MR), Tez (DAG), Hama (BSP), Stratosphere (Iterative MR)
  - Stream: Storm, S4 (Yahoo), Samza (LinkedIn)
  - Graph: Giraph (~Pregel), Pegasus on Hadoop (NA)
- ABDS Inter-process Communication
  - Hadoop, Spark communications & reductions; Harp Collectives (NA)
  - Pub/Sub Messaging: Netty (NA) / ZeroMQ (NA) / ActiveMQ / Qpid / Kafka
- HPC Inter-process Communication: MPI (NA)
The figure of layered architecture is from Prof. Geoffrey Fox.

Layered Architecture (Lower)
NA = non-Apache projects. Green layers are Apache/commercial cloud (light) to HPC (darker) integration layers.
- Cross-cutting capabilities
  - Monitoring: Ambari, Ganglia, Nagios, Inca (NA)
  - Security & Privacy
  - Distributed Coordination: ZooKeeper, JGroups
  - Message Protocols: Thrift, Protobuf (NA)
- In-memory distributed databases/caches: GORA (general object from NoSQL), Memcached (NA), Redis (NA) (key value), Hazelcast (NA), Ehcache (NA)
- ORM (Object Relational Mapping): Hibernate (NA), OpenJPA and JDBC Standard
- Extraction Tools: UIMA (Entities) (Watson), Tika (Content)
- SQL: MySQL (NA), Phoenix (SQL on HBase), SciDB (NA) (Arrays, R, Python)
- NoSQL: Column: HBase (Data on HDFS), Accumulo (Data on HDFS), Cassandra (DHT), Solandra (Solr + Cassandra) + Document
- NoSQL: Document: MongoDB (NA), CouchDB, Lucene, Solr
- NoSQL: General Graph: Neo4J (Java, GNU) (NA), Yarcdata (Commercial) (NA)
- NoSQL: Key Value (all NA): Berkeley DB, Azure Table, Dynamo (Amazon), Riak (~Dynamo), Voldemort (~Dynamo)
- NoSQL: TripleStore, RDF, SPARQL: Jena, Sesame (NA), AllegroGraph (Commercial), RYA (RDF on Accumulo)
- File Management: iRODS (NA)
- Data Transport: BitTorrent, HTTP, FTP, SSH, Globus Online (GridFTP)
- ABDS Cluster Resource Management: Mesos, Yarn, Helix, Llama (Cloudera)
- HPC Cluster Resource Management: Condor, Moab, Slurm, Torque (NA)
- ABDS File Systems: HDFS, Swift, Ceph (Object Stores); User Level: FUSE (NA) (POSIX Interface)
- HPC File Systems (NA): Gluster, Lustre, GPFS, GFFS (Distributed, Parallel, Federated)
- Interoperability Layer: Whirr / JClouds, OCCI, CDMI (NA)
- DevOps/Cloud Deployment: Puppet / Chef / Boto / CloudMesh (NA)
- IaaS System Manager: Open Source (OpenStack, OpenNebula, Eucalyptus, CloudStack); Commercial Clouds (vCloud, Amazon, Azure, Google); Bare Metal
The figure of layered architecture is from Prof. Geoffrey Fox.

Big Data Tools for HPC and Supercomputing
- MPI (Message Passing Interface, 1992)
  - Provides standardized function interfaces for communication between parallel processes.
- Collective communication operations
  - Broadcast, Scatter, Gather, Reduce, Allgather, Allreduce, Reduce-scatter
- Popular implementations
  - MPICH (2001)
  - OpenMPI (2004) http://www.open-mpi.org/

MapReduce Model
- Google MapReduce (2004)
  - Jeffrey Dean et al. MapReduce: Simplified Data Processing on Large Clusters. OSDI 2004.
- Apache Hadoop (2005)
  - http://hadoop.apache.org/
  - http://developer.yahoo.com/hadoop/tutorial/
- Apache Hadoop 2.0 (2012)
  - Vinod Kumar Vavilapalli et al. Apache Hadoop YARN: Yet Another Resource Negotiator. SoCC 2013.
  - Separation between resource management and computation model.

Key Features of MapReduce Model
- Designed for clouds
  - Large clusters of commodity machines
- Designed for big data
  - Supported by a distributed file system built on local disks (GFS / HDFS)
  - Disk-based intermediate data transfer during shuffling
- MapReduce programming model
  - Computation pattern: Map tasks and Reduce tasks
  - Data abstraction: key-value pairs
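The computation pattern and data abstraction listed above can be sketched in plain Java with no Hadoop dependency. The class `MiniMapReduce` and its method names are hypothetical, chosen only for illustration. A real framework runs the map and reduce tasks on different machines and moves the intermediate pairs through disk during shuffling, whereas this sketch keeps everything in memory in one process.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical single-process sketch of the MapReduce pattern (word count). */
class MiniMapReduce {

    /** Map task: emit one (word, 1) pair for every word in an input split. */
    static List<Map.Entry<String, Integer>> map(String split) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        for (String word : split.trim().split("\\s+")) {
            if (!word.isEmpty()) {
                out.add(new SimpleEntry<>(word, 1));
            }
        }
        return out;
    }

    /** Shuffle: group the intermediate values by key (in memory here, on disk in Hadoop). */
    static Map<String, List<Integer>> shuffle(List<Map.Entry<String, Integer>> pairs) {
        Map<String, List<Integer>> groups = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : pairs) {
            groups.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
        }
        return groups;
    }

    /** Reduce task: sum all values that share one key. */
    static int reduce(String key, List<Integer> values) {
        int sum = 0;
        for (int value : values) {
            sum += value;
        }
        return sum;
    }

    /** Run map over every split, shuffle the pairs, then reduce each key group. */
    static Map<String, Integer> run(List<String> splits) {
        List<Map.Entry<String, Integer>> intermediate = new ArrayList<>();
        for (String split : splits) {
            intermediate.addAll(map(split));
        }
        Map<String, Integer> result = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> group : shuffle(intermediate).entrySet()) {
            result.put(group.getKey(), reduce(group.getKey(), group.getValue()));
        }
        return result;
    }

    public static void main(String[] args) {
        // prints {big=2, clusters=1, data=3, tools=1}
        System.out.println(run(Arrays.asList("big data tools", "big clusters", "data data")));
    }
}
```

Each string passed to `run` plays the role of one input split, so the three strings in `main` correspond to three map tasks.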

Google MapReduce Mapper: split, read, emit intermediate KeyValue pairs Split 0 Split 1 Split 2 Worker (3) read (4) local write Worker Worker (1) fork (1) fork (1) fork (2) assign map User Program Master (2) assign reduce Reducer: repartition, emits final output Worker Worker (5) remote read (6) write Output File 0 Output File 1 Input files Map phase Intermediate files (on local disks) Reduce phase Output files

Iterative MapReduce Model
- Twister (2010)
  - Jaliya Ekanayake et al. Twister: A Runtime for Iterative MapReduce. HPDC workshop 2010.
  - http://www.iterativemapreduce.org/
  - Simple collectives: broadcasting and aggregation.
- HaLoop (2010)
  - Yingyi Bu et al. HaLoop: Efficient Iterative Data Processing on Large Clusters. VLDB 2010.
  - http://code.google.com/p/haloop/
  - Programming model: $R_{i+1} = R_0 \cup (R_i \bowtie L)$, where $R_0$ is the initial result and $L$ is a loop-invariant relation
  - Loop-aware task scheduling
  - Caching and indexing for loop-invariant data on local disk
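HaLoop's programming model, in which each iteration computes R_{i+1} = R_0 ∪ (R_i ⋈ L) from the previous result R_i and a loop-invariant relation L, can be illustrated with graph reachability. In the sketch below R_i is the set of vertices reached so far and L is the edge list, which never changes between iterations. The class name `LoopInvariantIteration` and the choice to project the join onto edge targets are assumptions made for this example and are not taken from HaLoop's API.

```java
import java.util.Arrays;
import java.util.Set;
import java.util.TreeSet;

/** Hypothetical sketch of the recurrence R_{i+1} = R_0 U (R_i join L) for reachability. */
class LoopInvariantIteration {

    /**
     * One iteration. The join matches each vertex in ri against the source of an
     * edge in the loop-invariant edge list and keeps the edge target.
     */
    static Set<Integer> step(Set<Integer> r0, Set<Integer> ri, int[][] edges) {
        Set<Integer> next = new TreeSet<>(r0);
        for (int[] edge : edges) {
            if (ri.contains(edge[0])) {
                next.add(edge[1]);
            }
        }
        return next;
    }

    /** Iterate until the result stops changing (the loop termination condition). */
    static Set<Integer> fixpoint(Set<Integer> r0, int[][] edges) {
        Set<Integer> current = new TreeSet<>(r0);
        while (true) {
            Set<Integer> next = step(r0, current, edges);
            if (next.equals(current)) {
                return current;
            }
            current = next;
        }
    }

    public static void main(String[] args) {
        int[][] edges = {{0, 1}, {1, 2}, {3, 4}};
        Set<Integer> start = new TreeSet<>(Arrays.asList(0));
        // prints [0, 1, 2]
        System.out.println(fixpoint(start, edges));
    }
}
```

Because the edge list is read unchanged in every iteration, it is the kind of data that loop-invariant caching aims to keep on local disk instead of reloading.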

Twister Programming Model
Main program's process space:

    configureMaps(...)
    configureReduce(...)
    while (condition) {
        runMapReduce(...)      // iterations: Map(), Reduce(), Combine() operation
        updateCondition()
    } // end while
    close()

- Worker nodes run cacheable map/reduce tasks and use the local disk.
- May scatter/broadcast <Key,Value> pairs directly.
- May merge data in shuffling.
- Communications/data transfers go via the pub-sub broker network and direct TCP.
- The main program may contain many MapReduce invocations or iterative MapReduce invocations.

DAG (Directed Acyclic Graph) Model
- Dryad and DryadLINQ (2007)
  - Michael Isard et al. Dryad: Distributed Data-Parallel Programs from Sequential Building Blocks. EuroSys 2007.
  - http://research.microsoft.com/en-us/collaboration/tools/dryad.aspx

Model Composition
- Apache Spark (2010)
  - Matei Zaharia et al. Spark: Cluster Computing with Working Sets. HotCloud 2010.
  - Matei Zaharia et al. Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing. NSDI 2012.
  - http://spark.apache.org/
  - Resilient Distributed Dataset (RDD)
  - RDD operations
    - MapReduce-like parallel operations
  - DAG of execution stages and pipelined transformations
  - Simple collectives: broadcasting and aggregation

Graph Processing with BSP Model
- Pregel (2010)
  - Grzegorz Malewicz et al. Pregel: A System for Large-Scale Graph Processing. SIGMOD 2010.
- Apache Hama (2010)
  - https://hama.apache.org/
- Apache Giraph (2012)
  - https://giraph.apache.org/
  - Scaling Apache Giraph to a trillion edges
    - https://www.facebook.com/notes/facebook-engineering/scaling-apache-giraph-to-a-trillion-edges/10151617006153920

Pregel & Apache Giraph
- Computation model
  - Superstep as iteration
  - Vertex state machine: Active and Inactive, vote to halt
  - Message passing between vertices
  - Combiners
  - Aggregators
  - Topology mutation
- Master/worker model
- Graph partition: hashing
- Fault tolerance: checkpointing and confined recovery
[Figure: Maximum Value Example. Vertex values by superstep: Superstep 0: 3 6 2 1; Superstep 1: 6 6 2 6; Superstep 2: 6 6 6 6; Superstep 3: 6 6 6 6. Vertices are marked as active or as having voted to halt.]
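The Maximum Value Example on this slide can be replayed with a small single-process simulation of supersteps, message passing and vote-to-halt. This is a sketch and not Pregel or Giraph code: the class `MaxValueSupersteps` is hypothetical, and the edge list used in `main` is an assumption chosen so that the run reproduces the value sequence shown on the slide (3 6 2 1, then 6 6 2 6, then 6 6 6 6 twice).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical single-process sketch of Pregel-style supersteps (maximum value). */
class MaxValueSupersteps {

    /**
     * Runs until every vertex has voted to halt and no messages are in flight.
     * outEdges[v] lists the targets of v's outgoing edges.
     * Returns the vertex values recorded at the end of each superstep.
     */
    static List<int[]> run(int[] initial, int[][] outEdges) {
        int n = initial.length;
        int[] value = initial.clone();
        boolean[] active = new boolean[n];
        Arrays.fill(active, true);
        List<List<Integer>> inbox = emptyInboxes(n);
        List<int[]> history = new ArrayList<>();

        for (int superstep = 0; ; superstep++) {
            List<List<Integer>> nextInbox = emptyInboxes(n);
            boolean anyWork = false;
            for (int v = 0; v < n; v++) {
                // A halted vertex is reactivated only by an incoming message.
                if (!active[v] && inbox.get(v).isEmpty()) {
                    continue;
                }
                anyWork = true;
                boolean changed = false;
                for (int message : inbox.get(v)) {
                    if (message > value[v]) {
                        value[v] = message;
                        changed = true;
                    }
                }
                if (superstep == 0 || changed) {
                    // Messages sent now are delivered in the next superstep.
                    for (int target : outEdges[v]) {
                        nextInbox.get(target).add(value[v]);
                    }
                    active[v] = true;
                } else {
                    active[v] = false; // vote to halt
                }
            }
            if (!anyWork) {
                break;
            }
            history.add(value.clone());
            inbox = nextInbox;
        }
        return history;
    }

    private static List<List<Integer>> emptyInboxes(int n) {
        List<List<Integer>> boxes = new ArrayList<>();
        for (int i = 0; i < n; i++) {
            boxes.add(new ArrayList<>());
        }
        return boxes;
    }

    public static void main(String[] args) {
        int[] values = {3, 6, 2, 1};
        int[][] outEdges = {{1}, {0, 3}, {3}, {2}}; // assumed topology
        List<int[]> history = run(values, outEdges);
        for (int s = 0; s < history.size(); s++) {
            System.out.println("Superstep " + s + ": " + Arrays.toString(history.get(s)));
        }
    }
}
```

The loop over vertices stands in for the parallel execution of `compute` on every active vertex, and swapping `inbox` for `nextInbox` stands in for the synchronization barrier between supersteps.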

Giraph PageRank Code Example

    import java.io.IOException;
    import org.apache.giraph.graph.BasicComputation;
    import org.apache.giraph.graph.Vertex;
    import org.apache.hadoop.io.FloatWritable;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.NullWritable;

    public class PageRankComputation extends BasicComputation<
            IntWritable, FloatWritable, NullWritable, FloatWritable> {
        /** Number of supersteps */
        public static final String SUPERSTEP_COUNT = "giraph.pagerank.superstepcount";

        @Override
        public void compute(Vertex<IntWritable, FloatWritable, NullWritable> vertex,
                Iterable<FloatWritable> messages) throws IOException {
            // From superstep 1 on, combine the rank contributions received from neighbors.
            if (getSuperstep() >= 1) {
                float sum = 0;
                for (FloatWritable message : messages) {
                    sum += message.get();
                }
                vertex.getValue().set((0.15f / getTotalNumVertices()) + 0.85f * sum);
            }
            // Keep sending rank shares until the configured superstep count is reached.
            if (getSuperstep() < getConf().getInt(SUPERSTEP_COUNT, 0)) {
                sendMessageToAllEdges(vertex,
                        new FloatWritable(vertex.getValue().get() / vertex.getNumEdges()));
            } else {
                vertex.voteToHalt();
            }
        }
    }

Deployment and Execution: https://giraph.apache.org/quick_start.html#qs_section_4

GraphLab (2010)
- Yucheng Low et al. GraphLab: A New Parallel Framework for Machine Learning. UAI 2010.
- Yucheng Low et al. Distributed GraphLab: A Framework for Machine Learning and Data Mining in the Cloud. PVLDB 2012.
- http://graphlab.org/projects/index.html
- http://graphlab.org/resources/publications.html
- Data graph
- Update functions and the scope
- Sync operation (similar to aggregation in Pregel)

Data Graph

Vertex-cut vs. Edge-cut
- PowerGraph (2012)
  - Joseph E. Gonzalez et al. PowerGraph: Distributed Graph-Parallel Computation on Natural Graphs. OSDI 2012.
  - Gather, Apply, Scatter (GAS) model
- GraphX (2013)
  - Reynold Xin et al. GraphX: A Resilient Distributed Graph System on Spark. GRADES (SIGMOD workshop) 2013.
  - https://amplab.cs.berkeley.edu/publication/graphx-grades/
[Figure: Edge-cut (Giraph model) compared with Vertex-cut (GAS model).]

To reduce communication overhead
- Option 1: algorithmic message reduction
  - Fixed point-to-point communication pattern
- Option 2: collective communication optimization
  - Not considered by the previous BSP model but well developed in MPI
  - Initial attempts in Twister and Spark on clouds
    - Mosharaf Chowdhury et al. Managing Data Transfers in Computer Clusters with Orchestra. SIGCOMM 2011.
    - Bingjing Zhang, Judy Qiu. High Performance Clustering of Social Images in a Map-Collective Programming Model. SoCC Poster 2013.

Collective Model
- Harp (2013)
  - https://github.com/jessezbj/harp-project
  - Hadoop plugin (on Hadoop 1.2.1 and Hadoop 2.2.0)
  - Hierarchical data abstraction on arrays, key-values and graphs for easy programming expressiveness
  - Collective communication model to support various communication operations on the data abstractions
  - Caching with buffer management for the memory allocation required by computation and communication
  - BSP-style parallelism
  - Fault tolerance with checkpointing

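Harp Design
[Figure: Harp design, in two panels. Parallelism model: in the MapReduce model, map tasks (M) feed reduce tasks (R) through shuffle; in the Map-Collective model, map tasks (M) exchange data through collective communication. Architecture: MapReduce applications and Map-Collective applications (application layer) run on MapReduce V2 and Harp (framework layer), on top of YARN (resource manager layer).]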

Hierarchical Data Abstraction and Collective Communication
[Figure: three-level data abstraction with the collective operations available at each level.]
- Tables (Array Table <Array Type>, Edge Table, Message Table, Vertex Table, KeyValue Table): Broadcast, Allgather, Allreduce, Regroup-(combine/reduce), Message-to-Vertex, Edge-to-Vertex
- Partitions (Array Partition <Array Type>, Edge Partition, Message Partition, Vertex Partition, KeyValue Partition): Broadcast, Send
- Basic types (Long Array, Int Array, Double Array, Byte Array; Vertices, Edges, Messages; Key-Values; Array, Struct Object, Commutable): Broadcast, Send, Gather

Harp Bcast Code Example

    protected void mapCollective(KeyValReader reader, Context context)
            throws IOException, InterruptedException {
        // Table that will hold the centroid partitions on every worker
        ArrTable<DoubleArray, DoubleArrPlus> table =
                new ArrTable<DoubleArray, DoubleArrPlus>(0, DoubleArray.class, DoubleArrPlus.class);
        if (this.isMaster()) {
            // Only the master loads the centroids from the centroid file
            String cFile = conf.get(KMeansConstants.CFILE);
            Map<Integer, DoubleArray> cenDataMap = createCenDataMap(cParSize, rest,
                    numCenPartitions, vectorSize, this.getResourcePool());
            loadCentroids(cenDataMap, vectorSize, cFile, conf);
            addPartitionMapToTable(cenDataMap, table);
        }
        // Broadcast the table from the master to all workers
        arrTableBcast(table);
    }

Pipelined Broadcasting with Topology-Awareness
[Figure: four plots against Number of Nodes (1 to 150).]
- Twister vs. MPI (broadcasting 0.5~2GB data): Twister Bcast and MPI Bcast at 500MB, 1GB and 2GB
- Twister vs. MPJ (broadcasting 0.5~2GB data): Twister and MPJ at 0.5GB and 1GB, Twister at 2GB
- Twister vs. Spark (broadcasting 0.5GB data): 1 receiver, #receivers = #nodes, #receivers = #cores (#nodes*8), Twister Chain
- Twister Chain with/without topology-awareness (TA): 0.5GB, 1GB and 2GB, each with and without TA
Tested on IU Polar Grid with 1 Gbps Ethernet connection.

K-Means Clustering Performance on Madrid Cluster (8 nodes)
[Figure: K-Means Clustering, Harp vs. Hadoop on Madrid. X axis: Problem Size (100m 500, 10m 5k, 1m 50k); Y axis: Execution Time (s), 0 to 1600. Series: Hadoop and Harp on 24, 48 and 96 cores.]

K-means Clustering Parallel Efficiency
Shantenu Jha et al. A Tale of Two Data-Intensive Paradigms: Applications, Abstractions, and Architectures. 2014.

WDA-MDS Performance on Big Red II
- WDA-MDS
  - Yang Ruan, Geoffrey Fox. A Robust and Scalable Solution for Interpolative Multidimensional Scaling with Weighting. IEEE e-Science 2013.
- Big Red II
  - http://kb.iu.edu/data/bcqt.html
- Allgather
  - Bucket algorithm
- Allreduce
  - Bidirectional exchange algorithm
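The slide names the bucket algorithm for Allgather without describing it. In the collective-communication literature the bucket algorithm is usually described as a ring: with p processes, each process forwards one block to its right neighbor in each of p - 1 steps, so that every process ends up holding all p blocks. The sketch below simulates that reading inside one process; it is not Harp code, and the class name `RingAllgather` is hypothetical.

```java
import java.util.Arrays;

/** Hypothetical single-process simulation of a ring ("bucket") allgather. */
class RingAllgather {

    /**
     * blocks[r] is the block initially owned by rank r.
     * Returns result[r][k]: rank r's copy of the block originally owned by rank k.
     */
    static double[][][] allgather(double[][] blocks) {
        int p = blocks.length;
        double[][][] buffers = new double[p][p][];
        for (int r = 0; r < p; r++) {
            buffers[r][r] = blocks[r].clone();
        }
        for (int step = 0; step < p - 1; step++) {
            for (int r = 0; r < p; r++) {
                int right = (r + 1) % p;
                // In step s, rank r forwards the block it received in step s - 1
                // (its own block when s == 0), which is block (r - s) mod p.
                int k = Math.floorMod(r - step, p);
                buffers[right][k] = buffers[r][k].clone();
            }
        }
        return buffers;
    }

    public static void main(String[] args) {
        double[][] blocks = {{1.0}, {2.0, 3.0}, {4.0}};
        double[][][] gathered = allgather(blocks);
        for (int r = 0; r < gathered.length; r++) {
            System.out.println("rank " + r + ": " + Arrays.deepToString(gathered[r]));
        }
    }
}
```

Each rank sends and receives exactly one block per step, so every link in the ring carries the same amount of data; this is the property that usually makes the ring form attractive for large messages.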

Execution Time of 100k Problem
[Figure: Execution Time of 100k Problem. X axis: Number of Nodes (8, 16, 32, 64, 128 nodes, 32 cores per node); Y axis: Execution Time (Seconds), 0 to 3000.]

Parallel Efficiency Based On 8 Nodes and 256 Cores
[Figure: Parallel Efficiency (Based On 8 Nodes and 256 Cores). X axis: Number of Nodes (8, 16, 32, 64, 128 nodes); Y axis: parallel efficiency, 0 to 1.2. 4096 partitions (32 cores per node).]
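The slide does not state how parallel efficiency is computed. A common definition relative to a baseline run, assumed here, is E(n) = (N_base × T_base) / (n × T_n), so the baseline configuration (8 nodes on this slide) has efficiency 1 by construction. The class `ParallelEfficiency` is hypothetical, and the timings in `main` are made-up placeholders, not measurements from the slides.

```java
/** Hypothetical helper for efficiency relative to a baseline run. */
class ParallelEfficiency {

    /**
     * Efficiency of a run on `nodes` nodes taking `seconds`, relative to a
     * baseline run on `baseNodes` nodes taking `baseSeconds`.
     */
    static double relativeEfficiency(int baseNodes, double baseSeconds,
                                     int nodes, double seconds) {
        if (baseNodes <= 0 || nodes <= 0 || baseSeconds <= 0 || seconds <= 0) {
            throw new IllegalArgumentException("node counts and times must be positive");
        }
        return (baseNodes * baseSeconds) / (nodes * seconds);
    }

    public static void main(String[] args) {
        // Placeholder timings for illustration only.
        double baseSeconds = 1000.0;                                          // 8 nodes
        System.out.println(relativeEfficiency(8, baseSeconds, 8, 1000.0));   // baseline
        System.out.println(relativeEfficiency(8, baseSeconds, 16, 625.0));   // doubled nodes
    }
}
```

With this definition, doubling the node count keeps efficiency at 1 only if the execution time halves; any value below 1 measures the share of added resources lost to communication and other overheads.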

Scale Problem Size (100k, 200k, 300k)
[Figure: Scaling Problem Size on 128 nodes with 4096 cores. X axis: Problem Size; Y axis: Execution Time (Seconds). Data labels: 368.386 at 100000, 1643.081 at 200000, 2877.757 at 300000.]

Machine Learning on Big Data
- Mahout on Hadoop
  - https://mahout.apache.org/
- MLlib on Spark
  - http://spark.apache.org/mllib/
- GraphLab Toolkits
  - http://graphlab.org/projects/toolkits.html
  - GraphLab Computer Vision Toolkit

Query on Big Data
- Query with procedural language
- Google Sawzall (2003)
  - Rob Pike et al. Interpreting the Data: Parallel Analysis with Sawzall. Special Issue on Grids and Worldwide Computing Programming Models and Infrastructure 2003.
- Apache Pig (2006)
  - Christopher Olston et al. Pig Latin: A Not-So-Foreign Language for Data Processing. SIGMOD 2008.
  - https://pig.apache.org/

SQL-like Query
- Apache Hive (2007)
  - Facebook Data Infrastructure Team. Hive - A Warehousing Solution Over a Map-Reduce Framework. VLDB 2009.
  - https://hive.apache.org/
  - On top of Apache Hadoop
- Shark (2012)
  - Reynold Xin et al. Shark: SQL and Rich Analytics at Scale. Technical Report. UCB/EECS 2012.
  - http://shark.cs.berkeley.edu/
  - On top of Apache Spark
- Apache MRQL (2013)
  - http://mrql.incubator.apache.org/
  - On top of Apache Hadoop, Apache Hama, and Apache Spark

Other Tools for Query
- Apache Tez (2013)
  - http://tez.incubator.apache.org/
  - Builds complex DAGs of tasks for Apache Pig and Apache Hive
  - On top of YARN
- Dremel (2010) and Apache Drill (2012)
  - Sergey Melnik et al. Dremel: Interactive Analysis of Web-Scale Datasets. VLDB 2010.
  - http://incubator.apache.org/drill/index.html
  - Systems for interactive query

Stream Data Processing
- Apache S4 (2011)
  - http://incubator.apache.org/s4/
- Apache Storm (2011)
  - http://storm.incubator.apache.org/
- Spark Streaming (2012)
  - https://spark.incubator.apache.org/streaming/
- Apache Samza (2013)
  - http://samza.incubator.apache.org/

REEF
- Retainable Evaluator Execution Framework
- http://www.reef-project.org/
- Provides system authors with a centralized (pluggable) control flow
  - Embeds a user-defined system controller called the Job Driver
  - Event-driven control
- Packages a variety of data-processing libraries (e.g., high-bandwidth shuffle, relational operators, low-latency group communication) in a reusable form
- Covers different models such as MapReduce, query, graph processing and stream data processing

Thank You! Questions?