Big Data Analytics: Hadoop and Spark
Shelly Garion, Ph.D., IBM Research Haifa

What is Big Data?

What is Big Data?
Big data usually includes data sets with sizes beyond the ability of commonly used software tools to capture, curate, manage, and process the data within a tolerable elapsed time. Big data "size" is a constantly moving target; as of 2012 it ranged from a few dozen terabytes to many petabytes of data. Big data is a set of techniques and technologies that require new forms of integration to uncover large hidden values from datasets that are diverse, complex, and of a massive scale. (From Wikipedia)

What is Big Data?
- Volume
- Value
- Variety
- Velocity - the speed of generation of data
- Variability - the inconsistency which can be shown by the data at times
- Veracity - the quality of the data being captured can vary greatly
- Complexity - data management can become a very complex process, especially when large volumes of data come from multiple sources
(From Wikipedia)

How to analyze Big Data?

Map/Reduce

Map/Reduce
MapReduce is a framework for processing parallelizable problems across huge datasets using a large number of computers (nodes), collectively referred to as a cluster.
- Map step: each worker node applies the "map()" function to its local data and writes the output to temporary storage. A master node ensures that only one copy of any redundantly stored input data is processed.
- Shuffle step: worker nodes redistribute data based on the output keys (produced by the "map()" function), such that all data belonging to one key is located on the same worker node.
- Reduce step: worker nodes now process each group of output data, per key, in parallel.
(From Wikipedia)

Basic Example: Word Count (Spark & Python)

Basic Example: Word Count (Spark & Scala)
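The code on these two slides is not transcribed; as a stand-in, here is a minimal Scala sketch of the word count (the Python version is structurally identical). It assumes a SparkContext sc is already available, as in the Spark shell, and the input path is a placeholder:

    // Minimal Spark word count (Scala); "input.txt" is a placeholder path.
    val textFile = sc.textFile("input.txt")        // RDD of lines
    val counts = textFile
      .flatMap(line => line.split(" "))            // split each line into words
      .map(word => (word, 1))                      // pair each word with a count of 1
      .reduceByKey(_ + _)                          // sum the counts per word
    counts.collect().foreach(println)              // bring results to the driver and print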

Map/Reduce Parallel Computing
No dependency among data: data can be split into equal-size chunks, and each process can work on a chunk.
Master/worker approach (a sketch follows below):
- Master: initializes the array and splits it according to the number of workers; sends each worker its sub-array; receives the results from each worker.
- Worker: receives a sub-array from the master; performs the processing; sends the results back to the master.
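A minimal sketch of that master/worker pattern using plain Scala futures; the data, chunk count, and the per-chunk work (summing squares) are all illustrative:

    import scala.concurrent.{Await, Future}
    import scala.concurrent.ExecutionContext.Implicits.global
    import scala.concurrent.duration._

    val data = (1 to 1000).toArray
    val numWorkers = 4
    // Master: split the array into equal-size chunks, one per worker.
    val chunks = data.grouped(data.length / numWorkers).toSeq
    // Workers: each future processes one chunk independently.
    val partials = chunks.map(chunk => Future(chunk.map(x => x * x).sum))
    // Master: gather the partial results and combine them.
    val total = Await.result(Future.sequence(partials), 1.minute).sum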


Map/Reduce History
Invented by Google, inspired by the map and reduce functions of functional programming languages.
Seminal paper: Jeffrey Dean & Sanjay Ghemawat, "MapReduce: Simplified Data Processing on Large Clusters", OSDI 2004.
Used at Google to completely regenerate Google's index of the World Wide Web; it replaced the old ad-hoc programs that updated the index and ran the various analyses.
Other uses: distributed pattern-based searching, distributed sorting, web link-graph reversal, term-vector per host, web access log stats, inverted index construction, document clustering, machine learning, statistical machine translation.
Apache Hadoop: an open-source implementation that matches Google's specifications.
Amazon Elastic MapReduce: Hadoop running on Amazon EC2.

Amazon Elastic MapReduce Source: http://docs.aws.amazon.com/elasticmapreduce/latest/developerguide/emr-what-is-emr.html 14

HDFS Architecture Source: http://hadoop.apache.org/docs/r1.0.4/hdfs_design.html

Hadoop & Object Storage

Apache Spark
Apache Spark is a fast and general open-source engine for large-scale data processing. It includes the following libraries: Spark SQL, Spark Streaming, MLlib (machine learning) and GraphX (graph processing).
Spark can run programs up to 100x faster than Hadoop MapReduce in memory, or 10x faster on disk.
Spark can run on Apache Mesos or Hadoop 2's YARN cluster manager, and can read any existing Hadoop data.
Spark is written in Scala (a Java-like language executed in the Java VM).
Apache Spark is built by a wide set of developers from over 50 companies. Since the project started in 2009, more than 400 developers have contributed to Spark.

Spark vs. Hadoop

Spark Cluster
Spark applications run as independent sets of processes on a cluster, coordinated by the SparkContext (SC) object in your main program.
The SparkContext can connect to several types of cluster managers (either Spark's own standalone cluster manager or Mesos/YARN), which allocate resources across applications.
Once connected, Spark acquires executors on nodes in the cluster, which are worker processes that run computations and store data for your application.
Next, it sends your application code (JAR or Python files) to the executors. Finally, the SparkContext sends tasks for the executors to run.
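A minimal sketch of how an application obtains that SparkContext; the application name and master URL below are placeholders:

    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf()
      .setAppName("MyApp")                  // hypothetical application name
      .setMaster("spark://host:7077")       // standalone master; could also be "yarn" or "mesos://..."
    val sc = new SparkContext(conf)         // coordinates executors on the cluster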

Spark Cluster Example: Log Mining
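The slides build this example up graphically; the code they illustrate is, roughly, the well-known Spark log-mining demo. The log path and field layout (tab-separated, message in the second field) are assumptions:

    // Load error messages from a log into memory, then query them interactively.
    val lines = sc.textFile("hdfs://logs/app.log")   // placeholder path
    val errors = lines.filter(_.startsWith("ERROR"))
    val messages = errors.map(_.split("\t")(1))      // assumed record layout
    messages.cache()                                 // keep the filtered data in memory

    messages.filter(_.contains("mysql")).count()     // first query reads from disk
    messages.filter(_.contains("php")).count()       // later queries hit the cache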

Spark RDD

Spark & Scala: Creating RDD
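The slide's code is not transcribed; the two standard ways to create an RDD (file path is a placeholder) are:

    // Distribute a local collection across the cluster.
    val data = sc.parallelize(1 to 1000)
    // Load an external dataset; "hdfs://data/file.txt" is a placeholder path.
    val text = sc.textFile("hdfs://data/file.txt")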

Spark & Scala: Basic Transformations
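A sketch of the usual transformations shown at this point; the sample data is illustrative. Transformations are lazy: they define a new RDD without computing it.

    val nums = sc.parallelize(Seq(1, 2, 3, 4))
    val squares = nums.map(x => x * x)         // 1, 4, 9, 16
    val evens = squares.filter(_ % 2 == 0)     // 4, 16
    val ranges = nums.flatMap(x => 1 to x)     // 1, 1,2, 1,2,3, 1,2,3,4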

Spark & Scala: Basic Actions
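A sketch of the usual actions; unlike transformations, actions trigger computation and return a value to the driver (output path is a placeholder):

    val nums = sc.parallelize(Seq(1, 2, 3, 4))
    nums.collect()                            // Array(1, 2, 3, 4): fetch all elements
    nums.take(2)                              // Array(1, 2): first two elements
    nums.count()                              // 4
    nums.reduce(_ + _)                        // 10
    nums.saveAsTextFile("hdfs://out/nums")    // placeholder output path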

Spark & Scala: Key-Value Operations
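A sketch of operations on pair (key-value) RDDs; the sample pairs are illustrative:

    val pets = sc.parallelize(Seq(("cat", 1), ("dog", 1), ("cat", 2)))
    pets.reduceByKey(_ + _)    // ("cat", 3), ("dog", 1)
    pets.groupByKey()          // ("cat", Seq(1, 2)), ("dog", Seq(1))
    pets.sortByKey()           // ("cat", 1), ("cat", 2), ("dog", 1)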

Example: Spark Map/Reduce
Goal: find the number of distinct names per first letter.
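The solution on the slides is not transcribed; one natural way to express it, with an illustrative list of names, is:

    // Count distinct names per first letter; the names are illustrative.
    val names = sc.parallelize(Seq("Alice", "Adam", "Alice", "Bob", "Carol"))
    val result = names
      .distinct()                           // drop duplicate names
      .map(name => (name.charAt(0), 1))     // key each name by its first letter
      .reduceByKey(_ + _)                   // count distinct names per letter
    result.collect()                        // (A,2), (B,1), (C,1)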

Example: Page Rank

The slides work through the iteration on a four-page example. Let A be the link matrix and V the initial rank vector, and set B = 0.85A (damped links) and U = 0.15V (teleport term):

\[
A = \begin{bmatrix} 0 & 0.5 & 1 & 0.5 \\ 1 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0.5 \\ 0 & 0.5 & 0 & 0 \end{bmatrix},
\qquad
V = \begin{bmatrix} 1 \\ 1 \\ 1 \\ 1 \end{bmatrix},
\qquad
B = 0.85A, \quad U = 0.15V.
\]

One iteration:
\[
BV + U = \begin{bmatrix} 1.85 \\ 1.0 \\ 0.575 \\ 0.575 \end{bmatrix}.
\]

Two iterations:
\[
B(BV + U) + U = \begin{bmatrix} 1.31 \\ 1.72 \\ 0.39 \\ 0.58 \end{bmatrix},
\]
then \(B(B(BV+U)+U)+U = \ldots\) and so on.

At the k-th step:
\[
B^k V + (B^{k-1} + B^{k-2} + \cdots + B^2 + B + I)\,U
  = B^k V + (I - B^k)(I - B)^{-1} U.
\]
For k = 10 this gives \((1.43,\ 1.37,\ 0.46,\ 0.73)^T\).

As k goes to infinity, \(B^k V \to 0\), so
\[
B^k V + (I - B^k)(I - B)^{-1} U \;\to\; (I - B)^{-1} U
  = \begin{bmatrix} 1.44 \\ 1.37 \\ 0.46 \\ 0.73 \end{bmatrix}.
\]

Why does \(B^k V \to 0\)? The characteristic polynomial of A is \(x^4 - 0.5x^2 - 0.25x - 0.25\); A is diagonalizable, and 1 is the largest eigenvalue of A (in absolute value), corresponding to the eigenvector
\[
E = \begin{bmatrix} 1.0 \\ 1.0 \\ 0.25 \\ 0.5 \end{bmatrix}.
\]
As k goes to infinity, \(A^k V \to cE\) for some constant c, while \(B^k V = 0.85^k A^k V \to 0\).

Example: Page Rank
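The implementation on these slides is not transcribed; the standard Spark PageRank, with the 0.85/0.15 damping used above, looks roughly like this. The page names p1..p4 are hypothetical, and the link list is read off the columns of matrix A:

    // p1 links to p2; p2 to p1 and p4; p3 to p1; p4 to p1 and p3 (columns of A).
    val links = sc.parallelize(Seq(
      "p1" -> Seq("p2"),
      "p2" -> Seq("p1", "p4"),
      "p3" -> Seq("p1"),
      "p4" -> Seq("p1", "p3")
    )).groupByKey().cache()

    var ranks = links.mapValues(_ => 1.0)          // initial rank vector V = (1,1,1,1)
    for (i <- 1 to 10) {
      // Each page sends rank/out-degree to every page it links to.
      val contribs = links.join(ranks).values.flatMap {
        case (outLinks, rank) => outLinks.flatMap(_.map(dest => (dest, rank / outLinks.map(_.size).sum)))
      }
      // New rank = 0.15 + 0.85 * (sum of incoming contributions), i.e. B*r + U.
      ranks = contribs.reduceByKey(_ + _).mapValues(0.15 + 0.85 * _)
    }
    ranks.collect().foreach(println)               // approaches (1.44, 1.37, 0.46, 0.73)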


Spark Streaming
Spark Streaming is an extension of the core Spark API that allows high-throughput, fault-tolerant stream processing of live data streams.
Spark Streaming receives live input data streams and divides the data into batches, which are then processed by the Spark engine to generate the final stream of results in batches.
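A minimal sketch of the API: a streaming word count over 1-second batches, assuming an existing SparkContext sc and a text source on a local socket (host and port are placeholders):

    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.StreamingContext._

    // Divide the live stream into 1-second batches.
    val ssc = new StreamingContext(sc, Seconds(1))
    val lines = ssc.socketTextStream("localhost", 9999)   // placeholder source
    val counts = lines.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _)
    counts.print()            // print the counts computed for each batch
    ssc.start()               // start receiving and processing
    ssc.awaitTermination()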


Machine Learning: K-Means Clustering (from Wikipedia)

Example: Spark MLlib & Geolocation
Goal: segment tweets into clusters by geolocation using Spark MLlib K-means clustering.
https://chimpler.wordpress.com/2014/07/11/segmenting-audience-with-kmeans-and-voronoidiagram-using-spark-and-mllib/

Example: Spark MLlib & Geolocation
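A sketch of the MLlib side of that approach; the (latitude, longitude) points below are illustrative stand-ins for tweet geolocations, and k and the iteration count are arbitrary:

    import org.apache.spark.mllib.clustering.KMeans
    import org.apache.spark.mllib.linalg.Vectors

    // Illustrative (lat, lon) pairs standing in for tweet geolocations.
    val points = sc.parallelize(Seq(
      Vectors.dense(40.7, -74.0),
      Vectors.dense(40.8, -73.9),
      Vectors.dense(48.9, 2.3),
      Vectors.dense(48.8, 2.4)
    )).cache()

    val model = KMeans.train(points, 2, 20)    // k = 2 clusters, 20 iterations
    model.clusterCenters.foreach(println)      // one geographic center per cluster
    model.predict(Vectors.dense(40.75, -73.95))  // assign a new point to a cluster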

The Next Generation... https://amplab.cs.berkeley.edu/software/

References
Papers:
- Matei Zaharia, Mosharaf Chowdhury, Tathagata Das, Ankur Dave, Justin Ma, Murphy McCauley, Michael J. Franklin, Scott Shenker, Ion Stoica, "Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing", NSDI 2012, April 2012.
Videos and presentations:
- Spark workshop hosted by Stanford ICME, April 2014 - Reza Zadeh, Patrick Wendell, Holden Karau, Hossein Falaki, Xiangrui Meng, http://stanford.edu/~rezab/sparkworkshop/
- Spark Summit 2014 - Matei Zaharia, Aaron Davidson, http://spark-summit.org/2014/
- Paul Krzyzanowski, "Distributed Systems", www.cs.rutgers.edu/~pxk/
Links:
- https://spark.apache.org/docs/latest/index.html
- http://hadoop.apache.org/
