Convex Optimization for Big Data: Lecture 2: Frameworks for Big Data Analytics
1 Convex Optimization for Big Data: Lecture 2: Frameworks for Big Data Analytics Sabeur Aridhi Aalto University, Finland Sabeur Aridhi Frameworks for Big Data Analytics 1 / 59
2 Introduction Contents 1 Introduction Big Data Parallel/Distributed computing Distributed storage Grid Computing Cloud Computing 2 MapReduce Programming model Framework Algorithms 3 Other frameworks SPARK STORM 4 Graph Processing Motivations Pregel model Sample Algorithms: SSSP Available implementations Blogel 5 Q and A Sabeur Aridhi Frameworks for Big Data Analytics 2 / 59
3 Introduction Big Data Growth of data size Data sets We need to analyze bigger and bigger datasets: Web datasets: web topology (Google: est. 46 billion pages) Network datasets: routing efficiency analysis, TCP/IP logs Biological datasets: genome sequencing data, protein patterns Big Data Three main characteristics: Volume: Huge size of datasets (hence distributed storage). Variety: Several types of data such as text, image and video. Velocity: Data is generated at high speed (e.g., CERN experiments: one petabyte/sec). Sabeur Aridhi Frameworks for Big Data Analytics 3 / 59
4 Introduction Big Data Key sources of Big Data Internet of Things (IoT) Networking of hardware equipment (e.g., household appliances, sensors, mobile phones) A significant source of Big Data Sabeur Aridhi Frameworks for Big Data Analytics 4 / 59
5 Introduction Big Data Key sources of Big Data Smart cities Use of information technology to enhance quality, performance and interactivity of urban services in a city. Data is collected from sensors installed on: utility poles water lines buses trains traffic lights,... Sabeur Aridhi Frameworks for Big Data Analytics 5 / 59
6 Introduction Big Data Key sources of Big Data - Smart cities Sabeur Aridhi Frameworks for Big Data Analytics 6 / 59
8 Introduction Big Data Big Data in everyday life Does one minute (of everyday Internet activity) fit into RAM? Five minutes? Sabeur Aridhi Frameworks for Big Data Analytics 7 / 59
9 Introduction Parallel/Distributed computing Parallel/Distributed Computing Approaches: parallel algorithms vs. distributed algorithms. Pros (distributed): cheaper (scaling out vs. scaling up), scalable. Cons: needs a theory of distributed algorithms (how to distribute the computation?), cluster maintenance (fault tolerance, scheduling, ...). Sabeur Aridhi Frameworks for Big Data Analytics 8 / 59
10 Introduction Parallel/Distributed computing Parallel Computing Parallel vs. Distributed A parallel algorithm can have: A global clock Shared memory How to develop parallel computing programs? Hands-on approach Libraries available Still a lot of manual set-up required (set up threads, distribute data, ...) Sabeur Aridhi Frameworks for Big Data Analytics 9 / 59
11 Introduction Parallel/Distributed computing Distributed Computing (DC) Requires a Programming Model (PM) that is inherently parallelizable. Requires a framework that runs the program and provides fault-tolerance and scalability. Sabeur Aridhi Frameworks for Big Data Analytics 10 / 59
12 Introduction Parallel/Distributed computing Distributed Computing (DC) A programming model for DC should: abstract all low level details such as networking, scheduling,... allow for wide range of functionality be easy to use A framework for DC should provide: fault tolerance scalability data integrity Sabeur Aridhi Frameworks for Big Data Analytics 11 / 59
15 Introduction Distributed storage Distributed File System Big Data requires storage over many machines Each computing node needs to access the data. A distributed file system (DFS) abstracts the distributed storage A widely used file system (FS) is Google's DFS Google's DFS data is replicated across the network (failure resistant, faster read operations). The namenode stores file metadata, e.g. name and location of chunks; a datanode stores the chunks of that file. Sabeur Aridhi Frameworks for Big Data Analytics 12 / 59
16 Introduction Grid Computing Grid Computing - Definition A set of information resources (computers, databases, networks, ...). It provides users with tools and applications that treat those resources as a single virtual system. Sabeur Aridhi Frameworks for Big Data Analytics 13 / 59
17 Introduction Grid Computing Grid Computing - What is Grid Computing? Grid computing is the collection of computer resources from multiple locations to reach a common goal Characteristics of a Grid: no centralized control center scalability heterogeneity (of resources) dynamic and adaptable Sabeur Aridhi Frameworks for Big Data Analytics 14 / 59
18 Introduction Cloud Computing Cloud Computing Definition A large number of computers connected via the Internet. Applications delivered as services. Computing resources delivered as services. Pay as you go. Cloud services can be rapidly and elastically provisioned. The framework can be offered as a service by a provider, e.g. Amazon EC2 Sabeur Aridhi Frameworks for Big Data Analytics 15 / 59
19 Introduction Cloud Computing Layers of Cloud Computing Service models: Software as a Service (SaaS), Platform as a Service (PaaS), Infrastructure as a Service (IaaS). Sabeur Aridhi Frameworks for Big Data Analytics 16 / 59
20 MapReduce Contents 1 Introduction Big Data Parallel/Distributed computing Distributed storage Grid Computing Cloud Computing 2 MapReduce Programming model Framework Algorithms 3 Other frameworks SPARK STORM 4 Graph Processing Motivations Pregel model Sample Algorithms: SSSP Available implementations Blogel 5 Q and A Sabeur Aridhi Frameworks for Big Data Analytics 17 / 59
21 MapReduce Definition Google, 2004: MapReduce refers to a programming model and the corresponding implementation for processing and generating large data sets. Created to help Google developers to analyze huge datasets Sabeur Aridhi Frameworks for Big Data Analytics 18 / 59
22 MapReduce Programming model Map and Reduce in MapReduce Data format: key-value pairs The user (you) must define two functions: Mapper Map function: (k1, v1) → list(k2, v2) Reducer Reduce function: (k2, list(v2)) → list(v3) The Map functions are executed in parallel. k2, v2 are intermediate pairs. Mapper results are grouped using k2. The Reduce functions are executed in parallel. Sabeur Aridhi Frameworks for Big Data Analytics 19 / 59
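To make the two signatures concrete, here is a minimal, framework-independent sketch of the user-defined functions as Java interfaces; MapFunction and ReduceFunction are illustrative names, not part of any real framework's API.

    import java.util.AbstractMap.SimpleEntry;
    import java.util.List;

    // Map function: (k1, v1) -> list(k2, v2)
    interface MapFunction<K1, V1, K2, V2> {
        List<SimpleEntry<K2, V2>> map(K1 key, V1 value);
    }

    // Reduce function: (k2, list(v2)) -> list(v3)
    interface ReduceFunction<K2, V2, V3> {
        List<V3> reduce(K2 key, Iterable<V2> values);
    }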
23 MapReduce Programming model MapReduce data flow (simplified) Sabeur Aridhi Frameworks for Big Data Analytics 20 / 59
24 MapReduce Programming model Basic steps of a MapReduce program: Input is read from the file system (fs) The Map function is executed in parallel. These intermediate results are written to fs The intermediate results are read from fs The Reduce function is executed in parallel. These results are written to fs Sabeur Aridhi Frameworks for Big Data Analytics 21 / 59
26 MapReduce Programming model Example: Word Count (MapReduce's Hello World) Input A set of documents stored in a DFS Output The number of occurrences of each word in the set of documents. Idea Work on each document in parallel and then reduce the results for each word. Sabeur Aridhi Frameworks for Big Data Analytics 22 / 59
27 MapReduce Programming model Example: Word Count (MapReduce's Hello World) Algorithm: Map function Input: String filename, String content foreach word w in content do EmitIntermediate(w, 1); Algorithm: Reduce function Input: String key, Iterator values result ← 0; foreach v in values do result ← result + v; Emit(key, result); Sabeur Aridhi Frameworks for Big Data Analytics 23 / 59
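For concreteness, here is a sketch of this Word Count as a complete Hadoop Java job, essentially the classic example distributed with Hadoop; input and output paths are taken from the command line.

    import java.io.IOException;
    import java.util.StringTokenizer;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCount {
        // Map: emit (w, 1) for every word occurrence
        public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
            private final static IntWritable one = new IntWritable(1);
            private final Text word = new Text();

            @Override
            public void map(Object key, Text value, Context context)
                    throws IOException, InterruptedException {
                StringTokenizer itr = new StringTokenizer(value.toString());
                while (itr.hasMoreTokens()) {
                    word.set(itr.nextToken());
                    context.write(word, one);
                }
            }
        }

        // Reduce: sum the counts for each word
        public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
            private final IntWritable result = new IntWritable();

            @Override
            public void reduce(Text key, Iterable<IntWritable> values, Context context)
                    throws IOException, InterruptedException {
                int sum = 0;
                for (IntWritable val : values) sum += val.get();
                result.set(sum);
                context.write(key, result);
            }
        }

        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "word count");
            job.setJarByClass(WordCount.class);
            job.setMapperClass(TokenizerMapper.class);
            job.setCombinerClass(IntSumReducer.class);   // local pre-aggregation (see next slide)
            job.setReducerClass(IntSumReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }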
28 MapReduce Programming model Possible optimization Creating a pair for each occurrence of a particular word is inefficient Algorithm: Optimized Map function Input: String filename, String content H ← new HashTable; % H is an associative array that maps keys to values foreach word w in content do H[w] ← H[w] + 1 foreach word w in H do EmitIntermediate(w, H[w]) Sabeur Aridhi Frameworks for Big Data Analytics 24 / 59
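A sketch of this optimized mapper in Hadoop Java, assuming the same text input as the job above: the per-call hash table replaces one (w, 1) pair per occurrence with one (w, count) pair per distinct word. Aggregating across all map() calls in the mapper's cleanup() method ("in-mapper combining") pushes the same idea further.

    import java.io.IOException;
    import java.util.HashMap;
    import java.util.Map;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    public class OptimizedTokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        @Override
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            // H: associative array mapping each word to its local count
            Map<String, Integer> h = new HashMap<>();
            for (String w : value.toString().split("\\s+")) {
                if (!w.isEmpty()) h.merge(w, 1, Integer::sum);
            }
            // emit one pair per distinct word instead of one per occurrence
            for (Map.Entry<String, Integer> e : h.entrySet()) {
                context.write(new Text(e.getKey()), new IntWritable(e.getValue()));
            }
        }
    }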
29 MapReduce Programming model Combiner and Partitioner Combiner Operates between Mapper and Reducer. Combiner: (k2, list(v2)) → (k2, list(v3)) The combiner is applied before the global sort. Just like in the WordCount example. Partitioner Before the global sort, decides which reducer receives which key. Sabeur Aridhi Frameworks for Big Data Analytics 25 / 59
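As an illustration, here is a custom Hadoop partitioner that decides which reducer receives which key; FirstLetterPartitioner is a hypothetical name (Hadoop's default is hash partitioning), and the combiner is wired in the driver as in the WordCount job above.

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Partitioner;

    // Route keys by their first character, so that e.g. all words starting
    // with the same letter end up at the same reducer.
    public class FirstLetterPartitioner extends Partitioner<Text, IntWritable> {
        @Override
        public int getPartition(Text key, IntWritable value, int numPartitions) {
            if (key.getLength() == 0) return 0;
            return (Character.toLowerCase(key.charAt(0)) & Integer.MAX_VALUE) % numPartitions;
        }
    }

    // Wiring in the driver:
    //   job.setCombinerClass(IntSumReducer.class);
    //   job.setPartitionerClass(FirstLetterPartitioner.class);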
30 MapReduce Programming model Complete MapReduce flow Sabeur Aridhi Frameworks for Big Data Analytics 26 / 59
31 MapReduce Programming model Fault tolerance management In case of a worker failure: completed map tasks (but not completed reduce tasks) are reset to idle and re-executed, since their output lives on the failed machine's local disk; map and reduce tasks in progress are also reset to idle; workers executing reduce tasks are notified of the re-execution. In case of master failure: in the 2004 version, abort the computation and retry; in MRv2 (YARN), periodic checkpoints of the data structures allow launching a new copy from the last checkpoint. Sabeur Aridhi Frameworks for Big Data Analytics 27 / 59
32 MapReduce Framework Framework Google's MapReduce framework is not freely available Open source alternative: Apache's Hadoop Offered as a service on Amazon Elastic Compute Cloud (EC2) (via virtual machines) Sabeur Aridhi Frameworks for Big Data Analytics 28 / 59
33 MapReduce Framework Hadoop Architecture Sabeur Aridhi Frameworks for Big Data Analytics 29 / 59
34 MapReduce Framework Hadoop Details Library in Java Started by Yahoo in 2006 By default, Map and Reduce functions must be written in Java Possibility to use external programs as mappers and reducers Sabeur Aridhi Frameworks for Big Data Analytics 30 / 59
35 MapReduce Algorithms Inverted Index (Job interview question ;) Problem We are given a set of webpages P = {P 1,..., P n }. We want to know for each word in the dataset, in which documents it appears (and how many times) Sabeur Aridhi Frameworks for Big Data Analytics 31 / 59
36 MapReduce Algorithms Reminder of MapReduce flow Sabeur Aridhi Frameworks for Big Data Analytics 32 / 59
37 MapReduce Algorithms Baseline solution Algorithm: Map function(String filename, String content) H ← new HashTable; foreach word w in content do H[w] ← H[w] + 1 foreach word w in H do EmitIntermediate(w, (filename, H[w])) Algorithm: Reduce function(String key, Iterator values) result ← []; foreach v in values do result.add(v); result.sortByFreq(); Emit(key, result); Sabeur Aridhi Frameworks for Big Data Analytics 33 / 59
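A sketch of the same baseline in Hadoop Java, under the assumptions that each map call sees one document and that the filename is recoverable from the input split; postings are encoded as "filename:count" strings, and sortByFreq becomes a descending sort on the count suffix.

    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.Comparator;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileSplit;

    public class InvertedIndex {
        // Map: emit (word, "filename:count") for each distinct word of a document
        public static class IndexMapper extends Mapper<Object, Text, Text, Text> {
            @Override
            public void map(Object key, Text value, Context context)
                    throws IOException, InterruptedException {
                String filename = ((FileSplit) context.getInputSplit()).getPath().getName();
                Map<String, Integer> h = new HashMap<>();
                for (String w : value.toString().split("\\s+"))
                    if (!w.isEmpty()) h.merge(w, 1, Integer::sum);
                for (Map.Entry<String, Integer> e : h.entrySet())
                    context.write(new Text(e.getKey()), new Text(filename + ":" + e.getValue()));
            }
        }

        // Reduce: collect the posting list for each word, most frequent first
        public static class IndexReducer extends Reducer<Text, Text, Text, Text> {
            @Override
            public void reduce(Text key, Iterable<Text> values, Context context)
                    throws IOException, InterruptedException {
                List<String> postings = new ArrayList<>();
                for (Text v : values) postings.add(v.toString());
                postings.sort(Comparator.comparingInt(
                        (String s) -> Integer.parseInt(s.substring(s.lastIndexOf(':') + 1)))
                        .reversed());
                context.write(key, new Text(String.join(" ", postings)));
            }
        }
    }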
38 MapReduce Algorithms Simple run Sabeur Aridhi Frameworks for Big Data Analytics 34 / 59
40 MapReduce Algorithms K-Means clustering Problem definition Given a set of points, partition them to minimize the within-cluster sum of squares Standard algorithm (Lloyd 1982) Choose K centroids (m1, m2, ..., mK) at random While changes: Assign each point to its closest centroid Move each centroid to the average of the points assigned to it Sabeur Aridhi Frameworks for Big Data Analytics 35 / 59
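As a single-machine baseline before the distributed version, here is an illustrative Java sketch of Lloyd's algorithm for 2-D points (random initialization, squared Euclidean distance).

    import java.util.Arrays;
    import java.util.Random;

    public class Lloyd {
        // points[i] = {x, y}; returns the K final centroids
        public static double[][] kmeans(double[][] points, int k, long seed) {
            Random rnd = new Random(seed);
            double[][] c = new double[k][];
            for (int i = 0; i < k; i++)               // choose K centroids at random
                c[i] = points[rnd.nextInt(points.length)].clone();
            int[] assign = new int[points.length];
            Arrays.fill(assign, -1);
            boolean changed = true;
            while (changed) {                          // "while changes"
                changed = false;
                for (int p = 0; p < points.length; p++) {   // assignment step
                    int best = 0;
                    double bestD = Double.MAX_VALUE;
                    for (int i = 0; i < k; i++) {
                        double dx = points[p][0] - c[i][0], dy = points[p][1] - c[i][1];
                        double d = dx * dx + dy * dy;
                        if (d < bestD) { bestD = d; best = i; }
                    }
                    if (assign[p] != best) { assign[p] = best; changed = true; }
                }
                double[][] sum = new double[k][2];
                int[] n = new int[k];
                for (int p = 0; p < points.length; p++) {   // update step
                    sum[assign[p]][0] += points[p][0];
                    sum[assign[p]][1] += points[p][1];
                    n[assign[p]]++;
                }
                for (int i = 0; i < k; i++)
                    if (n[i] > 0) c[i] = new double[]{sum[i][0] / n[i], sum[i][1] / n[i]};
            }
            return c;
        }
    }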
41 MapReduce Algorithms K-Means run (figure: points with initial centroids m1, m2, m3) Sabeur Aridhi Frameworks for Big Data Analytics 36 / 59
46 MapReduce Algorithms MapReduce Algorithm K-Means is an iterative algorithm Each MapReduce job corresponds to one iteration of K-Means MapReduce implementation of K-Means Let C be the set of starting centroids Do: Set C′ = C Send C to all nodes Run the MapReduce algorithm Extract new centroids C from the output While C ≠ C′ Sabeur Aridhi Frameworks for Big Data Analytics 37 / 59
47 MapReduce Algorithms MapReduce Algorithm Algorithm: Map function Input: Int id, Point p k ← argmin_i dist(p, C_i); EmitIntermediate(k, p); Algorithm: Reduce function Input: centroid id, Iterator points result ← avg(points); Emit(centroid id, result); Sabeur Aridhi Frameworks for Big Data Analytics 38 / 59
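A minimal Hadoop Java sketch of one such iteration, assuming 2-D points stored one per line as "x,y" and the current centroids serialized into the job Configuration as "x,y;x,y;..." by a driver that re-runs the job until the centroids stop changing; all names are illustrative.

    import java.io.IOException;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    public class KMeansIteration {
        // Map: assign each point to its closest current centroid
        public static class AssignMapper extends Mapper<Object, Text, IntWritable, Text> {
            private double[][] c;  // current centroids, one per cluster

            @Override
            protected void setup(Context context) {
                String[] parts = context.getConfiguration().get("centroids").split(";");
                c = new double[parts.length][2];
                for (int i = 0; i < parts.length; i++) {
                    String[] xy = parts[i].split(",");
                    c[i][0] = Double.parseDouble(xy[0]);
                    c[i][1] = Double.parseDouble(xy[1]);
                }
            }

            @Override
            public void map(Object key, Text value, Context context)
                    throws IOException, InterruptedException {
                String[] xy = value.toString().split(",");
                double x = Double.parseDouble(xy[0]), y = Double.parseDouble(xy[1]);
                int k = 0;
                double best = Double.MAX_VALUE;
                for (int i = 0; i < c.length; i++) {  // k = argmin_i dist(p, C_i)
                    double d = (x - c[i][0]) * (x - c[i][0]) + (y - c[i][1]) * (y - c[i][1]);
                    if (d < best) { best = d; k = i; }
                }
                context.write(new IntWritable(k), value);
            }
        }

        // Reduce: move each centroid to the average of its assigned points
        public static class AverageReducer extends Reducer<IntWritable, Text, IntWritable, Text> {
            @Override
            public void reduce(IntWritable key, Iterable<Text> points, Context context)
                    throws IOException, InterruptedException {
                double sx = 0, sy = 0;
                long n = 0;
                for (Text p : points) {
                    String[] xy = p.toString().split(",");
                    sx += Double.parseDouble(xy[0]);
                    sy += Double.parseDouble(xy[1]);
                    n++;
                }
                context.write(key, new Text(sx / n + "," + sy / n));
            }
        }
    }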
48 MapReduce Algorithms MapReduce: Use and abuse Hadoop horror stories (from Yahoo) Map tasks that take a full machine for 3 days Jobs that have 10 MB of data per task, but 1 GB of data in distributed cache What are the problems? Iterative data-flows are not optimized (have to write to disk in every iteration!) No persistent data structure Sabeur Aridhi Frameworks for Big Data Analytics 39 / 59
49 Other frameworks Contents 1 Introduction Big Data Parallel/Distributed computing Distributed storage Grid Computing Cloud Computing 2 MapReduce Programming model Framework Algorithms 3 Other frameworks SPARK STORM 4 Graph Processing Motivations Pregel model Sample Algorithms: SSSP Available implementations Blogel 5 Q and A Sabeur Aridhi Frameworks for Big Data Analytics 40 / 59
50 Other frameworks SPARK SPARK SPARK framework A general engine for large-scale data processing. It supports SQL, streaming, and a wide range of functions. It offers several high-level operators that make it easy to build parallel applications. Sabeur Aridhi Frameworks for Big Data Analytics 41 / 59
51 Other frameworks SPARK SPARK Spark abstracts the distributed storage of objects via resilient distributed datasets (RDDs). Immutable collections of objects spread across a cluster. Built through parallel transformations (map, filter, etc.). Automatically rebuilt on failure. Controllable persistence (e.g. caching in RAM) for reuse. Shared variables that can be used in parallel operations. Sabeur Aridhi Frameworks for Big Data Analytics 42 / 59
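A small sketch of these ideas using Spark's Java API (Spark 2.x signatures assumed): an RDD is built from a file, transformed in parallel, cached in RAM for reuse, and reduced by key; the recorded lineage of transformations is what allows automatic rebuilding on failure.

    import java.util.Arrays;
    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaPairRDD;
    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.api.java.JavaSparkContext;
    import scala.Tuple2;

    public class SparkWordCount {
        public static void main(String[] args) {
            SparkConf conf = new SparkConf().setAppName("spark-word-count");
            JavaSparkContext sc = new JavaSparkContext(conf);

            JavaRDD<String> lines = sc.textFile(args[0]);   // RDD backed by a (distributed) file
            JavaRDD<String> words =
                    lines.flatMap(l -> Arrays.asList(l.split("\\s+")).iterator());
            words.cache();                                  // controllable persistence: keep in RAM

            JavaPairRDD<String, Integer> counts = words
                    .mapToPair(w -> new Tuple2<>(w, 1))     // parallel transformations build the lineage
                    .reduceByKey((a, b) -> a + b);
            counts.saveAsTextFile(args[1]);
            sc.stop();
        }
    }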
52 Other frameworks STORM STORM Background Created by Nathan Marz at BackType (later acquired by Twitter). Originally used to analyse graphs of users and the links between them. Open sourced in September 2011. Used by many companies in the Big Data industry: Twitter, Alibaba, Groupon, ... Sabeur Aridhi Frameworks for Big Data Analytics 43 / 59
53 Other frameworks STORM STORM Scalable and robust. Allows online computation and efficient processing of streaming data. Based on distributed Remote Procedure Call (RPC). Fault tolerant. Sabeur Aridhi Frameworks for Big Data Analytics 44 / 59
54 Other frameworks STORM System architecture System architecture Nimbus: like the JobTracker in Hadoop Supervisor: manages workers Zookeeper: stores metadata (e.g., file locations, ...) UI: Web UI Sabeur Aridhi Frameworks for Big Data Analytics 45 / 59
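To show how a Storm program is expressed, here is a minimal word-count topology sketch (Storm 1.x Java API assumed): a spout emits a stream of tuples, a bolt keeps a running count, and fields grouping routes the same word to the same bolt task. TestWordSpout is Storm's built-in test spout; LocalCluster runs everything in-process, so no Nimbus/Supervisor cluster is needed to try it.

    import java.util.HashMap;
    import java.util.Map;
    import org.apache.storm.Config;
    import org.apache.storm.LocalCluster;
    import org.apache.storm.testing.TestWordSpout;
    import org.apache.storm.topology.BasicOutputCollector;
    import org.apache.storm.topology.OutputFieldsDeclarer;
    import org.apache.storm.topology.TopologyBuilder;
    import org.apache.storm.topology.base.BaseBasicBolt;
    import org.apache.storm.tuple.Fields;
    import org.apache.storm.tuple.Tuple;
    import org.apache.storm.tuple.Values;

    public class RunningWordCount {
        // A bolt keeping a running count per word; state is per-task,
        // hence the fieldsGrouping below.
        public static class CountBolt extends BaseBasicBolt {
            private final Map<String, Integer> counts = new HashMap<>();

            @Override
            public void execute(Tuple input, BasicOutputCollector collector) {
                String word = input.getString(0);
                int count = counts.merge(word, 1, Integer::sum);
                collector.emit(new Values(word, count));
            }

            @Override
            public void declareOutputFields(OutputFieldsDeclarer declarer) {
                declarer.declare(new Fields("word", "count"));
            }
        }

        public static void main(String[] args) throws Exception {
            TopologyBuilder builder = new TopologyBuilder();
            builder.setSpout("words", new TestWordSpout(), 2);   // emits random words
            builder.setBolt("count", new CountBolt(), 4)
                   .fieldsGrouping("words", new Fields("word")); // same word -> same task
            LocalCluster cluster = new LocalCluster();
            cluster.submitTopology("running-word-count", new Config(), builder.createTopology());
            Thread.sleep(10000);
            cluster.shutdown();
        }
    }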
55 Graph Processing Contents 1 Introduction Big Data Parallel/Distributed computing Distributed storage Grid Computing Cloud Computing 2 MapReduce Programming model Framework Algorithms 3 Other frameworks SPARK STORM 4 Graph Processing Motivations Pregel model Sample Algorithms: SSSP Available implementations Blogel 5 Q and A Sabeur Aridhi Frameworks for Big Data Analytics 46 / 59
56 Graph Processing Motivations Motivations Many big data sets are graphs: Internet crawl data Social network data Bioinformatics (protein folding graphs,... ) Sabeur Aridhi Frameworks for Big Data Analytics 47 / 59
57 Graph Processing Motivations Applications for graph data Application domains Computer networks, social networks, bioinformatics, chemoinformatics. Graph representation Data modeling. Identifying relationship patterns and rules. Example figures: protein structure, chemical compound, social network. Sabeur Aridhi Frameworks for Big Data Analytics 48 / 59
58 Graph Processing Pregel model Pregel - Basic Facts Developed at Google. A Pregel program is based on a sequence of supersteps Each vertex in the graph is seen as a different process At each superstep, each vertex: Receives the list of messages sent to it during the last superstep Computes its new state Sends messages to other vertices Decides if it votes to halt When all vertices have voted to halt (and no messages are pending), the algorithm stops. Sabeur Aridhi Frameworks for Big Data Analytics 49 / 59
59 Graph Processing Pregel model Framework of Pregel Each process takes control of a part of the graph The vertices are divided among the nodes The order of computation inside each superstep is not important Can be built on top of MapReduce or other frameworks Sabeur Aridhi Frameworks for Big Data Analytics 50 / 59
60 Graph Processing Pregel model Single Source Shortest Path (SSSP) Single source shortest path Given a weighted graph and a source vertex s, compute the distance from s to all other vertices. Centralized solution Run Dijkstra's algorithm (a breadth-first search when all weights are equal). Sabeur Aridhi Frameworks for Big Data Analytics 51 / 59
61 Graph Processing Pregel model Pregel solution Algorithm: SSSPVertex integer dist ← (this == source) ? 0 : +∞; compute(messages): integer newdist ← min(messages); if newdist < dist then dist ← newdist; for Edge e ∈ this.getNeighbors() do sendMessage(e, dist + e.value()); else voteToHalt(); Sabeur Aridhi Frameworks for Big Data Analytics 52 / 59
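The same logic written against Apache Giraph's Java API (Giraph 1.x BasicComputation assumed; SOURCE_ID is an illustrative constant, not part of the API): a vertex updates its distance from the incoming messages, propagates improved distances along its edges, and votes to halt.

    import java.io.IOException;
    import org.apache.giraph.edge.Edge;
    import org.apache.giraph.graph.BasicComputation;
    import org.apache.giraph.graph.Vertex;
    import org.apache.hadoop.io.DoubleWritable;
    import org.apache.hadoop.io.FloatWritable;
    import org.apache.hadoop.io.LongWritable;

    public class SSSPComputation extends
            BasicComputation<LongWritable, DoubleWritable, FloatWritable, DoubleWritable> {

        private static final long SOURCE_ID = 1;  // illustrative: the chosen source vertex

        @Override
        public void compute(Vertex<LongWritable, DoubleWritable, FloatWritable> vertex,
                            Iterable<DoubleWritable> messages) throws IOException {
            if (getSuperstep() == 0) {
                vertex.setValue(new DoubleWritable(Double.MAX_VALUE));  // dist <- +inf
            }
            // newdist <- min over the incoming messages (0 at the source)
            double minDist = (vertex.getId().get() == SOURCE_ID) ? 0d : Double.MAX_VALUE;
            for (DoubleWritable message : messages) {
                minDist = Math.min(minDist, message.get());
            }
            if (minDist < vertex.getValue().get()) {
                vertex.setValue(new DoubleWritable(minDist));
                for (Edge<LongWritable, FloatWritable> edge : vertex.getEdges()) {
                    sendMessage(edge.getTargetVertexId(),
                                new DoubleWritable(minDist + edge.getValue().get()));
                }
            }
            vertex.voteToHalt();  // reactivated automatically if a message arrives
        }
    }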
62 Graph Processing Pregel model Example: SSSP Shortest path Parallel BFS in Pregel Sabeur Aridhi Frameworks for Big Data Analytics 53 / 59
71 Graph Processing Pregel model Available implementations Two implementations: GoldenOrb and Apache Giraph Based on Apache Hadoop GoldenOrb Developed by an independent company, sponsored by Ravel (Austin TX) Launched June 2011 Apache Giraph Developed in the Apache Incubator public class MyVertex extends Vertex<IntWritable, IntWritable, IntMessage> { ... public void compute(Collection<IntMessage> messages) { ... sendMessage(new IntMessage(dest, new IntWritable(42))); } } Sabeur Aridhi Frameworks for Big Data Analytics 54 / 59
72 Graph Processing Blogel Blogel - Programming Model Blogel is a block-centric framework for distributed computation on large graphs. From Think Like a Vertex (Pregel) to Think Like a Block (Blogel) A block refers to a connected subgraph of the graph. Message exchanges occur among blocks. Sabeur Aridhi Frameworks for Big Data Analytics 55 / 59
73 Graph Processing Blogel Blogel Three types of jobs: 1 Vertex-centric graph computing, where a worker is called a V-worker. 2 Graph partitioning which groups vertices into blocks, where a worker is called a partitioner. 3 Block-centric graph computing, where a worker is called a B-worker. Sabeur Aridhi Frameworks for Big Data Analytics 56 / 59
74 Graph Processing Blogel Blogel - Programming Model Blogel operates in three computing modes, depending on the application. B-mode: Only block-level message exchanges are allowed. A job terminates when all blocks have voted to halt and there is no pending message for the next superstep. V-mode: Only vertex-level message exchanges are allowed. A job terminates when all vertices have voted to halt and there is no pending message for the next superstep. VB-mode: Both vertex-level and block-level message exchanges are allowed. Sabeur Aridhi Frameworks for Big Data Analytics 57 / 59
75 Q and A Contents 1 Introduction Big Data Parallel/Distributed computing Distributed storage Grid Computing Cloud Computing 2 MapReduce Programming model Framework Algorithms 3 Other frameworks SPARK STORM 4 Graph Processing Motivations Pregel model Sample Algorithms: SSSP Available implementations Blogel 5 Q and A Sabeur Aridhi Frameworks for Big Data Analytics 58 / 59
76 Q and A That's it. Why was this important? Questions? [email protected] Sabeur Aridhi Frameworks for Big Data Analytics 59 / 59