Department of Computer Science University of Cyprus EPL646 Advanced Topics in Databases. Lecture 15



1 Department of Computer Science, University of Cyprus. EPL646 Advanced Topics in Databases. Lecture 15: Big Data Management V (Big-data Analytics / Map-Reduce). Chapters 16 and 19: Abiteboul et al. Demetris Zeinalipour


2 EPL646: Part B, Distributed/Web/Cloud DBs/Dstores. Lecture Focus: (OLTP) (OLAP). [Figure: Venn diagram by 451 Group]


3 Lecture Outline
Introduction to "Big-Data" Analytics: Example Scenarios and Architectures.
Map-Reduce Programming Model: Microsoft's Dryad Programming Model, Map-Reduce Counting Problem.
Map-Reduce Architecture: Hadoop JobTracker, TaskTrackers and data-nodes, Failure Management.
Map-Reduce Optimizations: Combiners, Compression, In-Memory Shuffling, Speculative Execution.
Programming Map-Reduce: With Languages, PIG and in-the-cloud.


4 Big-data Analytics

5 Big-data Analytics (Example)
We have a large file of words, one word per line. Count the number of times each distinct word appears in the file, i.e., sort datafile | uniq -c
Example use: analyze web server logs for popular URLs.
This scenario captures the essence of MapReduce. The great thing is that it is naturally parallelizable!
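The counting task on this slide can be sketched on a single machine before thinking about parallelism. The following is a minimal illustration, not code from the lecture: the function name `count_words` and the sample data are assumptions made for the example.

```python
from collections import Counter


def count_words(lines):
    """Count how often each distinct word occurs, given one word per line."""
    counts = Counter()
    for line in lines:
        word = line.strip()
        if word:  # skip empty lines
            counts[word] += 1
    return counts


# Hypothetical input: in practice this would be the lines of a large file.
sample = ["apple", "pear", "apple", "fig", "pear", "apple"]
counts = count_words(sample)

# Print "count word" pairs in sorted word order.
for word, n in sorted(counts.items()):
    print(n, word)
```

Each word is counted independently of the others, which is why the work can be split across machines: separate chunks of the file can be counted separately and the per-word counts added up afterwards.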

6 Big-data Analytics

7 Map-Reduce Programming Model

8 Map-Reduce Programming Model
Another model: Microsoft Dryad, a programming model for writing parallel and distributed programs that scale from a small cluster to a large data center. Higher-level language: DryadLINQ.

9 Dryad Programming Model
Microsoft Dryad: a programming model for writing parallel and distributed programs that scale from a small cluster to a large data center. Higher-level language: DryadLINQ (the PIG counterpart). Let us get back to Map-Reduce!


10 Map-Reduce Problem
Count the distinct words in all documents: cat *.txt | sort | uniq -c
1 TB on 1 PC = 2 hours! 1 TB on 100 PCs = 1 min!

11 Map-Reduce Example
The example uses 1 mapper and 1 reducer only! Phases: Map, Shuffle, Reduce.

12 Map-Reduce Programming Model (dumping) (hashing / sorting) (grouping)
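The three phases of the word-count example (Map, Shuffle, Reduce) can be sketched in plain Python with one mapper and one reducer. This is only an in-memory illustration under assumed names (`map_phase`, `shuffle_phase`, `reduce_phase`, `word_count`); it does not use Hadoop or any distributed runtime.

```python
from collections import defaultdict


def map_phase(doc_id, text):
    """Map: emit a (term, 1) pair for every term in the document."""
    for term in text.split():
        yield term, 1


def shuffle_phase(pairs):
    """Shuffle: bring all values of the same term together, sorted by term."""
    groups = defaultdict(list)
    for term, value in pairs:
        groups[term].append(value)
    return dict(sorted(groups.items()))


def reduce_phase(term, values):
    """Reduce: collapse the list of values for one term into a total."""
    return term, sum(values)


def word_count(documents):
    """Run the three phases in sequence over a {doc_id: text} mapping."""
    intermediate = [
        pair
        for doc_id, text in documents.items()
        for pair in map_phase(doc_id, text)
    ]
    grouped = shuffle_phase(intermediate)
    return dict(reduce_phase(term, values) for term, values in grouped.items())


# Hypothetical input documents.
docs = {"d1": "the cat sat", "d2": "the dog sat", "d3": "the end"}
result = word_count(docs)
print(result)
```

The programmer only writes the map and reduce functions; in a real deployment the shuffle step is provided by the framework.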

13 Map-Reduce Architecture (e.g., in Hadoop)
[Figure labels: HDFS blocks (64 MB, containing documents); Standard Output (e.g., socket); Hashing; HDFS Reading; Local Shuffling (of terms); Remote Write (e.g., socket); HDFS Writing]
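The "Hashing" and "Local Shuffling (of terms)" labels refer to deciding which reducer receives each term. A minimal sketch of hash partitioning follows, assuming a CRC32-based hash and the hypothetical helper names `partition` and `local_shuffle`; it is not Hadoop's actual partitioner code.

```python
import zlib


def partition(term, num_reducers):
    """Pick a reducer index for a term using a deterministic hash."""
    # zlib.crc32 is stable across runs, unlike Python's built-in hash() for str.
    return zlib.crc32(term.encode("utf-8")) % num_reducers


def local_shuffle(pairs, num_reducers):
    """Split a mapper's (term, value) output into one bucket per reducer."""
    buckets = [[] for _ in range(num_reducers)]
    for term, value in pairs:
        buckets[partition(term, num_reducers)].append((term, value))
    return buckets


# Hypothetical mapper output.
mapper_output = [("the", 1), ("cat", 1), ("the", 1), ("sat", 1)]
buckets = local_shuffle(mapper_output, num_reducers=3)
for index, bucket in enumerate(buckets):
    print(index, bucket)
```

Because the partition depends only on the term, every occurrence of a term, from any mapper, lands at the same reducer, which is what makes per-term grouping possible.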

14 Map-Reduce Architecture (e.g., in Hadoop)

15 Map-Reduce Architecture (Processing Remarks)

16 Map-Reduce Architecture (Failure Management)
"ZooKeeper: Wait-free coordination for Internet-scale systems", Hunt et al., USENIX 2010.

17 Map-Reduce Optimizations (Combiners)
* Distributive: COUNT, MIN, MAX, SUM. Algebraic: AVG, STDDEV. Holistic: MEDIAN, RANK (all values are necessary).
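The distinction between distributive and algebraic aggregates matters for combiners: an algebraic function such as AVG cannot be combined by averaging partial averages, but it can be computed from a small fixed-size partial state. The sketch below assumes a (sum, count) pair as that state; the names `combine_avg` and `reduce_avg` and the sample values are hypothetical.

```python
def combine_avg(values):
    """Combiner: shrink one mapper's values to a (sum, count) partial state."""
    return sum(values), len(values)


def reduce_avg(partials):
    """Reducer: merge the partial states and derive the global average."""
    total = sum(s for s, _ in partials)
    count = sum(c for _, c in partials)
    return total / count


# Hypothetical values emitted for one key by three different mappers.
mapper_outputs = [[1, 2, 3], [10], [4, 6]]

partials = [combine_avg(values) for values in mapper_outputs]
avg = reduce_avg(partials)  # 26 / 6

# Averaging the per-mapper averages gives a different, incorrect result.
mean_of_means = sum(sum(v) / len(v) for v in mapper_outputs) / len(mapper_outputs)

print(partials, avg, mean_of_means)
```

Distributive functions such as SUM or COUNT need no extra state, since the combiner output has the same form as the final result. Holistic functions such as MEDIAN have no bounded partial state, so a combiner cannot reduce the data sent to the reducer.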

18 Map-Reduce Optimizations (Compression)

19 Map-Reduce Optimizations (Shuffling in Memory)

20 Map-Reduce Optimizations (Speculative Execution)

21 MapReduce in Hadoop (MR => HADOOP => HBASE)
Map-Reduce: a programming model for processing large data sets, invented by Google: "MapReduce: Simplified Data Processing on Large Clusters", Jeffrey Dean and Sanjay Ghemawat, OSDI'04: Sixth Symposium on Operating System Design and Implementation, San Francisco, CA, December 2004. It can be implemented in any language (recall the JavaScript Map-Reduce we used in the context of CouchDB).
Hadoop: Apache's open-source software framework that supports data-intensive distributed applications. Derived from Google's MapReduce and Google File System (GFS) papers (with input by Yahoo!, Facebook, etc.). Enables applications to work with thousands of computation-independent computers and petabytes of data.
Download:

22 MapReduce in Hadoop (Who is driving Hadoop?)

23 MapReduce in Hadoop (MR => HADOOP => HBASE)
Hadoop Project Modules:
Hadoop Common: The common utilities that support the other Hadoop modules.
Hadoop Distributed File System (HDFS): A distributed file system that provides high-throughput access to application data.
Hadoop YARN (Yet Another Resource Negotiator): A framework for job scheduling and cluster resource management.
Hadoop MapReduce (MapReduce v2.0): A YARN-based system for parallel processing of large data sets.
Other Hadoop-related projects at Apache include:
Avro: A data serialization system.
Cassandra: A scalable multi-master database with no single points of failure.
Chukwa: A data collection system for managing large distributed systems.
HBase (Hadoop Database): A scalable, distributed database that supports structured data storage for large tables. (Next Lectures)
Hive: A data warehouse infrastructure that provides data summarization and ad hoc querying.
Mahout: A scalable machine learning and data mining library.
Pig: A high-level data-flow language and execution framework for parallel computation. (Next Lectures)
ZooKeeper: A high-performance coordination service for distributed applications.

24 Programming with Hadoop (with Languages) * Our Focus!

25 Programming with Hadoop (in the Cloud!) * Our Focus!
