Department of Computer Science University of Cyprus EPL646 Advanced Topics in Databases. Lecture 15




Department of Computer Science, University of Cyprus
EPL646 Advanced Topics in Databases
Lecture 15: Big Data Management V (Big-data Analytics / Map-Reduce)
Chapters 16 and 19: Abiteboul et al.
Demetris Zeinalipour
http://www.cs.ucy.ac.cy/~dzeina/courses/epl646
15-1

EPL646: Part B. Distributed/Web/Cloud DBs/Dstores
[Figure: Venn diagram by the 451 Group marking the lecture focus relative to OLTP and OLAP systems. Source: http://xeround.com/blog/2011/04/newsql-cloud-database-as-a-service]
15-2

Lecture Outline
- Introduction to "Big-Data" Analytics
  - Example Scenarios and Architectures
- Map-Reduce Programming Model
  - Microsoft's Dryad Programming Model
  - Map-Reduce Counting Problem
- Map-Reduce Architecture
  - Hadoop JobTracker, TaskTrackers and data-nodes
  - Failure Management
- Map-Reduce Optimizations
  - Combiners, Compression, In-Memory Shuffling, Speculative Execution
- Programming Map-Reduce
  - With Languages, PIG and in-the-cloud
15-3

Big-data Analytics 15-4

Big-data Analytics (Example)
We have a large file of words, one word per line, e.g., when analyzing web server logs for popular URLs:
  154.16.20.4
  14.16.20.4
  154.16.20.4
  11.23.54.11
Count the number of times each distinct word appears in the file, i.e., sort datafile | uniq -c:
  154.16.20.4  2
  14.16.20.4   1
  11.23.54.11  1
This scenario captures the essence of MapReduce. The great thing is that it is naturally parallelizable!
15-5
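The counting step described on this slide (sort the file, then count each distinct line) can be sketched in a few lines of Python. This is only an illustration on a single machine; the function name `count_lines` and the in-memory list `log` are assumptions made for the example, not part of the lecture.

```python
from collections import Counter

def count_lines(lines):
    """Count how often each distinct non-empty line occurs."""
    return Counter(line.strip() for line in lines if line.strip())

# The four log entries shown on the slide.
log = ["154.16.20.4", "14.16.20.4", "154.16.20.4", "11.23.54.11"]
counts = count_lines(log)

# Print each distinct value with its count, most frequent first.
for value, n in counts.most_common():
    print(value, n)
```

A `Counter` avoids the explicit sort because it hashes each line, which works as long as the distinct values fit in memory.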

Big-data Analytics 15-6

Map-Reduce Programming Model 15-7

Map-Reduce Programming Model
Another model: Microsoft Dryad, a programming model for writing parallel and distributed programs that scale from a small cluster to a large data center.
Higher-level language: DryadLINQ
15-8

Dryad Programming Model
Microsoft Dryad: a programming model for writing parallel and distributed programs that scale from a small cluster to a large data center.
http://research.microsoft.com/en-us/projects/dryad/
Higher-level language: DryadLINQ (the PIG counterpart)
Let us get back to Map-Reduce!
15-9

Map-Reduce Problem
Count the distinct words in all documents:
  cat *.txt | sort | uniq -c
1 TB on 1 PC = 2 hours!!!
1 TB on 100 PCs = 1 min!!!
15-10
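The timings on this slide are roughly what ideal linear speedup would give: 120 minutes divided over 100 machines is 1.2 minutes. The sketch below illustrates why the problem divides so cleanly: each chunk is counted independently and the partial counts are merged at the end. It assumes the words fit in memory, and the names `count_chunk` and `parallel_count` are hypothetical. Threads are used only to show the partition-and-merge structure; they do not give a real speedup for CPU-bound Python code.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

def count_chunk(words):
    """Count the words of one chunk, independently of all other chunks."""
    return Counter(words)

def parallel_count(words, workers=4):
    """Split the input into chunks, count each one, then merge the partial counts."""
    size = max(1, -(-len(words) // workers))  # ceiling division
    chunks = [words[i:i + size] for i in range(0, len(words), size)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(count_chunk, chunks))
    total = Counter()
    for partial in partials:
        total.update(partial)  # adds counts key by key
    return total

words = "a b a c b a".split()
result = parallel_count(words, workers=3)

# Ideal linear speedup for the slide's numbers: 2 hours spread over 100 PCs.
ideal_minutes = 2 * 60 / 100
```

Merging works because counting is additive: the total for a word is the sum of its counts in every chunk, regardless of how the input was split.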

Map-Reduce Example
The example uses 1 mapper / 1 reducer only!
[Figure: the Map, Shuffle and Reduce phases]
15-11

Map-Reduce Programming Model
[Figure labels: (dumping), (hashing / sorting), (grouping)]
15-12
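The three phases of the model (map, shuffle, reduce) can be sketched for the word-count problem as follows. This is a single-process illustration with one mapper and one reducer, not Hadoop code; the function names `map_phase`, `shuffle_phase` and `reduce_phase` are assumptions chosen for the example.

```python
from collections import defaultdict

def map_phase(documents):
    """Map: emit a (word, 1) pair for every word of every document."""
    for doc in documents:
        for word in doc.split():
            yield (word, 1)

def shuffle_phase(pairs):
    """Shuffle: collect all values that share a key, with keys in sorted order."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return dict(sorted(groups.items()))

def reduce_phase(groups):
    """Reduce: collapse the list of values of each key into one result."""
    return {key: sum(values) for key, values in groups.items()}

docs = ["the cat sat", "the dog sat"]
word_counts = reduce_phase(shuffle_phase(map_phase(docs)))
print(word_counts)  # {'cat': 1, 'dog': 1, 'sat': 2, 'the': 2}
```

Only the map and reduce functions are specific to word counting; the shuffle step is generic, which is why a framework can provide it.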

Map-Reduce Architecture (e.g., in Hadoop)
[Figure labels: HDFS blocks (64 MB, containing documents); standard output (e.g., socket); hashing; HDFS reading; local shuffling (of terms); remote write (e.g., socket); HDFS writing]
15-13
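The "hashing" step in this architecture decides which reducer receives each term, so that all pairs with the same term end up at the same place. A minimal sketch of hash-modulo partitioning follows. The use of `zlib.crc32` as the hash function and the names `partition` and `route` are assumptions made for this example; Hadoop's own default partitioner uses the Java hash code of the key instead.

```python
import zlib

def partition(term, num_reducers):
    """Pick a reducer index for a term with a deterministic hash, modulo the reducer count."""
    return zlib.crc32(term.encode("utf-8")) % num_reducers

def route(pairs, num_reducers):
    """Place every (term, value) pair in the bucket of the reducer that owns the term."""
    buckets = [[] for _ in range(num_reducers)]
    for term, value in pairs:
        buckets[partition(term, num_reducers)].append((term, value))
    return buckets

pairs = [("the", 1), ("cat", 1), ("the", 1), ("sat", 1)]
buckets = route(pairs, 3)
```

A deterministic hash matters here: mappers on different machines must agree on the reducer for a given term, which Python's built-in `hash` on strings does not guarantee across processes.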

Map-Reduce Architecture (e.g., in Hadoop) 15-14

Map-Reduce Architecture (Processing Remarks) 15-15

Map-Reduce Architecture (Failure Management) "ZooKeeper: Wait-free coordination for Internet-scale systems", Hunt et al., USENIX 2010, http://static.usenix.org/event/usenix10/tech/full_papers/hunt.pdf 15-17

Map-Reduce Optimizations (Combiners)
* Distributive: COUNT, MIN, MAX, SUM. Algebraic: AVG, STDDEV. Holistic (all values are necessary): MEDIAN, RANK.
15-18
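The distinction between these aggregate classes determines what a combiner may safely pre-aggregate on the mapper side. The sketch below illustrates it under simple assumptions: two mappers hold the values [1, 2, 3] and [10], and the helper names are hypothetical. A distributive aggregate such as SUM can be combined directly, an algebraic one such as AVG needs a small fixed-size partial state (here a sum and a count), and a holistic one such as MEDIAN needs every value.

```python
import statistics

def combine_sum(values):
    """Distributive: partial sums can simply be summed again."""
    return sum(values)

def combine_avg(values):
    """Algebraic: keep a (sum, count) pair as the partial state."""
    return (sum(values), len(values))

def merge_avg(partials):
    """Merge (sum, count) pairs from several combiners and finish the average."""
    total = sum(s for s, _ in partials)
    count = sum(c for _, c in partials)
    return total / count

mapper_a = [1, 2, 3]
mapper_b = [10]

# SUM: combining per mapper and then again at the reducer gives the true sum.
total_sum = combine_sum([combine_sum(mapper_a), combine_sum(mapper_b)])

# AVG: merging (sum, count) pairs gives the true average ...
true_avg = merge_avg([combine_avg(mapper_a), combine_avg(mapper_b)])
# ... whereas averaging the per-mapper averages does not.
naive_avg = (sum(mapper_a) / len(mapper_a) + sum(mapper_b) / len(mapper_b)) / 2

# MEDIAN: holistic, so the reducer needs all values, not a fixed-size summary.
true_median = statistics.median(mapper_a + mapper_b)
```

With these inputs the merged average is 4.0 while the average of averages is 6.0, which is why AVG cannot reuse the reducer as its combiner without carrying the count along.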

Map-Reduce Optimizations (Compression) * 15-19

Map-Reduce Optimizations (Shuffling in Memory) * 15-20

Map-Reduce Optimizations (Speculative Execution) * 15-21

MapReduce in Hadoop (MR => HADOOP => HBASE)
- Map-Reduce: a programming model for processing large data sets. Invented by Google!
  - "MapReduce: Simplified Data Processing on Large Clusters", Jeffrey Dean and Sanjay Ghemawat, OSDI'04: Sixth Symposium on Operating System Design and Implementation, San Francisco, CA, December 2004.
  - Can be implemented in any language (recall the JavaScript Map-Reduce we used in the context of CouchDB).
- Hadoop: Apache's open-source software framework that supports data-intensive distributed applications.
  - Derived from Google's MapReduce and Google File System (GFS) papers (with input from Yahoo!, Facebook, etc.).
  - Enables applications to work with thousands of computation-independent computers and petabytes of data.
  - Download: http://hadoop.apache.org/
15-22
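Since the model can be implemented in any language, one common way to use a language other than Java with Hadoop is the Hadoop Streaming convention, in which the mapper and the reducer are ordinary programs that exchange tab-separated key/value lines, and the reducer receives its input sorted by key. The sketch below follows that convention in Python but runs entirely in memory; the function names and the use of `sorted` as a stand-in for the framework's shuffle are assumptions made for the example.

```python
from itertools import groupby

def mapper(lines):
    """Emit one 'word<TAB>1' line per word, as a streaming mapper would write to stdout."""
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"

def reducer(sorted_lines):
    """Sum the counts of consecutive lines that share a key (input must be sorted by key)."""
    parsed = (line.rstrip("\n").split("\t", 1) for line in sorted_lines)
    for word, group in groupby(parsed, key=lambda kv: kv[0]):
        yield f"{word}\t{sum(int(value) for _, value in group)}"

# sorted() stands in for the shuffle/sort the framework performs between the phases.
mapped = sorted(mapper(["b a", "a c a"]))
output = list(reducer(mapped))
```

In a real streaming job the two functions would live in separate scripts that read from standard input and write to standard output; here they take and return iterables so that the logic can be checked without a cluster.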

MapReduce in Hadoop (Who is driving Hadoop?) 15-23

MapReduce in Hadoop (MR => HADOOP => HBASE)
Hadoop Project Modules:
- Hadoop Common: The common utilities that support the other Hadoop modules.
- Hadoop Distributed File System (HDFS): A distributed file system that provides high-throughput access to application data.
- Hadoop YARN (Yet Another Resource Negotiator): A framework for job scheduling and cluster resource management.
- Hadoop MapReduce (MapReduce v2.0): A YARN-based system for parallel processing of large data sets.
Other Hadoop-related projects at Apache include:
- Avro: A data serialization system.
- Cassandra: A scalable multi-master database with no single points of failure.
- Chukwa: A data collection system for managing large distributed systems.
- HBase (Hadoop Database): A scalable, distributed database that supports structured data storage for large tables. (Next Lectures)
- Hive: A data warehouse infrastructure that provides data summarization and ad hoc querying.
- Mahout: A scalable machine learning and data mining library.
- Pig: A high-level data-flow language and execution framework for parallel computation. (Next Lectures)
- ZooKeeper: A high-performance coordination service for distributed applications.
15-24

Programming with Hadoop (with Languages) * Our Focus! 15-25

Programming with Hadoop (in the Cloud!) * Our Focus! 15-26