Data Management Using MapReduce

1 Data Management Using MapReduce M. Tamer Özsu University of Waterloo CS742-Distributed & Parallel DBMS M. Tamer Özsu 1 / 24

2 Basics
- For data analysis of very large data sets
  - Highly dynamic, irregular, schemaless, etc.
  - SQL too heavy
- Embarrassingly parallel problems
- New, simple parallel programming model
- Data structured as (key, value) pairs
  - E.g., (doc-id, content), (word, count), etc.
- Functional programming style with two functions to be given:
  - Map(k1, v1) → list(k2, v2)
  - Reduce(k2, list(v2)) → list(v3)
- Implemented on a distributed file system (e.g., Google File System) on very large clusters
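The two signatures can be tried out on a single machine. The sketch below is only an illustration of the programming model, not how a real cluster executes it: `run_mapreduce`, `wc_map`, `wc_reduce`, and the sample documents are hypothetical names and data introduced here, using the (doc-id, content) and (word, count) pairs mentioned on the slide.

```python
from collections import defaultdict


def run_mapreduce(records, map_fn, reduce_fn):
    """Single-process stand-in for the model: map, group by key, reduce."""
    groups = defaultdict(list)
    for k1, v1 in records:
        # Map(k1, v1) -> list(k2, v2)
        for k2, v2 in map_fn(k1, v1):
            groups[k2].append(v2)
    output = []
    for k2 in sorted(groups):
        # Reduce(k2, list(v2)) -> list(v3)
        output.extend(reduce_fn(k2, groups[k2]))
    return output


def wc_map(doc_id, content):
    # Emit (word, 1) for every word in the document.
    return [(word, 1) for word in content.split()]


def wc_reduce(word, counts):
    # Sum the partial counts collected for one word.
    return [(word, sum(counts))]


docs = [("d1", "map reduce map"), ("d2", "reduce")]
word_counts = run_mapreduce(docs, wc_map, wc_reduce)
print(word_counts)
```

The user supplies only `wc_map` and `wc_reduce`; grouping by key is done by the driver, which mirrors the division of labour between user code and the MapReduce library.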

3 Map Function
- User-defined function
  - Processes input key/value pairs
  - Produces a set of intermediate key/value pairs
- Map function I/O
  - Input: reads a chunk from the distributed file system (DFS)
  - Output: writes to an intermediate file on local disk
- MapReduce library
  - Executes the map function
  - Groups together all intermediate values with the same key (i.e., generates a set of lists)
  - Passes these lists to the reduce functions
- Effect of the map function
  - Processes and partitions the input data
  - Builds a distributed map (transparent to the user)
  - Similar to the GROUP BY operation in SQL

4 Reduce Function
- User-defined function
  - Accepts one intermediate key and a set of values for that key (i.e., a list)
  - Merges these values together to form a (possibly) smaller set
  - Typically, zero or one output value is generated per invocation
- Reduce function I/O
  - Input: reads from intermediate files, using remote reads on the local files of the corresponding mapper nodes
  - Output: each reducer writes its output as a file back to the DFS
- Effect of the reduce function
  - Similar to the aggregation operation in SQL

5 MapReduce Processing
[Figure: MapReduce processing. Map tasks read the input data set and emit (k1, v) and (k2, v) pairs; the pairs are grouped by key into lists such as (k1, (v, v, v)); one Reduce task per group writes to the output data set.]

6 Example 1
Assume you are reading the monthly average temperatures for each of the 12 months of a year for a number of cities, i.e., each input is the name of the city (key) and the average monthly temperature. Compute the average annual temperature for each city.
Map:
  Input: City, Month, MonthAvgTemp
  1. Create key/value pairs
  Output: City, MonthAvgTemp
Reduce:
  Input: City, MonthAvgTemp
  1. Sort to get City, list(MonthAvgTemp) (i.e., combine in a list the monthly average temperatures for a given city)
  2. Compute the average over list(MonthAvgTemp)
  Output: City, AnnualAvgTemp
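A minimal single-process sketch of Example 1 follows. The function names and the sample readings are made up for illustration (only two months per city are listed to keep it short); the grouping loop stands in for the sort step that the framework would perform.

```python
from collections import defaultdict


def map_temp(record):
    # Input: (City, Month, MonthAvgTemp); output: (City, MonthAvgTemp).
    city, _month, month_avg_temp = record
    return (city, month_avg_temp)


def reduce_temp(city, temps):
    # Input: City, list(MonthAvgTemp); output: (City, AnnualAvgTemp).
    return (city, sum(temps) / len(temps))


def annual_averages(records):
    grouped = defaultdict(list)
    for record in records:
        city, temp = map_temp(record)
        grouped[city].append(temp)  # stands in for the sort/group step
    return dict(reduce_temp(city, temps) for city, temps in grouped.items())


# Hypothetical sample data, not taken from the slides.
readings = [
    ("Waterloo", 1, -6.0),
    ("Waterloo", 7, 20.0),
    ("Lima", 1, 23.0),
    ("Lima", 7, 17.0),
]
averages = annual_averages(readings)
print(averages)
```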

7 Example 2
Consider EMP(ENAME, TITLE, CITY)
Query:
  SELECT CITY, COUNT(*)
  FROM EMP
  WHERE ENAME LIKE '%Smith'
  GROUP BY CITY
Map:
  Input: (TID, emp), Output: (CITY, 1)
  if emp.ENAME like '%Smith' return (CITY, 1)
Reduce:
  Input: (CITY, list(1)), Output: (CITY, SUM(list(1)))
  return (CITY, SUM(list(1)))
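The query in Example 2 can be mimicked in a few lines. This is a sketch under assumptions: the tuple layout (a dictionary per employee), the function names, and the sample rows are hypothetical, and `LIKE '%Smith'` is interpreted as "ENAME ends with Smith".

```python
from collections import defaultdict


def map_emp(tid, emp):
    # Selection happens in the map function: emit (CITY, 1) for matches only.
    if emp["ENAME"].endswith("Smith"):
        yield (emp["CITY"], 1)


def reduce_city(city, ones):
    # COUNT(*) per group is the sum of the emitted ones.
    return (city, sum(ones))


def count_smiths(emp_table):
    grouped = defaultdict(list)
    for tid, emp in emp_table:
        for city, one in map_emp(tid, emp):
            grouped[city].append(one)
    return sorted(reduce_city(city, ones) for city, ones in grouped.items())


# Hypothetical EMP rows keyed by tuple id (TID).
EMP = [
    (1, {"ENAME": "J. Smith", "TITLE": "Engineer", "CITY": "Toronto"}),
    (2, {"ENAME": "A. Lee", "TITLE": "Analyst", "CITY": "Toronto"}),
    (3, {"ENAME": "B. Smith", "TITLE": "Manager", "CITY": "Paris"}),
    (4, {"ENAME": "C. Smith", "TITLE": "Engineer", "CITY": "Toronto"}),
]
smith_counts = count_smiths(EMP)
print(smith_counts)
```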

8 MapReduce Architecture
[Figure: MapReduce architecture. The master hosts the scheduler. Each worker running a map process contains an input module, a map module, a combine module, and a partition module. Each worker running a reduce process contains a group module, a reduce module, and an output module.]

9 Execution Flow with Architecture
[Figure: Execution overview (Fig. 1 of the cited paper). (1) The user program forks the master and the workers. (2) The master assigns map and reduce tasks to workers. (3) Map workers read the input splits (split 0 to split 4). (4) Map workers write intermediate files to their local disks. (5) Reduce workers read the intermediate files remotely. (6) Reduce workers write the output files (output file 0, output file 1). Stages, left to right: input files, map phase, intermediate files (on local disks), reduce phase, output files.]
From: J. Dean and S. Ghemawat, MapReduce: Simplified Data Processing on Large Clusters, Comm. ACM.

10 Characteristics
- Flexibility
  - User can write any map and reduce function code
  - No need to know how to parallelize
- Scalability
  - Elastic scalability
  - Automatic load balancing
- Efficiency
  - Simple: parallel scan
  - No database loading
- Fault tolerance
  - Worker failure
    - Master pings workers periodically; assumes failure if no response
    - Tasks (both map and reduce) on failed workers are scheduled on a different worker node
  - Master failure
    - Checkpoints of master state
    - Recovery after failure (progress can halt)
  - Replication on distributed data store

11 Hadoop
- Most popular MapReduce implementation, developed by Yahoo!
- Two components
  - Processing engine
  - HDFS: Hadoop Distributed File System (others possible)
  - Can be deployed on the same machine or on different machines
- Processes
  - Job tracker: hosted on the master node and implements the schedule
  - Task tracker: hosted on the worker nodes; accepts tasks from the job tracker and executes them
- HDFS
  - Name node: stores how data are partitioned, monitors the status of data nodes, and holds the data dictionary
  - Data node: stores and manages the data chunks assigned to it
[Figure: Hadoop deployment. At the MapReduce layer, one node hosts the job tracker and workers 1 to n each host a task tracker; at the HDFS layer, the same node hosts the name node and each worker hosts a data node.]

12 Hadoop UDF Functions
Phase   | Name                            | Function
Map     | InputFormat::getSplit           | Partition the input data into different splits. Each split is processed by a mapper and may consist of several chunks.
Map     | RecordReader::next              | Define how a split is divided into items. Each item is a key/value pair and is used as the input for the map function.
Map     | Mapper::map                     | Users can customize the map function to process the input data. The input data are transformed into intermediate key/value pairs.
Shuffle | WritableComparable::compareTo   | The comparison function for the key/value pairs.
Shuffle | Job::setCombinerClass           | Specify how the key/value pairs are aggregated locally.
Shuffle | Job::setPartitionerClass        | Specify how the intermediate key/value pairs are shuffled to different reducers.
Reduce  | Job::setGroupingComparatorClass | Specify how the key/value pairs are grouped in the reduce phase.
Reduce  | Reducer::reduce                 | Users write their own reduce functions to perform the corresponding jobs.

13 MapReduce Implementations
Name      | Language               | File System               | Index | Master Server                     | Multiple Job Support
Hadoop    | Java                   | HDFS                      | No    | Name Node and Job Tracker         | Yes
Cascading | Java                   | HDFS                      | No    | Name Node and Job Tracker         | Yes
Disco     | Python and Erlang      | Distributed Index         | No    | Disco Server                      | No
Skynet    | Ruby                   | MySQL or Unix File System | No    | Any node in the cluster           | No
FileMap   | Shell and Perl Scripts | Unix File System          | No    | Any node in the cluster           | No
Twister   | Java                   | Unix File System          | No    | One master node in broker network | Yes

14 MapReduce Languages
                   | Hive                                  | Pig
Language           | Declarative SQL-like                  | Dataflow
Data model         | Nested                                | Nested
UDF                | Supported                             | Supported
Data partition     | Supported                             | Not supported
Interface          | Command line, web, JDBC/ODBC server   | Command line
Query optimization | Rule based                            | Rule based
Metastore          | Supported                             | Not supported

15 MapReduce Implementations of Database Operators
- Select and Project can be easily implemented in the map function
- Aggregation is not difficult (see next slide)
- Join requires more work
MapReduce join implementations:
- θ-join
- Equi-join
  - Repartition join
  - Semi-join
  - Map-only join
    - Broadcast join
    - Partition join
- Similarity join
- Multi-way join
  - Multiple MapReduce jobs
  - Replicated join

16 Aggregation
- Extracting the aggregation attribute (Aid)
- Grouping by Aid
- Applying the aggregation function to the tuples with the same Aid
[Figure: aggregation in MapReduce. Map phase: Mapper 1 reads tuples R1 and R2 of R and emits the (Aid, Value) pairs (1, R1) and (2, R2); Mapper 2 reads R3 and R4 and emits (1, R3) and (2, R4). The pairs are partitioned by Aid (round robin). Reduce phase: Reducer 1 receives Aid 1 with values R1 and R3 and outputs (1, f(R1, R3)); Reducer 2 receives Aid 2 with values R2 and R4 and outputs (2, f(R2, R4)).]
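The three aggregation steps can be sketched with an explicit partitioning function that decides which reducer receives each Aid. The names `partition` and `aggregate`, the modulo-based assignment, and the numeric values are assumptions made for illustration; the slide only states that pairs are partitioned by Aid.

```python
from collections import defaultdict


def partition(aid, num_reducers):
    # Hypothetical partitioner: all pairs with the same Aid reach one reducer.
    return aid % num_reducers


def aggregate(mapper_outputs, num_reducers, f):
    # One {Aid: [values]} table per reducer.
    reducers = [defaultdict(list) for _ in range(num_reducers)]
    for output in mapper_outputs:
        for aid, value in output:
            reducers[partition(aid, num_reducers)][aid].append(value)
    results = {}
    for groups in reducers:
        for aid, values in groups.items():
            results[aid] = f(values)  # apply the aggregation function per Aid
    return results


# (Aid, Value) pairs as emitted by two mappers; the numbers are made up.
mapper_1 = [(1, 10), (2, 20)]
mapper_2 = [(1, 30), (2, 40)]
totals = aggregate([mapper_1, mapper_2], num_reducers=2, f=sum)
print(totals)
```

Any function over a list of values can be passed as `f`, for example `max` or `len` in place of `sum`.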

17 θ-join
Baseline implementation of R(A, B) ⋈ S(B, C)
1. Partition R and assign each partition to mappers
2. Each mapper takes ⟨a, b⟩ tuples and converts them to a list of key/value pairs of the form (b, ⟨a, R⟩)
3. Each reducer pulls the pairs with the same key
4. Each reducer joins tuples of R with tuples of S

18 θ-join
- If θ is = (i.e., equijoin)
  - Repartition join
  - Semijoin-based join
  - Map-only join
- If θ is ≠
  - Randomly assigning a bucket ID (Bid)
  - Grouping by Bid
  - Aggregating tuples based on their origins
  - Local θ-join (θ is ≠)
[Figure: θ-join in MapReduce. Map phase: two mappers read R and S (four tuples each) and emit (Bid, (origin, key, tuple)) pairs with randomly chosen Bids; the pairs are partitioned by Bid (round robin). Reduce phase: a reducer separates its tuples by origin, e.g., R: (1, R1), (4, R4) and S: (3, S3), (4, S4), and its local θ-join outputs R1 ⋈ S3, R1 ⋈ S4, and R4 ⋈ S3.]
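One way to make random bucket assignment produce a complete θ-join is to arrange the reducers in a grid: each R tuple picks a random row and is sent to every cell in that row, and each S tuple picks a random column and is sent to every cell in that column, so every (r, s) pair meets in exactly one cell. The grid layout is an assumption introduced here, not something the slide spells out, and `theta_join` and its parameters are hypothetical names.

```python
import random


def theta_join(r_tuples, s_tuples, theta, rows=2, cols=2, seed=0):
    rng = random.Random(seed)
    # One reducer per grid cell; each keeps its tuples separated by origin.
    cells = {(i, j): {"R": [], "S": []} for i in range(rows) for j in range(cols)}
    for t in r_tuples:
        i = rng.randrange(rows)          # random row for this R tuple
        for j in range(cols):
            cells[(i, j)]["R"].append(t)
    for t in s_tuples:
        j = rng.randrange(cols)          # random column for this S tuple
        for i in range(rows):
            cells[(i, j)]["S"].append(t)
    result = []
    for cell in cells.values():
        # Local theta-join inside one reducer, comparing the join keys.
        result.extend(
            (r, s) for r in cell["R"] for s in cell["S"] if theta(r[0], s[0])
        )
    return sorted(result)


R = [(1, "R1"), (4, "R4")]
S = [(3, "S3"), (4, "S4")]
pairs = theta_join(R, S, theta=lambda a, b: a != b)
print(pairs)
```

Because each pair is compared in exactly one cell, the output does not depend on the random choices; only the load per reducer does.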

19 Repartition Join
- Tagging origins
- Grouping by keys
- Local join
[Figure: repartition join. Map phase: one mapper reads R = {(1, R1), (2, R2), (3, R3), (4, R4)} and emits (1, (R, R1)), (2, (R, R2)), (3, (R, R3)), (4, (R, R4)); another reads S = {(1, S1), (2, S2), (3, S3), (4, S4)} and emits (1, (S, S1)), (2, (S, S2)), (3, (S, S3)), (4, (S, S4)). The pairs are partitioned by key (round robin). Reduce phase: Reducer 1 receives keys 1 and 3 and outputs R1 ⋈ S1 and R3 ⋈ S3; Reducer 2 receives keys 2 and 4 and outputs R2 ⋈ S2 and R4 ⋈ S4.]
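The three steps of the repartition join (tag, group by key, join locally) can be sketched as follows, using the tuples from the figure. The helper names and the modulo-based assignment of keys to reducers are assumptions made for illustration.

```python
from collections import defaultdict


def tag(relation_name, tuples):
    # Map side: keep the join key and tag each value with its origin.
    return [(key, (relation_name, value)) for key, value in tuples]


def reduce_join(key, tagged_values):
    # Reduce side: split by origin, then pair every R value with every S value.
    r_side = [v for origin, v in tagged_values if origin == "R"]
    s_side = [v for origin, v in tagged_values if origin == "S"]
    return [(key, r, s) for r in r_side for s in s_side]


def repartition_join(r_tuples, s_tuples, num_reducers=2):
    reducers = [defaultdict(list) for _ in range(num_reducers)]
    for key, tagged in tag("R", r_tuples) + tag("S", s_tuples):
        reducers[key % num_reducers][key].append(tagged)  # shuffle by key
    result = []
    for groups in reducers:
        for key, tagged_values in groups.items():
            result.extend(reduce_join(key, tagged_values))
    return sorted(result)


R = [(1, "R1"), (2, "R2"), (3, "R3"), (4, "R4")]
S = [(1, "S1"), (2, "S2"), (3, "S3"), (4, "S4")]
joined = repartition_join(R, S)
print(joined)
```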

20 Semijoin-based Join
- Job 1 (full MapReduce job): extracting the join keys of R
- Job 2 (map-only job): broadcasting the keys of R to all the splits of S and joining S with the keys of R
- Job 3 (map-only job): broadcasting the result of the previous job (the reduced S) to all the splits of R, and locally joining R with it
[Figure: semijoin-based join. Job 1 runs over R = {(1, R1), (3, R2), (1, R3), (4, R4)} and produces its join keys. In Job 2, Mapper 1 holds the S split (1, S1), (2, S2) and Mapper 2 holds (3, S3), (4, S4); together they output (1, S1), (3, S3), (4, S4). In Job 3, each mapper joins its split of R with this reduced S; the results shown for Mapper 1 are R1 ⋈ S1 and R2 ⋈ S3.]
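The three jobs can be sketched as three small functions, run here in one process on the tuples from the figure. The function names and the way R and S are cut into splits are assumptions made for illustration.

```python
def job1_extract_keys(r_tuples):
    # Job 1: the distinct join keys of R.
    return {key for key, _ in r_tuples}


def job2_filter_s(s_split, r_keys):
    # Job 2 (map-only): keep only the S tuples whose key occurs in R.
    return [(key, value) for key, value in s_split if key in r_keys]


def job3_join_split(r_split, reduced_s):
    # Job 3 (map-only): join one split of R with the broadcast, reduced S.
    lookup = {}
    for key, value in reduced_s:
        lookup.setdefault(key, []).append(value)
    return [(key, r, s) for key, r in r_split for s in lookup.get(key, [])]


R = [(1, "R1"), (3, "R2"), (1, "R3"), (4, "R4")]
S_splits = [[(1, "S1"), (2, "S2")], [(3, "S3"), (4, "S4")]]

r_keys = job1_extract_keys(R)
reduced_s = [t for split in S_splits for t in job2_filter_s(split, r_keys)]
R_splits = [R[:2], R[2:]]
semijoin_result = [
    row for split in R_splits for row in job3_join_split(split, reduced_s)
]
print(semijoin_result)
```

The benefit is that only S tuples that can actually join (here three of four) are broadcast in the last job.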

21 Map-only Join
- Broadcast join: if the inner relation is much smaller than the outer relation, no shuffling is needed
  - Map phase similar to the third job of the semijoin-based join
  - Each mapper loads the full inner table to build an in-memory hash table, then scans the outer relation
- Trojan join: the relations are already co-partitioned on the join key, so all tuples of both relations with the same join key are co-located on the same node
  - Co-partitioning is implemented by one job
  - The scheduler loads co-partitioned data chunks in the same mapper to perform a local join
[Figure: a co-partitioned split made of co-groups, each holding a header and data for R (H_R, Data_R) and a header and data for S (H_S, Data_S), followed by a footer.]
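The idea behind joining co-partitioned data without a shuffle can be sketched as below: both relations are split with the same function of the join key, and each mapper then builds an in-memory hash table on one side and scans the other. The function names, the hash-based partitioning, and the sample tuples are assumptions for illustration; the on-disk co-group layout shown in the figure is not modelled.

```python
def co_partition(tuples, num_partitions):
    # Same partitioning function for both relations, so matching keys
    # always end up in partitions with the same index.
    parts = [[] for _ in range(num_partitions)]
    for key, value in tuples:
        parts[hash(key) % num_partitions].append((key, value))
    return parts


def map_only_join(r_part, s_part):
    # Build an in-memory hash table on the S side, then scan the R side.
    table = {}
    for key, value in s_part:
        table.setdefault(key, []).append(value)
    return [(key, r, s) for key, r in r_part for s in table.get(key, [])]


R = [(1, "R1"), (2, "R2"), (3, "R3")]
S = [(1, "S1"), (3, "S3"), (3, "S3b")]

r_parts = co_partition(R, 2)
s_parts = co_partition(S, 2)
co_joined = sorted(
    row
    for r_part, s_part in zip(r_parts, s_parts)
    for row in map_only_join(r_part, s_part)
)
print(co_joined)
```

A broadcast join corresponds to calling `map_only_join` on every split of the outer relation with the whole inner relation as the hashed side.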

22 Multiway Join
- Multiple MapReduce jobs
  - R ⋈ S ⋈ T is evaluated as (R ⋈ S) ⋈ T
  - Each join is implemented in one MapReduce job
  - Join ordering problem
- Replicated join
  - Single MapReduce job

23 Replicated Join
- Generating keys and tagging origins
- Shuffling tuples to multiple reducers if necessary
- Joining the three tables locally
[Figure: replicated join of three tables. Map phase: Mapper 1 reads R(Rid, Value) = {(1, C1), (2, C2)} and emits ((1, null), (C, C1)) and ((2, null), (C, C2)); Mapper 2 reads T(Rid, Sid, Value) = {(1, 1, L1), (1, 2, L2), (2, 2, L3)} and emits ((1, 1), (L, L1)), ((1, 2), (L, L2)), ((2, 2), (L, L3)); Mapper 3 reads S(Sid, Value) = {(1, O1), (2, O2)} and emits ((null, 1), (O, O1)) and ((null, 2), (O, O2)). The pairs are partitioned by key (round robin), and tuples with a null key component are sent to several reducers. Reduce phase: Reducer 1 holds C1, L1, O1 and outputs (C1, L1, O1); Reducer 2 holds C1, L2, O2 and outputs (C1, L2, O2); Reducer 3 holds C2 and O1 and outputs nothing; Reducer 4 holds C2, L3, O2 and outputs (C2, L3, O2).]
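The replication step can be sketched with one reducer per (Rid, Sid) combination: R tuples are copied to every reducer with their Rid, S tuples to every reducer with their Sid, and T tuples go to exactly one reducer. Using one reducer per key combination is an assumption made to keep the example small, and `replicated_join` is a hypothetical name; the data are the three tables from the figure.

```python
from itertools import product


def replicated_join(r, s, t):
    rids = sorted({rid for rid, _ in r})
    sids = sorted({sid for sid, _ in s})
    # One reducer per (Rid, Sid) cell.
    reducers = {cell: {"R": [], "S": [], "T": []} for cell in product(rids, sids)}
    for rid, value in r:
        for sid in sids:                 # key (rid, null): replicate over all Sids
            reducers[(rid, sid)]["R"].append(value)
    for sid, value in s:
        for rid in rids:                 # key (null, sid): replicate over all Rids
            reducers[(rid, sid)]["S"].append(value)
    for rid, sid, value in t:
        if (rid, sid) in reducers:       # key (rid, sid): exactly one reducer
            reducers[(rid, sid)]["T"].append(value)
    result = []
    for cell in sorted(reducers):
        parts = reducers[cell]
        # Local three-way join; empty if any of the three sides is missing.
        result.extend(product(parts["R"], parts["T"], parts["S"]))
    return result


R = [(1, "C1"), (2, "C2")]
T = [(1, 1, "L1"), (1, 2, "L2"), (2, 2, "L3")]
S = [(1, "O1"), (2, "O2")]
three_way = replicated_join(R, S, T)
print(three_way)
```

The whole join runs in a single job, at the cost of shipping each R and S tuple to several reducers.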

24 DBMS on MapReduce
                   | HadoopDB                                                        | Llama                                                                                      | Cheetah
Language           | SQL-like                                                        | Simple interface                                                                           | SQL
Storage            | Row store                                                       | Column store                                                                               | Hybrid store
Data compression   | No                                                              | Yes                                                                                        | Yes
Data partition     | Horizontally partitioned                                        | Vertically partitioned                                                                     | Horizontally partitioned at chunk level
Indexing           | Local index in each database instance                           | No index                                                                                   | Local index for each data chunk
Query optimization | Rule based optimization plus local optimization by PostgreSQL   | Column-based optimization, late materialization and processing multiway join in one job    | Multi-query optimization, materialized views

Jeffrey D. Ullman slides. MapReduce for data intensive computing

Jeffrey D. Ullman slides. MapReduce for data intensive computing Jeffrey D. Ullman slides MapReduce for data intensive computing Single-node architecture CPU Machine Learning, Statistics Memory Classical Data Mining Disk Commodity Clusters Web data sets can be very

More information

A Distributed Data Management Using MapReduce

A Distributed Data Management Using MapReduce A Distributed Data Management Using MapReduce FENG LI, National University of Singapore BENG CHIN OOI, National University of Singapore M. TAMER ÖZSU, University of Waterloo SAI WU, Zhejiang University

More information

Hadoop Job Oriented Training Agenda

Hadoop Job Oriented Training Agenda 1 Hadoop Job Oriented Training Agenda Kapil CK hdpguru@gmail.com Module 1 M o d u l e 1 Understanding Hadoop This module covers an overview of big data, Hadoop, and the Hortonworks Data Platform. 1.1 Module

More information

MapReduce Jeffrey Dean and Sanjay Ghemawat. Background context

MapReduce Jeffrey Dean and Sanjay Ghemawat. Background context MapReduce Jeffrey Dean and Sanjay Ghemawat Background context BIG DATA!! o Large-scale services generate huge volumes of data: logs, crawls, user databases, web site content, etc. o Very useful to be able

More information

Parallel Databases. Parallel Architectures. Parallelism Terminology 1/4/2015. Increase performance by performing operations in parallel

Parallel Databases. Parallel Architectures. Parallelism Terminology 1/4/2015. Increase performance by performing operations in parallel Parallel Databases Increase performance by performing operations in parallel Parallel Architectures Shared memory Shared disk Shared nothing closely coupled loosely coupled Parallelism Terminology Speedup:

More information

Map Reduce & Hadoop Recommended Text:

Map Reduce & Hadoop Recommended Text: Big Data Map Reduce & Hadoop Recommended Text:! Large datasets are becoming more common The New York Stock Exchange generates about one terabyte of new trade data per day. Facebook hosts approximately

More information

Big Data Processing with Google s MapReduce. Alexandru Costan

Big Data Processing with Google s MapReduce. Alexandru Costan 1 Big Data Processing with Google s MapReduce Alexandru Costan Outline Motivation MapReduce programming model Examples MapReduce system architecture Limitations Extensions 2 Motivation Big Data @Google:

More information

Hadoop: A Framework for Data- Intensive Distributed Computing. CS561-Spring 2012 WPI, Mohamed Y. Eltabakh

Hadoop: A Framework for Data- Intensive Distributed Computing. CS561-Spring 2012 WPI, Mohamed Y. Eltabakh 1 Hadoop: A Framework for Data- Intensive Distributed Computing CS561-Spring 2012 WPI, Mohamed Y. Eltabakh 2 What is Hadoop? Hadoop is a software framework for distributed processing of large datasets

More information

MapReduce. MapReduce and SQL Injections. CS 3200 Final Lecture. Introduction. MapReduce. Programming Model. Example

MapReduce. MapReduce and SQL Injections. CS 3200 Final Lecture. Introduction. MapReduce. Programming Model. Example MapReduce MapReduce and SQL Injections CS 3200 Final Lecture Jeffrey Dean and Sanjay Ghemawat. MapReduce: Simplified Data Processing on Large Clusters. OSDI'04: Sixth Symposium on Operating System Design

More information

A programming model in Cloud: MapReduce

A programming model in Cloud: MapReduce A programming model in Cloud: MapReduce Programming model and implementation developed by Google for processing large data sets Users specify a map function to generate a set of intermediate key/value

More information

Programming Hadoop 5-day, instructor-led BD-106. MapReduce Overview. Hadoop Overview

Programming Hadoop 5-day, instructor-led BD-106. MapReduce Overview. Hadoop Overview Programming Hadoop 5-day, instructor-led BD-106 MapReduce Overview The Client Server Processing Pattern Distributed Computing Challenges MapReduce Defined Google's MapReduce The Map Phase of MapReduce

More information

Big Data: Using ArcGIS with Apache Hadoop. Erik Hoel and Mike Park

Big Data: Using ArcGIS with Apache Hadoop. Erik Hoel and Mike Park Big Data: Using ArcGIS with Apache Hadoop Erik Hoel and Mike Park Outline Overview of Hadoop Adding GIS capabilities to Hadoop Integrating Hadoop with ArcGIS Apache Hadoop What is Hadoop? Hadoop is a scalable

More information

Systems Infrastructure for Data Science. Web Science Group Uni Freiburg WS 2012/13

Systems Infrastructure for Data Science. Web Science Group Uni Freiburg WS 2012/13 Systems Infrastructure for Data Science Web Science Group Uni Freiburg WS 2012/13 Hadoop Ecosystem Overview of this Lecture Module Background Google MapReduce The Hadoop Ecosystem Core components: Hadoop

More information

Lecture Data Warehouse Systems

Lecture Data Warehouse Systems Lecture Data Warehouse Systems Eva Zangerle SS 2013 PART C: Novel Approaches in DW NoSQL and MapReduce Stonebraker on Data Warehouses Star and snowflake schemas are a good idea in the DW world C-Stores

More information

CSE 590: Special Topics Course ( Supercomputing ) Lecture 10 ( MapReduce& Hadoop)

CSE 590: Special Topics Course ( Supercomputing ) Lecture 10 ( MapReduce& Hadoop) CSE 590: Special Topics Course ( Supercomputing ) Lecture 10 ( MapReduce& Hadoop) Rezaul A. Chowdhury Department of Computer Science SUNY Stony Brook Spring 2016 MapReduce MapReduce is a programming model

More information

Advanced Data Management Technologies

Advanced Data Management Technologies ADMT 2015/16 Unit 15 J. Gamper 1/53 Advanced Data Management Technologies Unit 15 MapReduce J. Gamper Free University of Bozen-Bolzano Faculty of Computer Science IDSE Acknowledgements: Much of the information

More information

PLATFORM AND SOFTWARE AS A SERVICE THE MAPREDUCE PROGRAMMING MODEL AND IMPLEMENTATIONS

PLATFORM AND SOFTWARE AS A SERVICE THE MAPREDUCE PROGRAMMING MODEL AND IMPLEMENTATIONS PLATFORM AND SOFTWARE AS A SERVICE THE MAPREDUCE PROGRAMMING MODEL AND IMPLEMENTATIONS By HAI JIN, SHADI IBRAHIM, LI QI, HAIJUN CAO, SONG WU and XUANHUA SHI Prepared by: Dr. Faramarz Safi Islamic Azad

More information

Big Data and Apache Hadoop s MapReduce

Big Data and Apache Hadoop s MapReduce Big Data and Apache Hadoop s MapReduce Michael Hahsler Computer Science and Engineering Southern Methodist University January 23, 2012 Michael Hahsler (SMU/CSE) Hadoop/MapReduce January 23, 2012 1 / 23

More information

Introduction to Hadoop. New York Oracle User Group Vikas Sawhney

Introduction to Hadoop. New York Oracle User Group Vikas Sawhney Introduction to Hadoop New York Oracle User Group Vikas Sawhney GENERAL AGENDA Driving Factors behind BIG-DATA NOSQL Database 2014 Database Landscape Hadoop Architecture Map/Reduce Hadoop Eco-system Hadoop

More information

Chapter 11 Map-Reduce, Hadoop, HDFS, Hbase, MongoDB, Apache HIVE, and Related

Chapter 11 Map-Reduce, Hadoop, HDFS, Hbase, MongoDB, Apache HIVE, and Related Chapter 11 Map-Reduce, Hadoop, HDFS, Hbase, MongoDB, Apache HIVE, and Related Summary Xiangzhe Li Nowadays, there are more and more data everyday about everything. For instance, here are some of the astonishing

More information

Implement Hadoop jobs to extract business value from large and varied data sets

Implement Hadoop jobs to extract business value from large and varied data sets Hadoop Development for Big Data Solutions: Hands-On You Will Learn How To: Implement Hadoop jobs to extract business value from large and varied data sets Write, customize and deploy MapReduce jobs to

More information

How To Scale Out Of A Nosql Database

How To Scale Out Of A Nosql Database Firebird meets NoSQL (Apache HBase) Case Study Firebird Conference 2011 Luxembourg 25.11.2011 26.11.2011 Thomas Steinmaurer DI +43 7236 3343 896 thomas.steinmaurer@scch.at www.scch.at Michael Zwick DI

More information

Big Data With Hadoop

Big Data With Hadoop With Saurabh Singh singh.903@osu.edu The Ohio State University February 11, 2016 Overview 1 2 3 Requirements Ecosystem Resilient Distributed Datasets (RDDs) Example Code vs Mapreduce 4 5 Source: [Tutorials

More information

Introduction to Hadoop HDFS and Ecosystems. Slides credits: Cloudera Academic Partners Program & Prof. De Liu, MSBA 6330 Harvesting Big Data

Introduction to Hadoop HDFS and Ecosystems. Slides credits: Cloudera Academic Partners Program & Prof. De Liu, MSBA 6330 Harvesting Big Data Introduction to Hadoop HDFS and Ecosystems ANSHUL MITTAL Slides credits: Cloudera Academic Partners Program & Prof. De Liu, MSBA 6330 Harvesting Big Data Topics The goal of this presentation is to give

More information

Integrating Big Data into the Computing Curricula

Integrating Big Data into the Computing Curricula Integrating Big Data into the Computing Curricula Yasin Silva, Suzanne Dietrich, Jason Reed, Lisa Tsosie Arizona State University http://www.public.asu.edu/~ynsilva/ibigdata/ 1 Overview Motivation Big

More information

Distributed Computing and Big Data: Hadoop and MapReduce

Distributed Computing and Big Data: Hadoop and MapReduce Distributed Computing and Big Data: Hadoop and MapReduce Bill Keenan, Director Terry Heinze, Architect Thomson Reuters Research & Development Agenda R&D Overview Hadoop and MapReduce Overview Use Case:

More information

CSE-E5430 Scalable Cloud Computing Lecture 2

CSE-E5430 Scalable Cloud Computing Lecture 2 CSE-E5430 Scalable Cloud Computing Lecture 2 Keijo Heljanko Department of Computer Science School of Science Aalto University keijo.heljanko@aalto.fi 14.9-2015 1/36 Google MapReduce A scalable batch processing

More information

Big Data Course Highlights

Big Data Course Highlights Big Data Course Highlights The Big Data course will start with the basics of Linux which are required to get started with Big Data and then slowly progress from some of the basics of Hadoop/Big Data (like

More information

Open source software framework designed for storage and processing of large scale data on clusters of commodity hardware

Open source software framework designed for storage and processing of large scale data on clusters of commodity hardware Open source software framework designed for storage and processing of large scale data on clusters of commodity hardware Created by Doug Cutting and Mike Carafella in 2005. Cutting named the program after

More information

16.1 MAPREDUCE. For personal use only, not for distribution. 333

16.1 MAPREDUCE. For personal use only, not for distribution. 333 For personal use only, not for distribution. 333 16.1 MAPREDUCE Initially designed by the Google labs and used internally by Google, the MAPREDUCE distributed programming model is now promoted by several

More information

Chapter 7. Using Hadoop Cluster and MapReduce

Chapter 7. Using Hadoop Cluster and MapReduce Chapter 7 Using Hadoop Cluster and MapReduce Modeling and Prototyping of RMS for QoS Oriented Grid Page 152 7. Using Hadoop Cluster and MapReduce for Big Data Problems The size of the databases used in

More information

BIG DATA HANDS-ON WORKSHOP Data Manipulation with Hive and Pig

BIG DATA HANDS-ON WORKSHOP Data Manipulation with Hive and Pig BIG DATA HANDS-ON WORKSHOP Data Manipulation with Hive and Pig Contents Acknowledgements... 1 Introduction to Hive and Pig... 2 Setup... 2 Exercise 1 Load Avro data into HDFS... 2 Exercise 2 Define an

More information

INTERNATIONAL JOURNAL OF PURE AND APPLIED RESEARCH IN ENGINEERING AND TECHNOLOGY

INTERNATIONAL JOURNAL OF PURE AND APPLIED RESEARCH IN ENGINEERING AND TECHNOLOGY INTERNATIONAL JOURNAL OF PURE AND APPLIED RESEARCH IN ENGINEERING AND TECHNOLOGY A PATH FOR HORIZING YOUR INNOVATIVE WORK A REVIEW ON HIGH PERFORMANCE DATA STORAGE ARCHITECTURE OF BIGDATA USING HDFS MS.

More information

Data-Intensive Computing with Map-Reduce and Hadoop

Data-Intensive Computing with Map-Reduce and Hadoop Data-Intensive Computing with Map-Reduce and Hadoop Shamil Humbetov Department of Computer Engineering Qafqaz University Baku, Azerbaijan humbetov@gmail.com Abstract Every day, we create 2.5 quintillion

More information

Large scale processing using Hadoop. Ján Vaňo

Large scale processing using Hadoop. Ján Vaňo Large scale processing using Hadoop Ján Vaňo What is Hadoop? Software platform that lets one easily write and run applications that process vast amounts of data Includes: MapReduce offline computing engine

More information

Hadoop and Map-Reduce. Swati Gore

Hadoop and Map-Reduce. Swati Gore Hadoop and Map-Reduce Swati Gore Contents Why Hadoop? Hadoop Overview Hadoop Architecture Working Description Fault Tolerance Limitations Why Map-Reduce not MPI Distributed sort Why Hadoop? Existing Data

More information

Big Data Storage, Management and challenges. Ahmed Ali-Eldin

Big Data Storage, Management and challenges. Ahmed Ali-Eldin Big Data Storage, Management and challenges Ahmed Ali-Eldin (Ambitious) Plan What is Big Data? And Why talk about Big Data? How to store Big Data? BigTables (Google) Dynamo (Amazon) How to process Big

More information

Click Stream Data Analysis Using Hadoop

Click Stream Data Analysis Using Hadoop Governors State University OPUS Open Portal to University Scholarship Capstone Projects Spring 2015 Click Stream Data Analysis Using Hadoop Krishna Chand Reddy Gaddam Governors State University Sivakrishna

More information

NoSQL and Hadoop Technologies On Oracle Cloud

NoSQL and Hadoop Technologies On Oracle Cloud NoSQL and Hadoop Technologies On Oracle Cloud Vatika Sharma 1, Meenu Dave 2 1 M.Tech. Scholar, Department of CSE, Jagan Nath University, Jaipur, India 2 Assistant Professor, Department of CSE, Jagan Nath

More information

Big Data Technology Map-Reduce Motivation: Indexing in Search Engines

Big Data Technology Map-Reduce Motivation: Indexing in Search Engines Big Data Technology Map-Reduce Motivation: Indexing in Search Engines Edward Bortnikov & Ronny Lempel Yahoo Labs, Haifa Indexing in Search Engines Information Retrieval s two main stages: Indexing process

More information

How To Use Hadoop

How To Use Hadoop Hadoop in Action Justin Quan March 15, 2011 Poll What s to come Overview of Hadoop for the uninitiated How does Hadoop work? How do I use Hadoop? How do I get started? Final Thoughts Key Take Aways Hadoop

More information

Data Management in the Cloud -

Data Management in the Cloud - Data Management in the Cloud - current issues and research directions Patrick Valduriez Esther Pacitti DNAC Congress, Paris, nov. 2010 http://www.med-hoc-net-2010.org SOPHIA ANTIPOLIS - MÉDITERRANÉE Is

More information

Scaling Out With Apache Spark. DTL Meeting 17-04-2015 Slides based on https://www.sics.se/~amir/files/download/dic/spark.pdf

Scaling Out With Apache Spark. DTL Meeting 17-04-2015 Slides based on https://www.sics.se/~amir/files/download/dic/spark.pdf Scaling Out With Apache Spark DTL Meeting 17-04-2015 Slides based on https://www.sics.se/~amir/files/download/dic/spark.pdf Your hosts Mathijs Kattenberg Technical consultant Jeroen Schot Technical consultant

More information

Hadoop implementation of MapReduce computational model. Ján Vaňo

Hadoop implementation of MapReduce computational model. Ján Vaňo Hadoop implementation of MapReduce computational model Ján Vaňo What is MapReduce? A computational model published in a paper by Google in 2004 Based on distributed computation Complements Google s distributed

More information

Big Data Primer. 1 Why Big Data? Alex Sverdlov alex@theparticle.com

Big Data Primer. 1 Why Big Data? Alex Sverdlov alex@theparticle.com Big Data Primer Alex Sverdlov alex@theparticle.com 1 Why Big Data? Data has value. This immediately leads to: more data has more value, naturally causing datasets to grow rather large, even at small companies.

More information

CitusDB Architecture for Real-Time Big Data

CitusDB Architecture for Real-Time Big Data CitusDB Architecture for Real-Time Big Data CitusDB Highlights Empowers real-time Big Data using PostgreSQL Scales out PostgreSQL to support up to hundreds of terabytes of data Fast parallel processing

More information

Big Data and Scripting map/reduce in Hadoop

Big Data and Scripting map/reduce in Hadoop Big Data and Scripting map/reduce in Hadoop 1, 2, parts of a Hadoop map/reduce implementation core framework provides customization via indivudual map and reduce functions e.g. implementation in mongodb

More information

American International Journal of Research in Science, Technology, Engineering & Mathematics

American International Journal of Research in Science, Technology, Engineering & Mathematics American International Journal of Research in Science, Technology, Engineering & Mathematics Available online at http://www.iasir.net ISSN (Print): 2328-3491, ISSN (Online): 2328-3580, ISSN (CD-ROM): 2328-3629

More information

Associate Professor, Department of CSE, Shri Vishnu Engineering College for Women, Andhra Pradesh, India 2

Associate Professor, Department of CSE, Shri Vishnu Engineering College for Women, Andhra Pradesh, India 2 Volume 6, Issue 3, March 2016 ISSN: 2277 128X International Journal of Advanced Research in Computer Science and Software Engineering Research Paper Available online at: www.ijarcsse.com Special Issue

More information

SQL VS. NO-SQL. Adapted Slides from Dr. Jennifer Widom from Stanford

SQL VS. NO-SQL. Adapted Slides from Dr. Jennifer Widom from Stanford SQL VS. NO-SQL Adapted Slides from Dr. Jennifer Widom from Stanford 55 Traditional Databases SQL = Traditional relational DBMS Hugely popular among data analysts Widely adopted for transaction systems

More information

Facebook s Petabyte Scale Data Warehouse using Hive and Hadoop
