Introduction to MapReduce Tore Risch Information Technology Uppsala University


Introduction to MapReduce http://user.it.uu.se/~torer/kurser/dm2/mapreduce.pdf Tore Risch Information Technology Uppsala University 2015-05-11

What is a NoSQL Database?
- A key/value store
  - Basic index manager, no complete query language
  - E.g. Google BigTable, Amazon Dynamo
- A DBMS with a limited query language
  - Provides for high volume small business transactions
  - Sometimes called cloud databases
  - E.g. Google App Engine, Microsoft Azure, Amazon SimpleDB; MongoDB
- A DBMS with a new query language for new kinds of applications
  - E.g. Amos II, Streambase, Virtuoso, Neo4J

What is a NoSQL Database?
- A batch database processing system where mapreduce is used instead of a query language to query, transform, and analyze large datasets
  - Manually written programs iterate over entire data sets in batch
  - E.g. Hadoop, Spark
- A mapreduce engine with a query language on top:
  - HIVE on top of Hadoop provides HiveQL
  - Provides non-procedural data analytics (select - from - group by) without detailed programming
  - Executed in batch as parallel Hadoop jobs

MapReduce
Parallel batch processing using mapreduce
- Highly scalable implementation of parallel batch processing of the same program (e.g. Java, C++, Python) over large amounts of data stored in different files
- Based on a scalable file system (e.g. HDFS)
The mapreduce function:
- Applies a (costly) user function, the mapper, producing key/value pairs in parallel on many nodes accessing files in a cluster
- Applies a user aggregate function, the reducer, on the key/value pairs produced by the mapper
- Very similar to GROUP BY in SQL
Reference article: http://dl.acm.org/citation.cfm?id=1327452.1327492

[Figure: Mapreduce file I/O. Each data file is read by a Map task followed by a Partition step; the partitioned output is sent to Reduce tasks, whose results pass through an Output Writer to the result file.]

Mapreduce code (Python)

    def map(name, document):
        # name: document name, e.g. the HDFS file name
        # document: document contents, i.e. the parsed tokens of the HDFS file
        # A custom parser can be used as a preprocessor
        for w in document:
            yield (w, 1)

    def reduce(word, partialcounts):
        # word: a word
        # partialcounts: an iterator over the partial counts emitted for this word
        total = 0
        for pc in partialcounts:
            total += int(pc)
        yield (word, total)

Mapreduce stages
- Input reader
  - System component that reads files from a scalable file system (e.g. HDFS) and sends them to map functions applied in parallel
- Map function
  - Applied in parallel on many different files
  - Reads input file data from HDFS
  - Does some (expensive) computation
  - Emits key/value pairs as result
  - Key/value pairs are stored by the MapReduce system as temporary files
- Partition function (optional)
  - Partitions the output key/value pairs from the map function into groups of key/value pairs to be reduced in parallel
  - Usually hash partitioning
- Reduce function
  - Iterates over a set of key/value pairs to produce a reduced set of key/value pairs stored in the file system
  - Cf. aggregate functions
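The stages above can be sketched as a small single-process simulation of word count. This is only an illustration under stated assumptions: the function names (input_reader, map_fn, partition_fn, reduce_fn, run_wordcount) and the in-memory "files" are hypothetical, and a real engine such as Hadoop would run the map and reduce calls on different nodes and store intermediate pairs as temporary files.

```python
import zlib
from collections import defaultdict

def input_reader(files):
    """Input reader: yield (name, contents) for each data file."""
    for name, contents in files.items():
        yield name, contents

def map_fn(name, document):
    """Map: emit a (word, 1) pair for every word in the document."""
    for word in document.split():
        yield word, 1

def partition_fn(key, num_reducers):
    """Partition: deterministic hash partitioning of keys over reducers."""
    return zlib.crc32(key.encode("utf-8")) % num_reducers

def reduce_fn(word, partial_counts):
    """Reduce: sum the partial counts collected for one word."""
    return word, sum(partial_counts)

def run_wordcount(files, num_reducers=2):
    # One bucket per reducer; each bucket groups the emitted values by key.
    buckets = [defaultdict(list) for _ in range(num_reducers)]
    for name, document in input_reader(files):
        for key, value in map_fn(name, document):
            buckets[partition_fn(key, num_reducers)][key].append(value)
    # Each bucket could be reduced on a different node; here we simply loop.
    result = {}
    for bucket in buckets:
        for key, values in bucket.items():
            word, total = reduce_fn(key, values)
            result[word] = total
    return result

# Hypothetical input: two tiny "files" held in memory.
files = {"a.txt": "to be or not to be", "b.txt": "to do"}
counts = run_wordcount(files)
print(sorted(counts.items()))
# [('be', 2), ('do', 1), ('not', 1), ('or', 1), ('to', 3)]
```

Because all pairs with the same key hash to the same bucket, each reducer sees every partial count for its words, which is what makes the parallel reduce step correct.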

HIVE
- SQL-like query language HiveQL support on top of Hadoop
- Developed by Facebook, now an Apache project also used and developed by other companies such as Netflix
- Queries very often involve grouping and statistics (as in SQL)
- Queries are compiled into mapreduce jobs
- Handles 80% of mapreduce analysis tasks at Facebook
- Much easier to use compared to raw mapreduce coding
  - Substantial improvement of programming productivity
  - Non-programmers can analyze data

[Figure: Hive architecture]

Wordcount in HiveQL

    FROM (
      MAP docs.doctext USING 'map.py' AS (word, cnt)
      FROM docs
      CLUSTER BY word
    )
    REDUCE word, cnt USING 'reduce.py';

HiveQL example 2
Log file:

    2012-02-03 20:26:41 SampleClass3 [TRACE] verbose detail for id 1527353937
    2012-02-03 20:26:41 SampleClass2 [TRACE] verbose detail for id 191364434
    Java.lang.Exception: 2012-02-03 20:27:10 SampleClass7 [ERROR] incorrect format for id 8276745
    2012-02-03 20:29:38 SampleClass1 [DEBUG] detail for id 903114158

HiveQL:

    CREATE TABLE logs(t1 string, t2 string, t3 string, t4 string,
                      t5 string, t6 string, t7 string)
      ROW FORMAT DELIMITED FIELDS TERMINATED BY ' ';
    LOAD DATA LOCAL INPATH 'sample.log' OVERWRITE INTO TABLE logs;
    SELECT t4 AS sev, COUNT(*) AS cnt
      FROM logs
      WHERE t4 LIKE '[%'
      GROUP BY t4;

Result:

    [TRACE] 2
    [DEBUG] 1

HiveQL example 3
Log file:

    2012-02-03 20:26:41 SampleClass3 [TRACE] verbose detail for id 1527353937
    2012-02-03 20:26:41 SampleClass2 [TRACE] verbose detail for id 191364434
    Java.lang.Exception: 2012-02-03 20:27:10 SampleClass7 [ERROR] incorrect format for id 8276745
    2012-02-03 20:29:38 SampleClass1 [DEBUG] detail for id 903114158

The "Java.lang.Exception:" prefix shifts the fields of the third line by one position, so its severity ends up in column t5 instead of t4.

HiveQL:

    SELECT t5 AS sev, COUNT(*) AS cnt
      FROM logs
      WHERE t5 LIKE '[%'
      GROUP BY t5;

Result:

    [ERROR] 1

Combining both columns:

    SELECT L.sev, SUM(L.cnt)
      FROM (
        SELECT t4 AS sev, COUNT(*) AS cnt FROM logs WHERE t4 LIKE '[%' GROUP BY t4
        UNION ALL
        SELECT t5 AS sev, COUNT(*) AS cnt FROM logs WHERE t5 LIKE '[%' GROUP BY t5
      ) L
      GROUP BY L.sev;

Result:

    [TRACE] 2
    [DEBUG] 1
    [ERROR] 1

=> 3 Mapreduce jobs!
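To see why the severity of the exception line lands in a different column, the following Python sketch emulates the table load and the grouping queries on the four sample log lines. It is an illustration only, not how Hive executes the queries; the helper names load_rows and count_severity are hypothetical.

```python
from collections import Counter

LOG = """\
2012-02-03 20:26:41 SampleClass3 [TRACE] verbose detail for id 1527353937
2012-02-03 20:26:41 SampleClass2 [TRACE] verbose detail for id 191364434
Java.lang.Exception: 2012-02-03 20:27:10 SampleClass7 [ERROR] incorrect format for id 8276745
2012-02-03 20:29:38 SampleClass1 [DEBUG] detail for id 903114158
"""

def load_rows(text, num_columns=7):
    """Split each line on single spaces and keep the first num_columns fields."""
    rows = []
    for line in text.splitlines():
        fields = line.split(" ")[:num_columns]
        fields += [None] * (num_columns - len(fields))  # pad short rows
        rows.append(fields)
    return rows

def count_severity(rows, column):
    """Emulate: SELECT tN, COUNT(*) FROM logs WHERE tN LIKE '[%' GROUP BY tN."""
    index = column - 1  # t4 is the fourth field, i.e. index 3
    return Counter(
        row[index] for row in rows
        if row[index] is not None and row[index].startswith("[")
    )

rows = load_rows(LOG)
by_t4 = count_severity(rows, 4)   # severities found in column t4
by_t5 = count_severity(rows, 5)   # severities shifted into column t5
combined = by_t4 + by_t5          # UNION ALL, then SUM(cnt) GROUP BY sev

print(dict(by_t4))     # {'[TRACE]': 2, '[DEBUG]': 1}
print(dict(by_t5))     # {'[ERROR]': 1}
print(dict(combined))  # {'[TRACE]': 2, '[DEBUG]': 1, '[ERROR]': 1}
```

Each of the three aggregations (t4, t5, and the final sum) corresponds to one grouping step, which matches the three mapreduce jobs noted on the slide.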

Word count in SQL
Alt 1, assume the words of the documents are stored in a table:

    CREATE TABLE docs(ID INTEGER, word VARCHAR(20), PRIMARY KEY(ID))

The query becomes:

    SELECT d.word, COUNT(d.word) FROM docs d GROUP BY d.word

Problem: docs is a table, so the data is not read from a file

Alt 2, use a user defined table function (UDF) to access documents in a file:

    SELECT d.word, COUNT(d.word) FROM mydocuments('C:/mydocuments') AS d GROUP BY d.word
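Alt 1 can be tried out with Python's built-in sqlite3 module. This is a minimal sketch, assuming an in-memory SQLite database and a made-up six-word document; the ORDER BY clause is added here only to make the printed result deterministic.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# One row per word occurrence; id is an automatically assigned key.
conn.execute("CREATE TABLE docs(id INTEGER PRIMARY KEY, word VARCHAR(20))")

words = "to be or not to be".split()  # hypothetical document contents
conn.executemany("INSERT INTO docs(word) VALUES (?)", [(w,) for w in words])

rows = conn.execute(
    "SELECT d.word, COUNT(d.word) FROM docs d GROUP BY d.word ORDER BY d.word"
).fetchall()
conn.close()

print(rows)  # [('be', 2), ('not', 1), ('or', 1), ('to', 2)]
```

The GROUP BY plays the role of the partition and reduce stages: rows with the same word are brought together and counted.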

HiveQL vs raw mapreduce
- Raw mapreduce: a Java (Python, C++, etc.) program does map and reduce
- Very common use of mapreduce: statistics collection over files (count, sum, stdev, etc.)
  - HiveQL handles basic statistics, 80% of applications
- When advanced statistics are not supported in HiveQL (or SQL):
  - Alt 1: User defined aggregate functions in HIVE (UDAF)
    - Can be generally used in other queries too
  - Alt 2: Raw mapreduce
    - Code may be complicated
    - Code cannot be used in queries
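As an illustration of what a user defined aggregate has to provide, here is a sketch of a population standard deviation that is computed from partial states and merged across partitions. The function names init, iterate, merge, and terminate are hypothetical and only loosely modeled on the general shape of such aggregates; this is not the Hive UDAF API.

```python
import math

def init():
    """Empty partial state: (count, sum, sum of squares)."""
    return (0, 0.0, 0.0)

def iterate(state, x):
    """Fold one input value into a partial state (done on the map side)."""
    n, s, ss = state
    return (n + 1, s + x, ss + x * x)

def merge(a, b):
    """Combine two partial states computed on different partitions."""
    return (a[0] + b[0], a[1] + b[1], a[2] + b[2])

def terminate(state):
    """Produce the final population standard deviation, or None if empty."""
    n, s, ss = state
    if n == 0:
        return None
    mean = s / n
    # Clamp at zero to guard against tiny negative values from rounding.
    return math.sqrt(max(ss / n - mean * mean, 0.0))

def aggregate(values):
    """Run init/iterate over one partition of the data."""
    state = init()
    for x in values:
        state = iterate(state, x)
    return state

# Two hypothetical partitions, as if processed on two nodes.
partition_a = [2, 4, 4, 4]
partition_b = [5, 5, 7, 9]
result = terminate(merge(aggregate(partition_a), aggregate(partition_b)))
print(result)  # 2.0
```

The key design point is that the partial state is small and mergeable, so the aggregate can run in parallel on many partitions. The sum-of-squares formula is simple but can lose precision on large or badly scaled data.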

Reading raw data files to DBMS
- Major point with mapreduce: saves the time a DBMS needs to load the database and build indexes
- Modern DBMSs have bulk load facilities
  - Never use the insert command to bulk load
  - Bulk load speed approaches file copy time
  - Orders of magnitude faster than naïve inserts
  - Automatic parallelization of bulk loading by the DBMS
  - Indexes can be built if needed

DBMSs vs mapreduce
- Modern DBMSs have built-in aggregate functions and user defined functions (UDFs) too
  - Can read data from files using UDFs
- DBMSs have indexing and high parallelism to provide scalability
  - Notice that it takes time to build indexes during database loading => may slow down database loading considerably
- When is HIVE with mapreduce better than RDBs?
  - One-shot queries when no indexing is needed
  - Massively parallel, very expensive brute force computations
  - Embarrassingly parallel computations