CS 4604: Introduction to Database Management Systems. B. Aditya Prakash. Lecture #13: NoSQL and MapReduce



Announcements
- HW4 is out
  - You have to use the PGSQL server
  - START EARLY!! We cannot help if everyone first connects the night before it is due
- On Tuesday 03/25: I will be traveling
  - No office hours. Email me to set up another time, if needed.
  - Lecture will be on XML; Prof. Eli Tilevich will substitute
- HW5 will be released, due on April 1
  - You have to use Amazon AWS. Read the directions carefully.
  - START EARLY!!
- Project Assignment will also be released, but will be due April 8
  - START EARLY!!
Prakash 2014 VT CS 4604

NoSQL (some slides from Xiao Yu)

Why NoSQL?

RDBMS
- The predominant choice in storing data
  - Not so true for data miners, since we mostly work with txt files
- First formulated in 1969 by Codd
- We are using RDBMS everywhere

Slide from Neo Technology, "A NoSQL Overview and the Benefits of Graph Databases"

When RDBMS met Web 2.0
Slide from Lorenzo Alberton, "NoSQL Databases: Why, what and when"

What to do if data is really large?
- Petabytes (exabytes, zettabytes, ...)
- Google processed 24 PB of data per day (2009)
- FB adds 0.5 PB per day

BIG data

What's Wrong with Relational DB?
- Nothing is wrong. You just need to use the right tool.
- Relational is hard to scale
  - Easy to scale reads
  - Hard to scale writes

What's NoSQL?
- The misleading term "NoSQL" is short for "Not Only SQL"
- Non-relational, schema-free, non-(quite)-ACID
  - More on ACID transactions later in class
- Horizontally scalable, distributed, easy replication support
- Simple API

Four (emerging) NoSQL Categories
- Key-value (K-V) stores
  - Based on Distributed Hash Tables / Amazon's Dynamo paper *
  - Data model: (global) collection of K-V pairs
  - Example: Voldemort
- Column families
  - BigTable clones **
  - Data model: big table, column families
  - Example: HBase, Cassandra, Hypertable
* G. DeCandia et al., Dynamo: Amazon's Highly Available Key-value Store, SOSP '07
** F. Chang et al., Bigtable: A Distributed Storage System for Structured Data, OSDI '06

Four (emerging) NoSQL Categories
- Document databases
  - Inspired by Lotus Notes
  - Data model: collections of K-V collections
  - Example: CouchDB, MongoDB
- Graph databases
  - Inspired by Euler & graph theory
  - Data model: nodes, relations, K-V on both
  - Example: AllegroGraph, VertexDB, Neo4j

Focus of Different Data Models
Slide from Neo Technology, "A NoSQL Overview and the Benefits of Graph Databases"

CAP theorem
[Figure: Consistency, Availability and Partition Tolerance, with RDBMS and (most) NoSQL systems placed among them]

When to use NoSQL?
- Bigness
- Massive write performance
  - Twitter generates 7 TB per day (2010)
- Fast key-value access
- Flexible schema or data types
- Schema migration
- Write availability
  - Writes need to succeed no matter what (CAP, partitioning)
- Easier maintainability, administration and operations
- No single point of failure
- Generally available parallel computing
- Programmer ease of use
- Use the right data model for the right problem
- Avoid hitting the wall
- Distributed systems support
- Tunable CAP tradeoffs
from http://highscalability.com/

Key-Value Stores

id    hair_color  age  height
1923  Red         18   6'0"
3371  Blue        34   NA

Table in relational db; Store/Domain in key-value db

- Find users whose age is above 18?
- Find all attributes of user 1923?
- Find users whose hair color is Red and age is 19?
- (Join operation) Calculate average age of all grad students?
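
The contrast behind these questions can be sketched in a few lines of Python. This is a minimal illustration that assumes a key-value store behaves like a plain dictionary; the names `store`, `get` and `users_older_than` are hypothetical, and only the hair color and age columns of the two rows above are used.

```python
# Hypothetical in-memory key-value store: key = user id, value = opaque record.
store = {
    1923: {"hair_color": "Red", "age": 18},
    3371: {"hair_color": "Blue", "age": 34},
}

def get(key):
    """Key lookup: the one operation a K-V store makes fast."""
    return store[key]

def users_older_than(min_age):
    """No secondary index on age, so this has to scan every value."""
    return [k for k, v in store.items() if v["age"] > min_age]

print(get(1923))             # all attributes of user 1923
print(users_older_than(18))  # users whose age is above 18
```

Fetching all attributes of user 1923 is a single lookup, while any question about attribute values turns into a full scan (or an application-side join) unless the store offers extra indexing.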

Voldemort in LinkedIn
Sid Anand, LinkedIn Data Infrastructure (QCon London 2012)

Voldemort vs MySQL
Sid Anand, LinkedIn Data Infrastructure (QCon London 2012)

Column Families: BigTable-like
F. Chang et al., Bigtable: A Distributed Storage System for Structured Data, OSDI '06

BigTable Data Model
The row name is a reversed URL. The contents column family contains the page contents, and the anchor column family contains the text of any anchors that reference the page.
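
As a rough sketch of this data model, the Python below represents the table as nested dictionaries (row key, then column family, then qualifier). The helpers `reverse_host` and `put` are hypothetical, the stored values are illustrative only, and timestamps/versions are left out for brevity.

```python
def reverse_host(host):
    """Turn 'www.cnn.com' into the row key 'com.cnn.www'."""
    return ".".join(reversed(host.split(".")))

# table[row_key][column_family][qualifier] -> value (illustrative values)
table = {}

def put(row, family, qualifier, value):
    table.setdefault(row, {}).setdefault(family, {})[qualifier] = value

row = reverse_host("www.cnn.com")
put(row, "contents", "", "<html>...</html>")   # page contents
put(row, "anchor", "cnnsi.com", "CNN")         # anchor text on a referring page
put(row, "anchor", "my.look.ca", "CNN.com")

print(row)
print(sorted(table[row]["anchor"]))
```

Reversing the hostname means that, in a table sorted by row key, pages from the same domain end up next to each other.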

BigTable Performance

Document Database - MongoDB
- Table in relational db vs. documents in a collection
- Initial release 2009
- Open source, document db
- JSON-like documents with dynamic schema
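
To illustrate "JSON-like documents with dynamic schema", here is a small Python sketch that models a collection as a list of dictionaries. It does not use the MongoDB driver; the `users` data and the `find` helper are hypothetical stand-ins.

```python
import json

# A "collection" modeled as a list of JSON-like documents (hypothetical data).
# The two documents do not share the same fields: the schema is dynamic.
users = [
    {"_id": 1, "name": "Ann", "hair_color": "Red"},
    {"_id": 2, "name": "Bob", "languages": ["SQL", "Java"],
     "address": {"city": "Blacksburg"}},
]

def find(collection, **criteria):
    """Return documents whose top-level fields match all criteria."""
    return [doc for doc in collection
            if all(doc.get(k) == v for k, v in criteria.items())]

print(json.dumps(find(users, name="Bob"), indent=2))
```

A relational table would need a fixed set of columns (or extra tables) to hold the nested address and the list of languages.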

MongoDB Product Deployment
... and much more

Graph Database
Data model abstraction:
- Nodes
- Relations
- Properties
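
A minimal sketch of this abstraction in Python, assuming nodes and relations are both just containers of key-value properties. The node names, the `KNOWS` relation type and the `neighbors` helper are made up for illustration.

```python
# Property-graph sketch: nodes and relations both carry key-value properties.
nodes = {
    "alice": {"age": 30},
    "bob": {"age": 25},
}
relations = [
    # (source, relation type, target, properties)
    ("alice", "KNOWS", "bob", {"since": 2010}),
]

def neighbors(node, rel_type):
    """Follow outgoing relations of one type from a node."""
    return [dst for src, rel, dst, _props in relations
            if src == node and rel == rel_type]

print(neighbors("alice", "KNOWS"))
```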

Neo4j - Build a Graph
Slide from Neo Technology, "A NoSQL Overview and the Benefits of Graph Databases"

A Debatable Performance Evaluation

Conclusion
Use the right data model for the right problem

THE HADOOP ECOSYSTEM

Single vs Cluster
- How to analyze such large datasets?
- First thing, how to store them?
  - Single machine? 4TB drive is out
  - Cluster of machines? How many machines?
    - Need to worry about machine and drive failure. Really?
    - Need data backup, redundancy, recovery, etc.
- 3% of 100,000 hard drives fail within first 3 months
Failure Trends in a Large Disk Drive Population, http://static.googleusercontent.com/external_content/untrusted_dlcp/research.google.com/en/us/archive/disk_failures.pdf
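
A quick back-of-the-envelope calculation shows why failures cannot be ignored at this scale. The only input taken from the slide is the 3% figure; treating 3 months as 90 days is an assumption.

```python
drives = 100_000
failure_rate_3_months = 0.03   # figure quoted on the slide
days = 90                      # assumption: 3 months is roughly 90 days

failed = drives * failure_rate_3_months
per_day = failed / days
print(f"{failed:.0f} failed drives in 3 months, about {per_day:.0f} per day")
```

That is roughly 3,000 failed drives, or about 33 per day, so redundancy and recovery have to be automatic.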

Hadoop
- Open-source software for reliable, scalable, distributed computing
- Written in Java
- Scale to thousands of machines
  - Linear scalability: if you have 2 machines, your job r... as fast
- Uses simple programming model (MapReduce)
- Fault tolerant (HDFS)
  - Can recover from machine/disk failure (no need to r... computation)
- HDFS (Hadoop Distributed File System)
  - Fault tolerant (can recover from failures)
http://hadoop.a...

Hadoop vs NoSQL
- Hadoop: computing framework
  - Supports data-intensive applications
  - Includes MapReduce, HDFS, etc. (we will study MR mainly next)
- NoSQL: Not only SQL databases
  - Can be built ON Hadoop, e.g. HBase

Why Hadoop?
- Many research groups/projects use it
- Fortune 500 companies use it
- Low cost to set up and pick up
- It's FREE!! Well, not quite, if you do not have the machines :)
  - In HW5 you will be using the Amazon Web Services system to hire a Hadoop cluster on demand

Map-Reduce [Dean and Ghemawat 2004]
- Abstraction for simple computing
  - Hides details of parallelization, fault-tolerance, data-balancing
- MUST read! http://static.googleusercontent.com/media/research.google.com/en/us/archive/mapreduce-osdi04.pdf

Programming Model
- VERY simple!
- Input: key/value pairs
- Output: key/value pairs
- User has to specify TWO functions: map() and reduce()
  - Map(): takes an input pair and produces k intermediate key/value pairs
  - Reduce(): given an intermediate key and the set of values for that key, outputs another key/value pair

Master-Slave Architecture: Phases
1. Map phase: divide data and computation into smaller pieces; each machine ("mapper") works on one piece in parallel.
2. Shuffle phase: master sorts and moves results to "reducers".
3. Reduce phase: machines ("reducers") combine results in parallel.
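
The programming model and its three phases can be sketched as a single-process Python driver. This is only an illustration of the data flow, not how Hadoop is implemented; `run_mapreduce`, `wc_map`, `wc_reduce` and the input lines are hypothetical names and data.

```python
from collections import defaultdict

def run_mapreduce(records, map_fn, reduce_fn):
    """Single-process illustration of the three phases."""
    # 1. Map phase: each input (key, value) pair yields intermediate pairs.
    intermediate = []
    for key, value in records:
        intermediate.extend(map_fn(key, value))

    # 2. Shuffle phase: group intermediate values by key.
    groups = defaultdict(list)
    for k, v in intermediate:
        groups[k].append(v)

    # 3. Reduce phase: one call per distinct intermediate key, in key order.
    output = []
    for k in sorted(groups):
        output.extend(reduce_fn(k, groups[k]))
    return output

# Word count, with hypothetical input lines keyed by line number.
def wc_map(line_no, line):
    return [(word, 1) for word in line.split()]

def wc_reduce(word, counts):
    return [(word, sum(counts))]

lines = [(0, "to be or not"), (1, "to be")]
print(run_mapreduce(lines, wc_map, wc_reduce))
```

The user writes only `wc_map` and `wc_reduce`; everything inside `run_mapreduce` is what the framework takes care of, there across many machines.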

Example: histogram of fruit names
[Figure: input from HDFS goes to Map 0, Map 1 and Map 2, which emit pairs such as (apple, 1), (orange, 1), (strawberry, 1); the shuffle routes them to Reduce 0 and Reduce 1, which write results such as (apple, 2), (orange, 1), (strawberry, 1) back to HDFS]
Shuffle: sort the intermediate data on the key (the fruit name here). This phase also ensures that all tuples of one key end up in only one reducer.

map( fruit ) {
    output(fruit, 1);
}

reduce( fruit, v[1..n] ) {
    sum = 0;
    for (i = 1; i <= n; i++)
        sum = sum + v[i];
    output(fruit, sum);
}

Source: U Kang, 2013
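
The same example can be written as runnable Python, including the shuffle step that sends every key to exactly one reducer. This is a sketch under assumptions: the input splits are invented, and the `partition` function (a checksum of the key modulo the number of reducers) is one simple way to route keys, not necessarily what Hadoop does by default.

```python
import zlib
from itertools import groupby
from operator import itemgetter

NUM_REDUCERS = 2  # matches "Reduce 0" and "Reduce 1" in the example

def map_fruit(fruit):
    return (fruit, 1)

def reduce_fruit(fruit, values):
    return (fruit, sum(values))

def partition(key):
    """Deterministic choice of reducer, so equal keys always meet."""
    return zlib.crc32(key.encode()) % NUM_REDUCERS

# Hypothetical input splits, one per mapper.
splits = [["apple", "orange"], ["strawberry"], ["apple"]]

# Map, then route every pair to the reducer chosen by its key.
buckets = [[] for _ in range(NUM_REDUCERS)]
for split in splits:
    for fruit in split:
        pair = map_fruit(fruit)
        buckets[partition(pair[0])].append(pair)

# Each reducer sorts its bucket on the key and reduces one group at a time.
histogram = {}
for bucket in buckets:
    bucket.sort(key=itemgetter(0))
    for fruit, group in groupby(bucket, key=itemgetter(0)):
        key, total = reduce_fruit(fruit, [v for _, v in group])
        histogram[key] = total

print(histogram)
```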

Map and Reduce
IMPORTANT:
1. Each mapper runs the same code (your map function)
2. Ditto for each reducer (your reduce function)
→ Code is independent of the degree of parallelization, i.e. independent of the actual number of mappers and reducers used.
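
A small simulation makes this independence visible: the same `map_fn` and `reduce_fn` are run with different numbers of simulated mappers and reducers, and the answer does not change. The `run` helper and its round-robin splitting are assumptions made for this sketch.

```python
from collections import Counter

def map_fn(fruit):
    return (fruit, 1)

def reduce_fn(fruit, values):
    return (fruit, sum(values))

def run(records, num_mappers, num_reducers):
    """Simulate a job with a given number of mappers and reducers."""
    # Deal the input out to the mappers round-robin.
    splits = [records[i::num_mappers] for i in range(num_mappers)]
    # Every mapper runs the same map_fn; pairs are routed by key.
    buckets = [{} for _ in range(num_reducers)]
    for split in splits:
        for record in split:
            key, value = map_fn(record)
            buckets[hash(key) % num_reducers].setdefault(key, []).append(value)
    # Every reducer runs the same reduce_fn.
    return dict(reduce_fn(k, vs) for bucket in buckets for k, vs in bucket.items())

fruits = ["apple", "orange", "apple", "strawberry", "orange", "apple"]
assert run(fruits, 1, 1) == run(fruits, 3, 2) == run(fruits, 6, 4)
print(run(fruits, 3, 2) == dict(Counter(fruits)))
```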

Map-Reduce (MR) as SQL
select count(*) from FRUITS group by fruit_name
- Mapper: emits the group-by key (fruit_name)
- Reducer: computes count(*) for each key

More examples
- Count of URL access frequency
  - Map(): output <URL, 1>
  - Reduce(): output <URL, total_count>
- Reverse web-link graph
  - Map(): output <target, source> for each target link in a source web page
  - Reduce(): output <target, list[source]>
- Even more examples are given in the MR paper
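
The reverse web-link graph example might look like this in Python. The `pages` dictionary is a made-up crawl, and the grouping dictionary stands in for the shuffle phase.

```python
from collections import defaultdict

# Hypothetical crawl: source page -> pages it links to.
pages = {
    "a.html": ["b.html", "c.html"],
    "b.html": ["c.html"],
}

def map_links(source, targets):
    """Emit <target, source> for every link found in the source page."""
    return [(target, source) for target in targets]

def reduce_links(target, sources):
    """Emit <target, list[source]>."""
    return (target, sorted(sources))

grouped = defaultdict(list)          # stands in for the shuffle
for source, targets in pages.items():
    for target, src in map_links(source, targets):
        grouped[target].append(src)

reverse_graph = dict(reduce_links(t, s) for t, s in grouped.items())
print(reverse_graph)
```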

AWS Demo
Live demo of word-count example on laptop

Degree of graph: Example
- Find degree of every node in a graph
- Example: in a friendship graph, what is the number of friends of every person?
  Node 6 = 1, Node 2 = 3, Node 4 = 3, Node 1 = 2, Node 3 = 2, Node 5 = 3

Degree of each node in a graph
Suppose you have the edge list (== a table!):
6 4
4 6
4 3
3 4
4 5
5 4
...
Schema? Edges(from, to)
* Caveat: these are undirected graphs. In HW5 you will deal with directed graphs.

Degree of each node in a graph
Suppose you have the edge list (== a table!):
6 4
4 6
4 3
3 4
4 5
5 4
...
Schema? Edges(from, to)
SQL for degree list?
SELECT "from", count(*) FROM Edges GROUP BY "from"

Degree of each node in a graph
So in SQL:
SELECT "from", count(*) FROM Edges GROUP BY "from"
MapReduce?
- Mapper: emit (from, 1)
- Reducer: emit (from, count())
Remember the edge list:
6 4
4 6
4 3
3 4
4 5
5 4
...
I.e. essentially equivalent to the fruit-count example :)
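
To check that the SQL query and the MapReduce formulation agree, the sketch below runs both on the six edge rows shown on the slide (the remaining rows are elided there, so the degrees are partial). It uses Python's built-in `sqlite3`; because `from` and `to` are SQL keywords, the column names are quoted.

```python
import sqlite3
from collections import defaultdict

# Only the six rows shown on the slide; the rest of the edge list is elided there.
edges = [(6, 4), (4, 6), (4, 3), (3, 4), (4, 5), (5, 4)]

# SQL version. "from" and "to" are SQL keywords, so they are quoted here.
conn = sqlite3.connect(":memory:")
conn.execute('CREATE TABLE Edges ("from" INTEGER, "to" INTEGER)')
conn.executemany("INSERT INTO Edges VALUES (?, ?)", edges)
sql_degrees = dict(conn.execute(
    'SELECT "from", count(*) FROM Edges GROUP BY "from"'))

# MapReduce version: mapper emits (from, 1), reducer counts per key.
groups = defaultdict(list)
for src, _dst in edges:
    groups[src].append(1)                                        # map + shuffle
mr_degrees = {node: len(ones) for node, ones in groups.items()}  # reduce

print(sql_degrees == mr_degrees, mr_degrees)
```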

In HW5
- We ask you to get the in- and out-degree distributions of a large directed graph
- You will need 2 consecutive MR jobs for this, i.e.:
  Input (edges) → Map1(.) → Reduce1(.) → Map2(.) → Reduce2(.) → Output (= degree distribution)
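
The idea of chaining two jobs can be sketched on a toy graph. Everything here is hypothetical (the `run_job` driver, the edge list, and the restriction to out-degree); it only shows how the output of the first reduce becomes the input of the second map.

```python
from collections import defaultdict

def run_job(records, map_fn, reduce_fn):
    """Tiny single-process MapReduce driver (illustration only)."""
    groups = defaultdict(list)
    for record in records:
        for key, value in map_fn(record):
            groups[key].append(value)
    return [reduce_fn(key, values) for key, values in sorted(groups.items())]

# Job 1: edge (src, dst) -> (node, out-degree)
def map1(edge):
    src, _dst = edge
    return [(src, 1)]

def reduce1(node, ones):
    return (node, sum(ones))

# Job 2: (node, degree) -> (degree, number of nodes with that degree)
def map2(node_degree):
    _node, degree = node_degree
    return [(degree, 1)]

def reduce2(degree, ones):
    return (degree, sum(ones))

# Hypothetical directed edge list.
edges = [(1, 2), (1, 3), (2, 3), (3, 1), (4, 1), (4, 2)]
degrees = run_job(edges, map1, reduce1)          # output of job 1 ...
distribution = run_job(degrees, map2, reduce2)   # ... is the input of job 2
print(degrees)
print(distribution)
```

Note that in this sketch a node with no outgoing edges never appears as a key, so it is missing from the distribution; the in-degree case is analogous with the edge's destination as the key.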

MapReduce Architecture: Execution Overview
Source: U Kang, 2013

Master
For each map and reduce task, stores:
- The state (idle, in-progress, completed)
- The identity of the worker machine (for non-idle tasks)

Fault Tolerance
- Master pings each worker periodically
  - If the response times out, the worker is marked as failed
- An MR task on a failed worker is eligible for rescheduling
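
The master's bookkeeping and its reaction to a failed worker could be sketched as follows. The class and function names are invented for illustration, and the sketch is simplified: it only resets in-progress tasks, whereas the MapReduce paper also re-executes completed map tasks of a failed worker because their output sits on that worker's local disk.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional

class State(Enum):
    IDLE = "idle"
    IN_PROGRESS = "in-progress"
    COMPLETED = "completed"

@dataclass
class Task:
    task_id: int
    state: State = State.IDLE
    worker: Optional[str] = None   # identity of the worker, for non-idle tasks

def handle_ping_timeout(tasks, failed_worker):
    """Make in-progress tasks of a failed worker eligible for rescheduling."""
    for task in tasks:
        if task.worker == failed_worker and task.state is State.IN_PROGRESS:
            task.state = State.IDLE
            task.worker = None

tasks = [
    Task(0, State.IN_PROGRESS, "w1"),
    Task(1, State.COMPLETED, "w2"),
    Task(2, State.IN_PROGRESS, "w2"),
]
handle_ping_timeout(tasks, "w2")   # worker "w2" stopped answering pings
print([(t.task_id, t.state.value, t.worker) for t in tasks])
```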

Locality
- GFS (Google File System)
  - 3-way replication of data
  - Location information of the input files
- MASTER
  - Schedules a map task on a machine with a replica
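
A locality-aware choice of worker can be sketched like this. The replica map, machine names and `pick_worker` function are hypothetical, and the function assumes at least one idle worker exists.

```python
# Hypothetical replica map: input split -> machines holding one of its 3 replicas.
replicas = {
    "split-0": {"m1", "m2", "m3"},
    "split-1": {"m2", "m4", "m5"},
}

def pick_worker(split, idle_workers):
    """Prefer an idle machine that already stores the split; else any idle one."""
    local = sorted(replicas[split] & idle_workers)
    if local:
        return local[0], True           # data-local: no network copy needed
    return sorted(idle_workers)[0], False

print(pick_worker("split-0", {"m3", "m4"}))
print(pick_worker("split-1", {"m1", "m3"}))
```

In the first call an idle machine holds a replica, so the map task reads its input locally; in the second call no idle machine does, so the input has to travel over the network.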

Backup Tasks (Speculative Execution)
- The problem of the "straggler"
  - When one bad execution prevents completion
- Solution
  - When an MR operation is close to completion, the master schedules backup executions of the remaining in-progress tasks
  - The task is marked as completed whenever either the primary or the backup execution completes
- 44% faster
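
The effect of backup tasks can be illustrated with a tiny deterministic simulation. All durations are made up, and the model is deliberately simple: at time `launch_at` a backup copy is started for every task that is still running, and each backup takes `backup_duration` seconds.

```python
def job_time(primary, launch_at=None, backup_duration=None):
    """Finish time of a job; the job ends when its slowest task ends."""
    if launch_at is None:
        return max(primary)                  # no backup tasks
    finish = []
    for t in primary:
        if t > launch_at:                    # still in progress at launch_at
            # Done as soon as either the primary or the backup completes.
            finish.append(min(t, launch_at + backup_duration))
        else:
            finish.append(t)
    return max(finish)

# Hypothetical task durations in seconds; the third task is a straggler.
primary = [10, 11, 60, 9]
print(job_time(primary))
print(job_time(primary, launch_at=12, backup_duration=10))
```

Without backups the straggler alone decides the job time; with a backup copy the job finishes as soon as the faster of the two executions is done.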

Refinements (SKIP!)
- Partitioning function
- Ordering guarantee
- Combiner function
- Status information
- Counter

Status information
- Master has an internal HTTP server
  - Exports status web pages
  - Tells you progress information
  - Are tasks active?

Performance: Grep

Performance: Sort
Source: U Kang, 2013

MapReduce in Google (MR @ Google)

MR @ Google

Conclusions
- Hadoop is a distributed data-intensive computing framework
- MapReduce
  - Simple programming paradigm
  - Surprisingly powerful (may not be suitable for all tasks though)
- Hadoop has a specialized file system and a Master-Slave architecture to scale up

NoSQL and Hadoop
- Hot area with several new problems
- Good for academic research
- Good for industry
= Fun AND Profit :)