Introduction to NoSQL Databases and MapReduce. Tore Risch Information Technology Uppsala University 2014-05-12

Similar documents
Introduction to NoSQL Databases. Tore Risch Information Technology Uppsala University

Systems Infrastructure for Data Science. Web Science Group Uni Freiburg WS 2012/13

SQL VS. NO-SQL. Adapted Slides from Dr. Jennifer Widom from Stanford

Big Data: Using ArcGIS with Apache Hadoop. Erik Hoel and Mike Park

Lecture Data Warehouse Systems

Introduction To Hive

COSC 6397 Big Data Analytics. 2 nd homework assignment Pig and Hive. Edgar Gabriel Spring 2015

Sentimental Analysis using Hadoop Phase 2: Week 2

Cloudera Certified Developer for Apache Hadoop

Chapter 11 Map-Reduce, Hadoop, HDFS, Hbase, MongoDB, Apache HIVE, and Related

Implement Hadoop jobs to extract business value from large and varied data sets

Introduction to Hadoop

An Approach to Implement Map Reduce with NoSQL Databases

CS 378 Big Data Programming. Lecture 2 Map- Reduce

Big Data and Scripting Systems build on top of Hadoop

CS 378 Big Data Programming

Big Data and Apache Hadoop s MapReduce

Programming Hadoop 5-day, instructor-led BD-106. MapReduce Overview. Hadoop Overview

Can the Elephants Handle the NoSQL Onslaught?

Hadoop: A Framework for Data- Intensive Distributed Computing. CS561-Spring 2012 WPI, Mohamed Y. Eltabakh

NoSQL. Thomas Neumann 1 / 22

The Hadoop Eco System Shanghai Data Science Meetup

Introduction to Apache Hive

American International Journal of Research in Science, Technology, Engineering & Mathematics

Map Reduce & Hadoop Recommended Text:

Cloud Scale Distributed Data Storage. Jürmo Mehine

Big Data. Donald Kossmann & Nesime Tatbul Systems Group ETH Zurich

Open source large scale distributed data management with Google s MapReduce and Bigtable

Integrating Big Data into the Computing Curricula

Facebook s Petabyte Scale Data Warehouse using Hive and Hadoop

Amazon Redshift & Amazon DynamoDB Michael Hanisch, Amazon Web Services Erez Hadas-Sonnenschein, clipkit GmbH Witali Stohler, clipkit GmbH

Comparison of the Frontier Distributed Database Caching System with NoSQL Databases

A bit about Hadoop. Luca Pireddu. March 9, CRS4Distributed Computing Group. (CRS4) Luca Pireddu March 9, / 18

ITG Software Engineering

How To Scale Out Of A Nosql Database

Open source software framework designed for storage and processing of large scale data on clusters of commodity hardware

Hadoop IST 734 SS CHUNG

BIG DATA HANDS-ON WORKSHOP Data Manipulation with Hive and Pig

extensible record stores document stores key-value stores Rick Cattel s clustering from Scalable SQL and NoSQL Data Stores SIGMOD Record, 2010

Using distributed technologies to analyze Big Data

COURSE CONTENT Big Data and Hadoop Training

16.1 MAPREDUCE. For personal use only, not for distribution. 333

Big Data Analytics in LinkedIn. Danielle Aring & William Merritt

DATA MINING WITH HADOOP AND HIVE Introduction to Architecture

CSE-E5430 Scalable Cloud Computing Lecture 2

Hadoop and Map-Reduce. Swati Gore

Qsoft Inc

Big Data and Scripting Systems build on top of Hadoop

Hadoop Job Oriented Training Agenda

Hadoop implementation of MapReduce computational model. Ján Vaňo

Infrastructures for big data

Introduction to Apache Hive

HADOOP ADMINISTATION AND DEVELOPMENT TRAINING CURRICULUM

Hadoop Distributed File System. -Kishan Patel ID#

OLH: Oracle Loader for Hadoop OSCH: Oracle SQL Connector for Hadoop Distributed File System (HDFS)

Hadoop and Hive Development at Facebook. Dhruba Borthakur Zheng Shao {dhruba, Presented at Hadoop World, New York October 2, 2009

Hadoop Ecosystem B Y R A H I M A.

Spark in Action. Fast Big Data Analytics using Scala. Matei Zaharia. project.org. University of California, Berkeley UC BERKELEY

Log Mining Based on Hadoop s Map and Reduce Technique

Hadoop2, Spark Big Data, real time, machine learning & use cases. Cédric Carbone Twitter

Chapter 7. Using Hadoop Cluster and MapReduce

MapReduce with Apache Hadoop Analysing Big Data

Data storing and data access

Agenda. ! Strengths of PostgreSQL. ! Strengths of Hadoop. ! Hadoop Community. ! Use Cases

A Brief Outline on Bigdata Hadoop

Xiaoming Gao Hui Li Thilina Gunarathne

HadoopRDF : A Scalable RDF Data Analysis System

ITG Software Engineering

Databases 2 (VU) ( )

Big Data Technology Map-Reduce Motivation: Indexing in Search Engines

Open Source Technologies on Microsoft Azure

Pla7orms for Big Data Management and Analysis. Michael J. Carey Informa(on Systems Group UCI CS Department

Apache Hadoop. Alexandru Costan

Jeffrey D. Ullman slides. MapReduce for data intensive computing

Big Data and Scripting map/reduce in Hadoop

Big Data Course Highlights

NoSQL Data Base Basics

Big Data Analytics with MapReduce VL Implementierung von Datenbanksystemen 05-Feb-13

Hadoop. MPDL-Frühstück 9. Dezember 2013 MPDL INTERN

Hadoop 101. Lars George. NoSQL- Ma4ers, Cologne April 26, 2013

Pro Apache Hadoop. Second Edition. Sameer Wadkar. Madhu Siddalingaiah

NoSQL and Hadoop Technologies On Oracle Cloud

How To Handle Big Data With A Data Scientist

Workshop on Hadoop with Big Data

Parallel Databases. Parallel Architectures. Parallelism Terminology 1/4/2015. Increase performance by performing operations in parallel

Large scale processing using Hadoop. Ján Vaňo

MapReduce. MapReduce and SQL Injections. CS 3200 Final Lecture. Introduction. MapReduce. Programming Model. Example

Scaling Up 2 CSE 6242 / CX Duen Horng (Polo) Chau Georgia Tech. HBase, Hive

Big Data Weather Analytics Using Hadoop

A Comparison of Approaches to Large-Scale Data Analysis

Introduction to Hadoop

Transcription:

Introduction to NoSQL Databases and MapReduce Tore Risch Information Technology Uppsala University 2014-05-12

What is a NoSQL Database? 1. A key/value store Basic index manager, no complete query language E.g. Google BigTable, Amazon Dynamo 2. A DBMS where mapreduce is used instead of queries Manual programs to iterate over entire data sets E.g. Hadoop, CouchDB, Dynamo 3. A mapreduce engine with a query language on top: HIVE on top of Hadoop provides HiveQL Provides non-procedural data analythics (select from groupby) without detailed programming Executed in batch as parallel Hadoop jobs

NoSQL Characteristics Highly distributed and parallel architectures Typically runs on data centers This is similar to parallel databases! Highly scalable systems by compromised consistency Eventual consistency Or perhaps never consistency Puts burden on programmer to handle consistency! => Mainly suitable for applications not needing consistency

MapReduce Parallel batch processing using mapreduce Many NoSQL databases use mapreduce for parallel batch processing of data stored in data centers Highly scalable implementation of parallell batch processing of same (e.g. Java, JaVaScript, Pyhon) program over large amounts of data stored in different files Based on a scalable file system (e.g. HDFS) The mapreduce function: Applies a (costly) user function mapper producing key/value pairs in parallel on many nodes accessing files in a cluster Applies a user aggregate function on the key/value pairs produced by the mapper Very similar to GROUP BY in SQL! Read reference article on MapReduce: http://dl.acm.org/citation.cfm?id=1327452.1327492

File I/O Mapreduce Data file Map Partition Data file Map Partition Reduce Data file Data file Map Map Partition Partition Reduce Output Writer Result file Data file Map Partition Reduce Data file Map Partition

Input reader Mapreduce stages System component that reads files from scalable file system (e.g. HDFS) and sends to map functions applied in parallell Map function Applied in parallel on many different files Parses input file data from HDFS Does some (expensive) computation Emits key value pairs as result Result stored by MapReduce system as file Partition function (optional) Partitions output key/value pairs from map function into groups of key/value pairs to be reduced in parallel Usually hash partitioning Reduce function Iterates over set of key/value pairs to produced a reduced set of key value pairs stored in the file system C.f. aggregate functions

HIVE SQL-like query language HiveQL support on top of Hadoop Developed by Facebook, now maintained by Netflix Queries very often involving grouping and statistics Queries compiled into mapreduce jobs Handles 80% of mapreduce analysis tasks at Facebook Much easier to use compared to raw mapreduce coding Substantial impovement of programming productivity Non-programmers can analyze data

Hive architecture

HIVQL example Log file: 2012-02-03 20:26:41 SampleClass3 [TRACE] verbose detail for id 1527353937 2012-02-03 20:26:41 SampleClass2 [TRACE] verbose detail for id 191364434 Java.lang.Exception: 2012-02-03 20:27:10 SampleClass7 [ERROR] incorrect format for id 8276745 2012-02-03 20:29:38 SampleClass1 [DEBUG] detail for id 903114158 HIVEQL: CREATE TABLE logs(t1 string, t2 string, t3 string, t4 string, t5 string, t6 string, t7 string) ROW FORMAT DELIMITED FIELDS TERMINATED BY ' '; LOAD DATA LOCAL INPATH 'sample.log' OVERWRITE INTO TABLE logs; SELECT t4 AS sev, COUNT(*) AS cnt FROM logs WHERE t4 LIKE '[%' GROUP BY t4; [TRACE] 2 [DEBUG] 1

HIVQL example Log file: 2012-02-03 20:26:41 SampleClass3 [TRACE] verbose detail for id 1527353937 2012-02-03 20:26:41 SampleClass2 [TRACE] verbose detail for id 191364434 Java.lang.Exception: 2012-02-03 20:27:10 SampleClass7 [ERROR] incorrect format for id 8276745 2012-02-03 20:29:38 SampleClass1 [DEBUG] detail for id 903114158 HIVEQL: SELECT t5 AS sev, COUNT(*) AS cnt FROM logs WHERE t5 LIKE '[%' GROUP BY t5; [ERROR] 1 SELECT L.sev, SUM(L.cnt) FROM ( SELECT t4 AS sev, COUNT(*) AS cnt FROM logs WHERE t4 LIKE '[%' GROUP BY t4 UNION ALL SELECT t5 AS sev, COUNT(*) AS cnt FROM logs WHERE t5 LIKE '[%' GROUP BY t5) L GROUP BY L.sev; [TRACE] 2 [DEBUG] 1 [ERROR] 1 => 3 Mapreduce jobs!

Wordcount in SQL Alt 1, assume words on documents stored in table: CREATE TABLE docs(id INTEGER, word VARCHAR(20), The query becomes: SELECT d.word, COUNT(d.word) PRIMARY KEY(id, word)) FROM docs d GROUP BY d.word Problem: DOCS is table, not stored in file Alt 2, use user defined table function to access documents in file: SELECT d.word, COUNT(d.word) FROM mydocuments( C:/mydocuments ) AS d GROUP BY d.word

HIVEQL vs raw mapreduce Raw mapreduce: Java (Python, C++, etc.) program does map and reduce Very common use of mapreduce: Statistics collection over files (count, sum, stdev, etc) HIVEQQL handles basic statistics 80% of applications When advanced statistics not supported in HIVEQL (or SQL): Alt 1: User defined aggregate functions in HIVE (UDAF) Can be generally used in other queries too Alt 2: Raw mapreduce Code may be complicated Code cannot be used in queries

Reading raw data files to RDBMS Major point with mapreduce: Save time to load database and build index Accessing CSV (Comma Separated Values) file Can treat CSV file as relational table Modern DBMS have bulk load facilities Avoid using insert to bulk load At least do not do commit after each insert (MySQL autommit default) Bulk load speed approaches file copy time Orders of magnitude faster than naïve inserts Automatic parallel bulk loading Binary and formatted bulk loading supported Indexes are voluntary and can be built afterwards

Relational Databases vs mapreduce Modern RDBMSs have user defined table functions and aggregate functions Can read data from files using UDFs User defined aggregate functions are incremental and parallelized RDBMSs have indexing and high parallelism to provide scalability Notice that it takes time to build index when database is loaded => May slow down database loading considerable When is HIVE with mapreduce better than RDBs? One shot queries when no indexing is needed Massively parallel very expensive brute force computations embarrasingly parallel computations Google F1: Use highly distributed DBMS to select input to brute force mapreduce jobs