Big Data Data-intensive Computing Methods, Tools, and Applications (CMSC 34900)


Big Data Data-intensive Computing Methods, Tools, and Applications (CMSC 34900) Ian Foster Computation Institute Argonne National Lab & University of Chicago


SQL Overview
Structured Query Language: the standard for relational database management systems (RDBMS)
RDBMS: a database management system that manages data as a collection of tables in which all relationships are represented by common values in related tables

History of SQL

SQL Environment
Catalog: a set of schemas that constitute the description of a database
Schema (or database): the structure that contains descriptions of objects created by a user (base tables, views, constraints)
Data Definition Language (DDL): commands that define a database, including creating, altering, and dropping tables and establishing constraints
Data Manipulation Language (DML): commands that maintain and query a database
Data Control Language (DCL): commands that control a database, including administering privileges and committing data
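
The DDL/DML distinction above can be sketched in a few statements. This is an illustrative example using Python's sqlite3 module; the CUSTOMER_T table and its columns are invented, and SQLite has no DCL commands such as GRANT or REVOKE:

```python
import sqlite3

# Hypothetical table and column names, run against an in-memory database
conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# DDL: define a base table
cur.execute("CREATE TABLE customer_t (customer_id INTEGER PRIMARY KEY, "
            "customer_name TEXT NOT NULL)")

# DML: maintain and query the database
cur.execute("INSERT INTO customer_t (customer_id, customer_name) VALUES (1, 'Acme')")
conn.commit()
rows = cur.execute("SELECT customer_name FROM customer_t").fetchall()
print(rows)  # [('Acme',)]
```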


A table called List_of_people

Figure 7-4: DDL, DML, DCL, and the database development process

Common SQL Commands
Data Definition Language (DDL): CREATE, DROP, ALTER
Data Manipulation Language (DML): SELECT, UPDATE, INSERT, DELETE
Data Control Language (DCL): GRANT, REVOKE

Internal Schema Definition
Controls processing and storage efficiency: choice of indexes, file organizations for base tables, file organizations for indexes, data clustering, and statistics maintenance
Creating an index speeds up random and sequential access to base table data. Example:
CREATE INDEX NAME_IDX ON CUSTOMER_T (CUSTOMER_NAME)
creates an index on the CUSTOMER_NAME field of the CUSTOMER_T table, and
DROP INDEX NAME_IDX
removes it.
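
As a hedged illustration of the CREATE INDEX example above, the same statements can be run in SQLite, where EXPLAIN QUERY PLAN reports whether a query would use the index (the exact wording of the plan text varies by SQLite version):

```python
import sqlite3

# Same invented CUSTOMER_T example as on the slide, in-memory database
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE customer_t (customer_id INTEGER, customer_name TEXT)")
cur.execute("CREATE INDEX name_idx ON customer_t (customer_name)")

# The plan for an equality lookup on the indexed column mentions the index
plan = cur.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM customer_t WHERE customer_name = 'Acme'"
).fetchall()
print(plan[0][-1])  # e.g. "SEARCH customer_t USING INDEX name_idx (customer_name=?)"

cur.execute("DROP INDEX name_idx")  # remove the index again
```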

SELECT Statement

MapReduce or SQL?

An example problem
We have a large number of documents, each labeled in some way with the name of the site where it occurs. Find the sites with documents that contain more than five instances of the words IBM or Google.

MapReduce approach
Map: create a histogram for each document listing frequently occurring words
Reduce: group documents by their site of origin
Map: identify documents with more than five occurrences

map(String key, String value)
  // key: document name
  // value: document contents
  for each word w in value:
    EmitIntermediate(w, "1")
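
The pseudocode above is the classic word-count map function. A minimal single-process sketch of the full map/shuffle/reduce cycle (the real framework distributes these phases across machines; the function names here are invented):

```python
from collections import defaultdict

def map_fn(key, value):
    # key: document name, value: document contents
    for w in value.split():
        yield (w, 1)

def reduce_fn(key, values):
    # key: a word, values: all counts emitted for that word
    return (key, sum(values))

def run(documents):
    intermediate = defaultdict(list)
    for name, contents in documents.items():
        for k, v in map_fn(name, contents):   # map phase
            intermediate[k].append(v)         # shuffle: group by key
    return dict(reduce_fn(k, vs) for k, vs in intermediate.items())

print(run({"d1": "big data big", "d2": "data"}))  # {'big': 2, 'data': 2}
```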

map(String key, String value)
  // key: (site id + document name)
  // value: document contents
  histogram = CountWords(value);
  EmitIntermediate(site-id(key), (value, histogram));
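
A toy single-process sketch of this histogram-per-document approach, assuming documents arrive as (site id, document name, contents) triples (a hypothetical shape) and with collections.Counter standing in for CountWords:

```python
from collections import Counter

def solve(documents):
    # documents: list of (site_id, doc_name, contents) triples
    hits = set()
    for site_id, _name, contents in documents:
        histogram = Counter(contents.split())           # CountWords
        if histogram["IBM"] > 5 or histogram["Google"] > 5:
            hits.add(site_id)                           # site qualifies
    return hits

docs = [("a.com", "d1", "IBM " * 6), ("b.com", "d2", "Google Google")]
print(solve(docs))  # {'a.com'}
```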

SQL solution
Assume a table Documents of the form: (siteid, docid, word, …)
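
The actual query was not transcribed, but here is a hedged sketch of what it might look like, assuming each row of Documents records one word occurrence, run against an in-memory SQLite database:

```python
import sqlite3

# Assumed row shape: one (siteid, docid, word) row per word occurrence
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE documents (siteid TEXT, docid TEXT, word TEXT)")
rows = [("a.com", "d1", "IBM")] * 6 + [("b.com", "d2", "Google")] * 3
cur.executemany("INSERT INTO documents VALUES (?, ?, ?)", rows)

# Sites having a document with more than five IBM/Google occurrences
result = cur.execute("""
    SELECT DISTINCT siteid
    FROM documents
    WHERE word IN ('IBM', 'Google')
    GROUP BY siteid, docid
    HAVING COUNT(*) > 5
""").fetchall()
print(result)  # [('a.com',)]
```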

MapReduce: A major step backwards
A giant step backward: no schemas; Codasyl instead of Relational
A sub-optimal implementation: uses brute-force sequential search instead of indexing; materializes O(m·r) intermediate files; does not account for data skew
Not novel at all: represents a specific implementation of well-known techniques developed nearly 25 years ago
Missing most common current DBMS features: bulk loader, indexing, updates, transactions, integrity constraints, referential integrity, views
Incompatible with DBMS tools: report writers, business intelligence tools, data mining tools, replication tools, database design tools

Architectural Element   Parallel Databases                           MapReduce
Schema support          Structured                                   Unstructured
Indexing                B-trees or hash-based                        None
Programming model       Relational                                   Codasyl
Data distribution       Projections before aggregation               Logic moved to data, but no optimizations
Execution strategy      Push                                         Pull
Flexibility             No (but Ruby on Rails, LINQ)                 Yes
Fault tolerance         Transactions must be restarted on failure    Yes: replication, speculative execution

MapReduce response
They label the following claims as misconceptions:
"MapReduce cannot use indices and implies a full scan of all input data": data on each node can be indexed or otherwise partitioned
"MapReduce inputs and outputs are always simple files in a file system": no, they can be databases, tables, etc.
"MapReduce requires the use of inefficient textual data formats": no, Google often uses other formats

The comparison paper says, "MR is always forced to start a query with a scan of the entire input file." In fact, MapReduce does not require a full scan over the data; it requires only an implementation of its input interface that yields a set of records matching some input specification. Examples of input specifications:
All records in a set of files
All records with a visit-date in the range [2000-01-15 .. 2000-01-22]
All data in Bigtable table T whose "language" column is "Turkish"

Production uses of MapReduce at Google include:
Extracting outgoing links from a collection of HTML documents and aggregating by target document
Stitching together overlapping satellite images to remove seams and to select high-quality imagery for Google Earth
Generating a collection of inverted index files using a compression scheme tuned for efficient support of Google search queries
Processing all road segments in the world and rendering map tile images that display these segments for Google Maps
Fault-tolerant parallel execution of programs written in higher-level languages such as Sawzall and Pig Latin

Grep example
Scan through a data set of 100-byte records looking for a three-character pattern. Each record consists of a unique key in the first 10 bytes, followed by a 90-byte random value. The search pattern is found in the last 90 bytes only once in every 10,000 records.
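
A toy recreation of this benchmark, with the record layout as described above; digits stand in for random bytes so the pattern cannot appear by accident in this sketch:

```python
import string

PATTERN = "XYZ"
records = []
for i in range(10_000):
    key = f"{i:010d}"            # unique key in the first 10 bytes
    value = string.digits * 9    # 90-byte stand-in for the random value
    records.append(key + value)

# Plant the pattern once, as in "once in every 10,000 records"
records[5000] = records[5000][:10] + PATTERN + records[5000][13:]

# The scan itself: check the 90-byte payload of every record
matches = [r[:10] for r in records if PATTERN in r[10:]]
print(matches)  # ['0000005000']
```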

Dataset
Record = 10-byte key + 90-byte random value
5.6 million records = 535 MB per node
Another data set = 1 TB per cluster
Data loading
Hadoop: command-line utility
DBMS-X: LOAD SQL command, plus an administrative command to reorganize the data

Grep Task Results
SELECT * FROM Data WHERE field LIKE '%XYZ%';

Select Task Results
SELECT pageurl, pagerank FROM Rankings WHERE pagerank > X;

Join Task

Summary
DBMS-X was 3.2 times faster, and Vertica 2.3 times faster, than Hadoop.
The parallel DBMSs win because of: B-tree indices that speed the execution of selection operations; novel storage mechanisms (e.g., column orientation); aggressive compression techniques with the ability to operate directly on compressed data; and sophisticated parallel algorithms for querying large amounts of relational data.
Remaining questions: ease of installation and use? Fault tolerance? Loading data?