Big Data Data-intensive Computing Methods, Tools, and Applications (CMSC 34900)

Ian Foster, Computation Institute, Argonne National Lab & University of Chicago

SQL Overview
Structured Query Language (SQL): the standard language for relational database management systems (RDBMS)
RDBMS: a database management system that manages data as a collection of tables in which all relationships are represented by common values in related tables

History of SQL

SQL Environment
Catalog: a set of schemas that constitute the description of a database
Schema (or Database): the structure that contains descriptions of objects created by a user (base tables, views, constraints)
Data Definition Language (DDL): commands that define a database, including creating, altering, and dropping tables and establishing constraints
Data Manipulation Language (DML): commands that maintain and query a database
Data Control Language (DCL): commands that control a database, including administering privileges and committing data

A table called List_of_people

Figure 7-4 DDL, DML, DCL, and the database development process

Common SQL Commands
Data Definition Language (DDL): CREATE, DROP, ALTER
Data Manipulation Language (DML): SELECT, UPDATE, INSERT, DELETE
Data Control Language (DCL): GRANT, REVOKE
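The DDL and DML commands listed above can be tried out with a minimal sketch like the following. It assumes Python's built-in sqlite3 module and a hypothetical customer_t table with made-up rows; SQLite has no GRANT or REVOKE, so DCL is not shown.

```python
import sqlite3

# In-memory database so the sketch leaves nothing behind.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# DDL: CREATE and ALTER define the structure of a hypothetical table.
cur.execute(
    "CREATE TABLE customer_t (customer_id INTEGER PRIMARY KEY, customer_name TEXT)"
)
cur.execute("ALTER TABLE customer_t ADD COLUMN city TEXT")

# DML: INSERT, UPDATE, DELETE, and SELECT maintain and query the data.
cur.execute(
    "INSERT INTO customer_t (customer_name, city) VALUES (?, ?)", ("Ada", "Chicago")
)
cur.execute(
    "INSERT INTO customer_t (customer_name, city) VALUES (?, ?)", ("Grace", "Argonne")
)
cur.execute(
    "UPDATE customer_t SET city = ? WHERE customer_name = ?", ("Lemont", "Grace")
)
cur.execute("DELETE FROM customer_t WHERE customer_name = ?", ("Ada",))
rows = cur.execute("SELECT customer_name, city FROM customer_t").fetchall()
print(rows)

# DDL: DROP removes the table definition along with its data.
cur.execute("DROP TABLE customer_t")
remaining = cur.execute(
    "SELECT name FROM sqlite_master WHERE type = 'table'"
).fetchall()
print(remaining)
```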

Internal Schema Definition
Control processing/storage efficiency:
  Choice of indexes
  File organizations for base tables
  File organizations for indexes
  Data clustering
  Statistics maintenance
Creating indexes
  Speed up random/sequential access to base table data
  Example:
    CREATE INDEX NAME_IDX ON CUSTOMER_T (CUSTOMER_NAME)
  This makes an index for the CUSTOMER_NAME field of the CUSTOMER_T table
  To remove the index:
    DROP INDEX NAME_IDX

SELECT Statement

MapReduce or SQL?

An example problem
We have a large number of documents, each labeled in some way with the name of the site where they occur
Find sites with documents that contain more than five instances of the words IBM or Google

MapReduce approach
Map: Create a histogram for each document listing frequently occurring words
Reduce: Group documents by their site of origin
Map: Identify documents with more than five occurrences

map(String key, String value)
  // key: document name
  // value: document contents
  for each word w in value:
    EmitIntermediate(w, "1");

map(String key, String value)
  // key: (site id + document name)
  // value: document contents
  histogram = CountWords(value);
  EmitIntermediate(site-id(key), (value, histogram));
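The approach can be simulated in a single process. The following is a rough Python sketch, not an actual MapReduce job: the run driver, the sample documents, case-insensitive matching, and counting IBM and Google together are all assumptions. For brevity it folds the final "identify documents" map step into the reduce.

```python
from collections import Counter, defaultdict

TARGET_WORDS = {"ibm", "google"}  # assumption: case-insensitive matching
THRESHOLD = 5  # "more than five instances"


def map_histogram(key, value):
    """key: (site_id, doc_name); value: document contents."""
    site_id, doc_name = key
    histogram = Counter(value.lower().split())
    # Emit the histogram keyed by site so the shuffle groups documents by site.
    yield site_id, (doc_name, histogram)


def reduce_by_site(site_id, doc_histograms):
    """Keep documents whose combined count of target words exceeds the threshold."""
    hits = [
        doc_name
        for doc_name, histogram in doc_histograms
        if sum(histogram[w] for w in TARGET_WORDS) > THRESHOLD
    ]
    if hits:
        yield site_id, hits


def run(documents):
    """Hypothetical single-process driver standing in for the framework."""
    grouped = defaultdict(list)
    for key, value in documents.items():
        for site_id, payload in map_histogram(key, value):
            grouped[site_id].append(payload)
    result = {}
    for site_id, payloads in grouped.items():
        for site, hits in reduce_by_site(site_id, payloads):
            result[site] = hits
    return result


docs = {
    ("site-a", "d1"): "IBM Google IBM Google IBM Google news",
    ("site-a", "d2"): "nothing relevant here",
    ("site-b", "d3"): "Google once",
}
sites = run(docs)
print(sites)
```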

SQL solution
Assume a table Documents of the form: (siteid, docid, word, )
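The column list on the slide is incomplete, so the query below is only one possible reading. This sketch assumes a Documents table with one row per word occurrence and just the three columns siteid, docid, and word, and it counts IBM and Google together. It uses Python's sqlite3 module and made-up sample rows.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Assumed schema: one row per occurrence of a word in a document.
conn.execute("CREATE TABLE Documents (siteid TEXT, docid TEXT, word TEXT)")

sample = (
    [("site-a", "d1", "IBM")] * 3
    + [("site-a", "d1", "Google")] * 3  # d1 has 6 matching rows: qualifies
    + [("site-b", "d3", "Google")]      # 1 matching row: does not qualify
    + [("site-c", "d4", "IBM")] * 5     # exactly 5: not "more than five"
)
conn.executemany("INSERT INTO Documents VALUES (?, ?, ?)", sample)

# Group by document, keep documents with more than five matching rows,
# then report each qualifying site once.
query = """
    SELECT DISTINCT siteid
    FROM Documents
    WHERE word IN ('IBM', 'Google')
    GROUP BY siteid, docid
    HAVING COUNT(*) > 5
    ORDER BY siteid
"""
sites = conn.execute(query).fetchall()
print(sites)
```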

MapReduce: A major step backwards
A giant step backward
  No schemas; CODASYL instead of relational
A sub-optimal implementation
  Uses brute-force sequential search instead of indexing
  Materializes O(m·r) intermediate files
  Does not account for data skew
Not novel at all
  Represents a specific implementation of well-known techniques developed nearly 25 years ago
Missing most common current DBMS features
  Bulk loader, indexing, updates, transactions, integrity constraints, referential integrity, views
Incompatible with DBMS tools
  Report writers, business intelligence tools, data mining tools, replication tools, database design tools

Architectural Element | Parallel Databases | MapReduce
Schema Support | Structured | Unstructured
Indexing | B-trees or hash based | None
Programming Model | Relational | CODASYL
Data Distribution | Projections before aggregation | Logic moved to data, but no optimizations
Execution Strategy | Push | Pull
Flexibility | No, but Ruby on Rails, LINQ | Yes
Fault Tolerance | Transactions have to be restarted in the event of a failure | Yes: replication, speculative execution

MapReduce response
They label as misconceptions:
  "MapReduce cannot use indices and implies a full scan of all input data"
    Data on each node can be indexed or otherwise partitioned
  "MapReduce input and outputs are always simple files in a file system"
    No, they can be databases, tables, etc.
  "MapReduce requires the use of inefficient textual data formats"
    No, Google often uses other formats

The comparison paper says, "MR is always forced to start a query with a scan of the entire input file."
MapReduce does not require a full scan over the data; it requires only an implementation of its input interface to yield a set of records that match some input specification.
Examples of input specifications are:
  All records in a set of files
  All records with a visit-date in the range [2000-01-15..2000-01-22]
  All data in Bigtable table T whose "language" column is "Turkish"
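The idea of an input interface that yields only matching records can be illustrated with a small sketch. The DateRangeInput class, its read method, and the sample records are hypothetical and are not part of any MapReduce API; the sketch assumes records are kept sorted by visit date, so a binary search can stand in for an index.

```python
import bisect
from datetime import date


class DateRangeInput:
    """Hypothetical input reader that yields only records in a date range."""

    def __init__(self, records):
        # Assumption: data is stored sorted by visit date, so we can seek.
        self.records = sorted(records, key=lambda r: r["visit_date"])
        self.dates = [r["visit_date"] for r in self.records]
        self.records_read = 0

    def read(self, start, end):
        # Binary search for the range boundaries instead of scanning everything.
        lo = bisect.bisect_left(self.dates, start)
        hi = bisect.bisect_right(self.dates, end)
        for record in self.records[lo:hi]:
            self.records_read += 1
            yield record


def map_visit(record):
    # The map function only ever sees records the reader chose to yield.
    yield record["url"], 1


records = [{"url": f"page{i}", "visit_date": date(2000, 1, i)} for i in range(1, 32)]
reader = DateRangeInput(records)
emitted = [
    pair
    for record in reader.read(date(2000, 1, 15), date(2000, 1, 22))
    for pair in map_visit(record)
]
print(len(emitted), reader.records_read)
```

Of the 31 sample records, only those dated January 15 through January 22 are handed to the map function.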

Extracting outgoing links from a collection of HTML documents and aggregating by target document
Stitching together overlapping satellite images to remove seams and to select high-quality imagery for Google Earth
Generating a collection of inverted index files using a compression scheme tuned for efficient support of Google search queries
Processing all road segments in the world and rendering map tile images that display these segments for Google Maps
Fault-tolerant parallel execution of programs written in higher-level languages such as Sawzall and Pig Latin

Grep example
Scan through a data set of 100-byte records looking for a three-character pattern.
Each record consists of a unique key in the first 10 bytes, followed by a 90-byte random value.
The search pattern is only found in the last 90 bytes once in every 10,000 records.
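A scaled-down version of this task can be sketched in Python. The generator below is an assumption about how such data might be produced, not the benchmark's actual generator: it uses lowercase random values so that the uppercase pattern XYZ appears only where it is planted, and it builds only 30,000 records.

```python
import random

RECORD_SIZE = 100
KEY_SIZE = 10
PATTERN = b"XYZ"


def make_records(n, every=10_000, seed=0):
    """Yield n records: a 10-byte unique key followed by a 90-byte random value."""
    rng = random.Random(seed)
    for i in range(n):
        key = f"{i:010d}".encode()
        # Lowercase letters only, so the uppercase pattern cannot occur by chance.
        value = bytearray(
            rng.choices(b"abcdefghijklmnopqrstuvwxyz", k=RECORD_SIZE - KEY_SIZE)
        )
        if i % every == 0:
            # Plant the pattern once in every `every` records.
            offset = rng.randrange(len(value) - len(PATTERN) + 1)
            value[offset:offset + len(PATTERN)] = PATTERN
        yield key + bytes(value)


def grep(records, pattern=PATTERN):
    """Sequentially scan the value part of every record for the pattern."""
    for record in records:
        if pattern in record[KEY_SIZE:]:
            yield record


matches = list(grep(make_records(30_000)))
print(len(matches))
```

Every record has to be examined because the pattern can sit anywhere in the value, so an index on the key does not help with this task.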

Dataset
  Record = 10 B key + 90 B random value
  5.6 million records = 535 MB/node
  Another set = 1 TB/cluster
Data Loading
  Hadoop: command-line utility
  DBMS-X: LOAD SQL command; administrative command to reorganize data

Grep Task Results
SELECT * FROM Data WHERE field LIKE '%XYZ%';

Select Task Results
SELECT pageurl, pagerank FROM Rankings WHERE pagerank > X;

Join Task

Summary
DBMS-X was 3.2 times faster than Hadoop, and Vertica was 2.3 times faster than DBMS-X
Parallel DBMSs win because of:
  B-tree indices to speed the execution of selection operations
  Novel storage mechanisms (e.g., column orientation)
  Aggressive compression techniques with the ability to operate directly on compressed data
  Sophisticated parallel algorithms for querying large amounts of relational data
Ease of installation and use
Fault tolerance?
Loading data?