MapReduce. MapReduce and SQL Injections. CS 3200 Final Lecture. Introduction. MapReduce. Programming Model. Example



Similar documents
Introduction to Parallel Programming and MapReduce

MapReduce (in the cloud)

Jeffrey D. Ullman slides. MapReduce for data intensive computing

Big Data and Apache Hadoop s MapReduce

Big Data Processing with Google s MapReduce. Alexandru Costan

Web Applications Security: SQL Injection Attack

Parallel Databases. Parallel Architectures. Parallelism Terminology 1/4/2015. Increase performance by performing operations in parallel

MapReduce. from the paper. MapReduce: Simplified Data Processing on Large Clusters (2004)

16.1 MAPREDUCE. For personal use only, not for distribution. 333

Introduction to Hadoop

Big Data With Hadoop

Map Reduce / Hadoop / HDFS

CS246: Mining Massive Datasets Jure Leskovec, Stanford University.

CSE-E5430 Scalable Cloud Computing Lecture 2

Introduction to Hadoop

Comparative analysis of mapreduce job by keeping data constant and varying cluster size technique

CSE 590: Special Topics Course ( Supercomputing ) Lecture 10 ( MapReduce& Hadoop)


Parallel Processing of cluster by Map Reduce

Big Data Storage, Management and challenges. Ahmed Ali-Eldin

ImprovedApproachestoHandleBigdatathroughHadoop

Lecture Data Warehouse Systems

MapReduce: Simplified Data Processing on Large Clusters. Jeff Dean, Sanjay Ghemawat Google, Inc.

Tutorial: Big Data Algorithms and Applications Under Hadoop KUNPENG ZHANG SIDDHARTHA BHATTACHARYYA

MAPREDUCE Programming Model

Data Management in the Cloud MAP/REDUCE. Map/Reduce. Programming model Examples Execution model Criticism Iterative map/reduce

A programming model in Cloud: MapReduce

Hadoop: A Framework for Data- Intensive Distributed Computing. CS561-Spring 2012 WPI, Mohamed Y. Eltabakh

MapReduce Jeffrey Dean and Sanjay Ghemawat. Background context

Cloud Computing at Google. Architecture

Comparison of Different Implementation of Inverted Indexes in Hadoop

MASSIVE DATA PROCESSING (THE GOOGLE WAY ) 27/04/2015. Fundamentals of Distributed Systems. Inside Google circa 2015

From GWS to MapReduce: Google s Cloud Technology in the Early Days

Architectures for Big Data Analytics A database perspective

Lecture 5: GFS & HDFS! Claudia Hauff (Web Information Systems)! ti2736b-ewi@tudelft.nl

GraySort and MinuteSort at Yahoo on Hadoop 0.23

Distributed Text Mining with tm

Hadoop and Map-reduce computing

A computational model for MapReduce job flow

Chapter 11 Map-Reduce, Hadoop, HDFS, Hbase, MongoDB, Apache HIVE, and Related

Introduc)on to the MapReduce Paradigm and Apache Hadoop. Sriram Krishnan

Hadoop Architecture. Part 1

Open source software framework designed for storage and processing of large scale data on clusters of commodity hardware

Fault Tolerance in Hadoop for Work Migration

Parallel Programming Map-Reduce. Needless to Say, We Need Machine Learning for Big Data

Advanced Data Management Technologies

COMP 598 Applied Machine Learning Lecture 21: Parallelization methods for large-scale machine learning! Big Data by the numbers

marlabs driving digital agility WHITEPAPER Big Data and Hadoop

Large-Scale Data Sets Clustering Based on MapReduce and Hadoop

The Performance Characteristics of MapReduce Applications on Scalable Clusters

Hadoop and Map-Reduce. Swati Gore

Analysing Large Web Log Files in a Hadoop Distributed Cluster Environment

Infrastructures for big data

Mining of Massive Datasets Jure Leskovec, Anand Rajaraman, Jeff Ullman Stanford University

ICOM 6005 Database Management Systems Design. Dr. Manuel Rodríguez Martínez Electrical and Computer Engineering Department Lecture 2 August 23, 2001

Introduction to Big Data! with Apache Spark" UC#BERKELEY#

Optimization and analysis of large scale data sorting algorithm based on Hadoop

Department of Computer Science University of Cyprus EPL646 Advanced Topics in Databases. Lecture 15

PLATFORM AND SOFTWARE AS A SERVICE THE MAPREDUCE PROGRAMMING MODEL AND IMPLEMENTATIONS

Data-Intensive Computing with Map-Reduce and Hadoop

Convex Optimization for Big Data: Lecture 2: Frameworks for Big Data Analytics

Introduction to Hadoop. New York Oracle User Group Vikas Sawhney

Hadoop MapReduce and Spark. Giorgio Pedrazzi, CINECA-SCAI School of Data Analytics and Visualisation Milan, 10/06/2015

Big Data Analytics with MapReduce VL Implementierung von Datenbanksystemen 05-Feb-13

A Study on Workload Imbalance Issues in Data Intensive Distributed Computing

Integrating Big Data into the Computing Curricula

A Cost-Benefit Analysis of Indexing Big Data with Map-Reduce

Cloud Computing. Chapter Hadoop

WINDOWS AZURE DATA MANAGEMENT AND BUSINESS ANALYTICS

Logistics. Database Management Systems. Chapter 1. Project. Goals for This Course. Any Questions So Far? What This Course Cannot Do.

Log Mining Based on Hadoop s Map and Reduce Technique

Recap. CSE 486/586 Distributed Systems Data Analytics. Example 1: Scientific Data. Two Questions We ll Answer. Data Analytics. Example 2: Web Data C 1

Systems Engineering II. Pramod Bhatotia TU Dresden dresden.de

DATA MINING WITH HADOOP AND HIVE Introduction to Architecture

MapReduce and Hadoop. Aaron Birkland Cornell Center for Advanced Computing. January 2012

Conjugating data mood and tenses: Simple past, infinite present, fast continuous, simpler imperative, conditional future perfect

MapReduce is a programming model and an associated implementation for processing

Outline. Motivation. Motivation. MapReduce & GraphLab: Programming Models for Large-Scale Parallel/Distributed Computing 2/28/2013

CS 4604: Introduc0on to Database Management Systems. B. Aditya Prakash Lecture #13: NoSQL and MapReduce

Introduction to NoSQL Databases and MapReduce. Tore Risch Information Technology Uppsala University

Implement Hadoop jobs to extract business value from large and varied data sets

Big Data Technology Map-Reduce Motivation: Indexing in Search Engines

Improving MapReduce Performance in Heterogeneous Environments

Spark in Action. Fast Big Data Analytics using Scala. Matei Zaharia. project.org. University of California, Berkeley UC BERKELEY

Where We Are. References. Cloud Computing. Levels of Service. Cloud Computing History. Introduction to Data Management CSE 344

Principles of Software Construction: Objects, Design, and Concurrency. Distributed System Design, Part 2. MapReduce. Charlie Garrod Christian Kästner

Keywords: Big Data, HDFS, Map Reduce, Hadoop

Can the Elephants Handle the NoSQL Onslaught?

Integrated Network Vulnerability Scanning & Penetration Testing SAINTcorporation.com

Outline. High Performance Computing (HPC) Big Data meets HPC. Case Studies: Some facts about Big Data Technologies HPC and Big Data converging

Welcome to the unit of Hadoop Fundamentals on Hadoop architecture. I will begin with a terminology review and then cover the major components

Big Application Execution on Cloud using Hadoop Distributed File System

Transcription:

MapReduce MapReduce and SQL Injections CS 3200 Final Lecture Jeffrey Dean and Sanjay Ghemawat. MapReduce: Simplified Data Processing on Large Clusters. OSDI'04: Sixth Symposium on Operating System Design and Implementation, San Francisco, CA, December, 2004 1 2 Introduction How to write software for a cluster? 1000, 10,000, maybe more machines Failure or crash is not exception, but common phenomenon Parallelize computation Distribute data Balance load Makes implementation of conceptually straightforward computations challenging Create inverted indices Representations of the graph structure of Web documents Number of pages crawled per host Most frequent queries in a given day MapReduce Abstraction to express computation while hiding messy details Inspired by map and reduce primitives in Lisp Apply map to each input record to create set of intermediate key-value pairs Apply reduce to all values that share the same key (like GROUP BY) Automatically parallelized Re-execution as primary mechanism for fault tolerance 3 4 Programming Model Transforms set of input key-value pairs to set of output key-value pairs Map written by user Map: (k1, v1) list (k2, v2) MapReduce library groups all intermediate pairs with same key together Reduce written by user Reduce: (k2, list (v2)) list (v2) Usually zero or one output value per group Intermediate values supplied via iterator (to handle lists that do not fit in memory) Example Count number of occurrences of each word in a document collection: map( String key, String value ): // key: document name // value: document contents for each word w in value: EmitIntermediate( w, "1 ); reduce( String key, Iterator values ): // key: a word // values: a list of counts int result = 0; for each v in values: result += ParseInt( v ); Emit( AsString(result) ); This is almost all the coding needed (need also mapreduce specification object with names of input and output files, and optional tuning parameters) 5 6 1

Implementation Focuses on large clusters Relies on existence of reliable and highly available distributed file system Map invocations Automatically partition input data into M chunks (16-64 MB typically) Chunks processed in parallel Reduce invocations Partition intermediate key space into R pieces, e.g., using hash(key) mod R Master node controls program execution 7 Execution Overview Mappers inform master about file locations Master notifies reducers about intermediate file locations Reducers (i) read all data from mappers, (ii) sort by intermediate key, (iii) perform computation for each group 8 Fault Tolerance Master monitors tasks on mappers and reducers: idle, inprogress, completed Worker failure (common) Master pings workers periodically No response => assumes worker failed Resets worker s map tasks, completed or in progress, to idle state (tasks now available for scheduling on other workers) Completed tasks only on local disk, hence inaccessible Same for reducer s in-progress tasks Completed tasks stored in global file system, hence accessible Reducers notified about change of mapper assignment Master failure (unlikely) Checkpointing or simply abort computation Practical Considerations Conserve network bandwidth ( Locality optimization ) Distributed file system assigns data chunks to local disks Schedule map task on machine that already has a copy of the chunk, or one nearby Choose M and R much larger than number of worker machines Load balancing and faster recovery (many small tasks from failed machine) Limitation: O(M+R) scheduling decisions and O(M*R) in-memory state at master Common choice: M so that chunk size is 16-64 MB, R a small multiple of number of workers Backup tasks to deal with machines that take unusually long for last few tasks For in-progress tasks when MapReduce near completion 9 11 Applicability of MapReduce MapReduce vs. DBMS Machine learning algorithms, clustering Data extraction for reports of popular queries Extraction of page properties, e.g., geographical location Graph computations Google indexing system Sequence of 5-10 MapReduce operations Smaller simpler code (3800 LOC -> 700 LOC) Easier to change code Easier to operate, because MapReduce library takes care of failures Easy to improve performance by adding more machines Map: assume table InputFile with schema (key1, val1) is input; mapfct is a user-defined function that can output a set with schema (key2, val2) SELECT mapfct( key1, val1) AS (key2, val2) // Not really correct SQL FROM InputFile Reduce: assume MapOutput has schema (key2, val2); redfct is a user-defined function SELECT redfct( val2 ) FROM MapOutput GROUP BY key2 13 14 2

Parallel DBMS SQL specifies what to compute, not how to do it Perfect for parallel and distributed implementation Just need an optimizer that can choose best plan in given parallel/distributed system Cost estimate includes disk, CPU, and network cost Recent benchmarks show parallel DBMS can significantly outperform MapReduce But many programmers prefer writing Map and Reduce in familiar PL (C++, Java) Recent trend: High-level PL for writing MapReduce programs with DBMS-inspired operators MapReduce Summary MapReduce = programming model that hides details of parallelization, fault tolerance, locality optimization, and load balancing Simple model, but fits many common problems Implementation on cluster scales to 1000s of machines and more Open source implementation, Hadoop, is available Parallel DBMS, SQL are more powerful than MapReduce and similarly allow automatic parallelization of sequential code Never really achieved mainstream acceptance or broad open-source support like Hadoop Recent trend: simplify coding in MapReduce by using DBMS ideas (Variants of) relational operators, implemented on top of Hadoop 15 16 SQL Injection Getting Started Exploits security vulnerability in database layer of a Web application when user input is not sufficiently checked and sanitized Think DBMS access through Web forms Main idea: pass carefully crafted string as parameter value for an SQL query String executes harmful code Reveals data to unauthorized user Data modification by unauthorized user Deletes entire table The following examples are from unixwiz.net 17 Assume we know nothing about Web application, except that it probably checks user email with query like this: WHERE attribute = $email ; Typical for Web form allowing user login and send password to user s email address $email is email address submitted by user through Web form Try entering name@xyz.com in form: WHERE attribute = name@xyz.com ; 18 First Code Injection Query has incorrect SQL syntax Getting syntax error message indicates that input is sent to server unsanitized Now try injecting additional code : WHERE attribute = anything OR x = x ; Legal query whose WHERE clause is always satisfied Might see response from system like Your login info has been sent to somebody@somewhere.com Enough information to start exploring the actual query structure 19 Guess Names of Attributes Try if email is the right attribute name: WHERE attribute = x AND email IS NULL; -- ; Server error would indicate that attribute name email is probably wrong; if so, try others Valid response (e.g., Address unknown ) indicates that attribute name was correctly guessed Can guess names of other attributes like passwd, login_id, full_name and so on 20 3

Guess Table Name Try if tabname is a valid table name: WHERE attribute = x AND 1 = (SELECT COUNT(*) FROM tabname); -- ; If no server error, found valid table name, e.g., members But is it the name of the table used for the query behind the Web form? Find Table Name for Unknown Query Try query that only works if table members is part of the query: WHERE attribute = x AND members.email IS NULL; -- ; Error like Email address unknown indicates that query was syntactically correct, i.e., members is a table in the FROM clause 21 22 Finding Users Guessing Passwords Look on application s Web pages to find names of people, then find them in the database (recall that full_name was found to be an attribute): WHERE attribute = x OR full_name LIKE %Bob% ; If server returns message like Sent your password to bob@example.com, found some Bob s email in database 23 Try password through same query form (recall that passwd was found to be an attribute): WHERE attribute = bob@example.com AND passwd = pwd123 ; Found password when Your password has been mailed to message appears Tedious guessing procedure, but can be automated with script 24 Deleting a Table Inject a DROP TABLE statement for the table names found earlier: WHERE attribute = x ; DROP TABLE members; -- ; and table members is gone, unless permissions do not allow it to be deleted by Web app. Adding a New Member Inject an INSERT statement like the DROP TABLE statement before Possible problems: Input string length in Web form might be limited Web app might not have insert permission Some attribute names might be unknown still, and might require values in the INSERT Foreign key relationships, CHECKs etc might require other updates before new member tuple can be inserted So, let s try something different 25 26 4

Modify Existing Tuples Replace email address to get password mailed to new address: WHERE attribute = x ; UPDATE members SET email = myemailaddress WHERE email = bob@example.com ; Then use the Email me my password link Now have access to the system as Bob, who probably is important (if his name was mentioned as Web admin etc.) 27 Sanitize form input received from users Only allow characters that could occur in email address (or whatever the form field is for) Escape/quotesafe the input (prevent illegal use of character) Name like O Reilly is legal string O Reilly, but WHERE name = \ ; DROP TABLE members; -- ; should be prevented Difficult, but functions exist for identifying if something is an escape string 28 Use bound parameters (preparedstatement) Limit database permissions for Web app PreparedStatement ps = con.preparedstatement( SELECT email FROM member WHERE name =? ); ps.setstring(1, formfield); ResultSet rs = ps.executequery(); Any code injected into form field will just be part of the name field s value Works similarly if email is input field of stored procedure Isolate the Web server Even if Web server is compromised by SQL injection, make sure it cannot do much harm Properly configure error reporting Do not output developer debugging information on unexpected inputs 29 30 Final Comment From xkcd.com/327 31 5