Big Data Storage, Management and Challenges Ahmed Ali-Eldin

(Ambitious) Plan What is Big Data? And why talk about Big Data? How to store Big Data? BigTables (Google) Dynamo (Amazon) How to process Big Data? Map-Reduce (Hadoop) Under the Hood (distributed systems at work) Paxos for consensus Chubby (Google's implementation) ZooKeeper (Yahoo's implementation, last year) How to transfer Big Data? Kafka (LinkedIn's solution)

What is Big Data? "Big data is data that exceeds the processing capacity of conventional database systems. The data is too big, moves too fast, or doesn't fit the strictures of your database architectures. To gain value from this data, you must choose an alternative way to process it." Edd Dumbill

What is Big Data? It is Big Brother's (Orwell's 1984) enabler to watch the people of Oceania. Every day, humans create 2.5 quintillion (2.5 x 10^18) bytes of data (quote from IBM). 100 hours of video are uploaded to YouTube every minute; all of it needs to be stored, all of it needs to be searched. 350 million new photos are uploaded to Facebook each day. 25 to 30 petabytes of data need to be stored by the LHC. Much, much more is produced per day (at least 1 petabyte per day, and that is filtered data)

Big Data Storage

Quick Review of ACID. Atomicity: if one part of a transaction fails, the entire transaction fails and the database state is left unchanged. Consistency: any transaction brings the database from one valid state to another. Isolation: concurrent execution of transactions results in the same system state that would be obtained if the transactions were executed serially. Durability: once a transaction has been committed, it remains so, even in the event of power loss, crashes, or errors.

CAP Theorem For a distributed system (and database) you can maintain at most two of three properties: Consistency, Availability, Partition tolerance

BASE: A newer data storage paradigm Maintaining ACID is impossible for highly distributed data stores, and you never want to sacrifice availability. BASE: Basically Available, Soft state, Eventual consistency

Great Ideas in Computing Lecture 3: MapReduce

Example: Counting Words Count the most popular terms on the Internet: Sounds easy. Why not this?

CountWords(internet):
    hash_map<string, int> word_count
    for each page in internet:
        for each word in page:
            word_count[word] += 1

Motivation: Large-Scale Data Processing Example: 20+ billion web pages x 20 KB = 400+ terabytes. ~1,000 hard drives just to store the web. One computer can read 30-35 MB/sec from disk: ~four months to read the web. Even more to do something with the data
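A quick back-of-the-envelope check of these figures (a sketch in Python; the page count, page size, and read rate are the slide's numbers, the variable names are ours):

    pages = 20e9          # 20+ billion web pages
    page_size = 20e3      # 20 KB per page
    total_bytes = pages * page_size       # 4e14 bytes = 400 terabytes
    read_rate = 35e6      # one disk reads 30-35 MB/sec; take the upper end
    days = total_bytes / read_rate / 86400
    print(round(days))    # ~132 days, i.e. roughly four months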

Large-Scale Data Processing Problems: 1. time, 2. errors and failures. Solution: distribute the processing over many machines

Indexing Algorithm Input: Billions of documents (html) Distributed across many machines Indexing Pipeline Output: Index of terms: word1 -> document 1, document 2, etc

Link Reversal Input: Billions of documents (html) Distributed across many machines Link Reversal Output: Incoming links for each document: URL1 -> URL2, URL3, URL5, ...

Any Algorithm Input: Billions of? Distributed across many machines? Output:?

Before MapReduce Each team needed to understand distributed computing Extra code Extra complexity...

Algorithm Requirements: Has to be scalable (terabytes of data) Has to be fast Has to be fault-tolerant Has to be easy to implement and flexible The solution: MapReduce: a distributed algorithm

MapReduce: Introduction MapReduce is a programming model and an associated implementation for processing and generating large data sets. Can run on thousands of machines Can process terabytes of information (input / output) It is fault-tolerant It is easy to implement

MapReduce Model Programming model inspired by functional language primitives, e.g., LISP 1. Read a lot of data 2. Map: extract something you care about from each record 3. Shuffle and Sort 4. Reduce: aggregate, summarize, filter, or transform 5. Write the results Outline stays the same, map and reduce change
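The whole outline fits in a few lines of single-process Python. The sketch below only illustrates the model's data flow (map, then shuffle/sort, then reduce); it is not Google's distributed implementation, and run_mapreduce, map_fn, and reduce_fn are our own illustrative names:

    from collections import defaultdict

    def run_mapreduce(map_fn, reduce_fn, inputs):
        # 1-2. Read the input and apply map to every (key, value) record.
        intermediate = []
        for key, value in inputs:
            intermediate.extend(map_fn(key, value))
        # 3. Shuffle and sort: group all intermediate values by key.
        groups = defaultdict(list)
        for out_key, out_value in intermediate:
            groups[out_key].append(out_value)
        # 4-5. Apply reduce to each group and collect the results.
        return {key: reduce_fn(key, values)
                for key, values in sorted(groups.items())}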

MapReduce: Logical Diagram: INPUT -> MAP -> intermediate data -> GROUPING -> REDUCE -> OUTPUT

Programming Model Input & Output: each a set of key/value pairs. The programmer specifies two functions:

map(in_key, in_value) -> list(out_key, intermediate_value)
    Processes an input key/value pair. Produces a set of intermediate pairs.

reduce(out_key, list(intermediate_value)) -> list(out_value)
    Combines all intermediate values for a particular key. Produces a set of merged output values (usually just one).

MapReduce: Logical Diagram: INPUT -> MAP -> intermediate data -> GROUPING -> REDUCE -> OUTPUT

MapReduce: Map Phase First, the MAP function: Programmer supplies this function map (in_key, in_value) -> list(out_key, intermediate_value) Input = any pair (e.g., URL, document) Output = list of pairs (key/value)

MapReduce: Map Example Example: word counting. <URL, document text> -> Output: list(<word, count>). E.g., for one document <URL, "nile university is the best university"> Output -> list(<nile, 1>, <university, 1>, <is, 1>, <the, 1>, <best, 1>, <university, 1>). For another document <URL, "UNIX is simple and elegant"> Output -> list(<unix, 1>, <is, 1>, <simple, 1>, ...)

MapReduce: Map Example Example: word counting. <URL, document text> Output: list(<word, count>). Code:

Map(string input_key, string input_value):
    // key: document URL
    // value: document text
    for each word in input_value:
        OutputIntermediate(word, "1");

MapReduce: Logical Diagram: INPUT -> MAP -> intermediate data -> GROUPING -> REDUCE -> OUTPUT

MapReduce: Grouping The GROUPING (or combining) happens automatically (it is done by the library): put all values for the same key together. For our example: <nile, list(1)> <university, list(1, 1)> <the, list(1)> <is, list(1, 1)> <best, list(1)> ...
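In a single-process sketch, the grouping step is nothing more than a dictionary of lists (shown here with the intermediate pairs from the first example document):

    from collections import defaultdict

    pairs = [("nile", 1), ("university", 1), ("is", 1),
             ("the", 1), ("best", 1), ("university", 1)]
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    print(groups["university"])   # [1, 1]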

MapReduce: Logical Diagram: INPUT -> MAP -> intermediate data -> GROUPING -> REDUCE -> OUTPUT

MapReduce: Reduce Last phase is REDUCE. Takes the output from grouping: reduce(out_key, list(intermediate_value)) -> list(out_value)

MapReduce: Reduce Example Example (word counting): <university, list(1, 1)> -> <university, 2>. Code:

Reduce(string key, List values):
    // key: a word, same for input and output
    // values: a list of counts
    int sum = 0;
    for each v in values:
        sum += v;
    Output(sum);

MapReduce: Word Count Want to count how many times each word occurs. MAP: Input: <DocumentId, Full Text>. For each occurrence of a word: Output <Word, 1>. REDUCE: Input: <Word, list(1, 1, ...)>. Sum over each word: Output <Word, Count>
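Expressed against the run_mapreduce sketch from earlier (illustrative names, not a real library API):

    def wordcount_map(doc_id, text):
        # Emit <word, 1> for every occurrence of a word.
        return [(word, 1) for word in text.split()]

    def wordcount_reduce(word, counts):
        # Sum the 1s emitted for this word.
        return sum(counts)

    docs = [("doc1", "nile university is the best university"),
            ("doc2", "UNIX is simple and elegant")]
    print(run_mapreduce(wordcount_map, wordcount_reduce, docs))
    # {... 'is': 2, ... 'university': 2, ...}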

MapReduce: Grep (search) Want to search terabytes of documents for where a word occurs MAP: Input: <DocumentId, Full Text> For each line where the word occurs: Output: <DocumentId, LineNumber> REDUCE: Input: <DocumentId, list(linenumber)> Do nothing (output == input)
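The same pattern for grep, sketched under the same assumptions; the word being searched for is a parameter we introduce for illustration:

    def grep_map(doc_id, text, word="university"):
        # Emit <DocumentId, LineNumber> for every line containing the word.
        return [(doc_id, line_no)
                for line_no, line in enumerate(text.splitlines(), start=1)
                if word in line]

    def grep_reduce(doc_id, line_numbers):
        # Do nothing: the grouped line numbers already are the answer.
        return line_numbers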

MapReduce: Link Reversal Want to build a reverse-link index. MAP: Input: <source_url, Full Text>. For each outgoing link <a href="destination_url">: Output: <destination_url, source_url>. REDUCE: Input: <destination_url, list(source_url)>. Join the source_urls. Output: for each URL, a list of all other URLs that link to it.
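Sketched the same way (a simple regular expression stands in for real HTML link extraction):

    import re

    def reverse_links_map(source_url, html):
        # Emit <destination_url, source_url> for every outgoing link.
        return [(dest, source_url)
                for dest in re.findall(r'href="([^"]+)"', html)]

    def reverse_links_reduce(destination_url, source_urls):
        # All pages that link to destination_url, joined into one list.
        return sorted(source_urls)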

MapReduce Usage Index construction Web access log stats, e.g., search trends (Google Zeitgeist) Distributed grep Distributed sort Most popular terms on the Internet Document clustering (news, products, etc.) Find all the pages that link to each document (link reversal) Machine learning Statistical machine translation

Why complicate things?

MapReduce: Parallelism Each Map call is independent (parallelize) Run M different Map tasks (each takes a chunk of input) Each Reduce call is independent (parallelize) Run R different Reduce tasks (each produces a chunk of output)
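Even the single-machine sketch can exploit this independence. A hedged example using Python's multiprocessing as a stand-in for M map tasks spread across machines (map_fn must be a top-level, picklable function):

    from multiprocessing import Pool

    def parallel_map_phase(map_fn, inputs, workers=4):
        # Each (key, value) input is mapped independently, so the calls can
        # run in parallel: here across processes, at Google across machines.
        with Pool(workers) as pool:
            chunks = pool.starmap(map_fn, inputs)
        return [pair for chunk in chunks for pair in chunk]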

MapReduce: Parallelism

Framework User provides: the Map function (used in the Map phase), the Reduce function (used in the Reduce phase), and options (number of Map tasks, Reduce tasks, workers; input; output). Framework: executes the map and reduce phases in a distributed fashion, collects/moves data (input -> intermediate -> output), handles faults, and does dynamic load balancing and scheduling

MapReduce: Scheduling One master, many workers. Input data is split into M map tasks (16-64 MB in size); the reduce phase is partitioned into R reduce tasks. Tasks are assigned to workers dynamically. Often: M = 200,000; R = 4,000; workers = 2,000. The master assigns each map task to a free worker, considering locality of data to the worker when assigning the task. The worker reads the task input (often from local disk!) and produces R local files containing intermediate k/v pairs. The master assigns each reduce task to a free worker; the worker reads intermediate k/v pairs from the map workers, sorts them, and applies the user's Reduce op to produce the output

MapReduce: The Master One machine is the Master. It controls the other machines (workers), starting or stopping MAP/REDUCE tasks, keeps track of the progress of each worker, and stores the location of the input and output of each worker. There are MAP workers and REDUCE workers

MapReduce: Parallelism

MapReduce: Map Parallelism (diagram): Inputs #1, #2, #3 are assigned to MAP workers 1, 2, 3; each MAP worker writes MAP output 1 and MAP output 2

MapReduce: Map Parallelism MAP output is split into bins for the REDUCErs by key. E.g., keys starting with a-m go to output 1, keys starting with n-z go to output 2. (Usually <hash(key) modulo (# of reducers)> is used for a more uniform partition.)
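A minimal sketch of such a partition function. We use a stable hash (zlib.crc32) rather than Python's built-in hash(), which is randomized per process and would send the same key to different reducers from different workers:

    import zlib

    def partition(key, num_reducers):
        # Every map worker computes the same bin for a given key, so all
        # values for that key end up at the same reduce worker.
        return zlib.crc32(key.encode()) % num_reducers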

MapReduce: Reduce Parallelism (diagram): each REDUCE worker reads one output partition from every MAP worker, e.g., REDUCE worker 1 reads MAP output 1 from MAP workers 1-3, and REDUCE worker 2 reads MAP output 2

MapReduce: Reduce Parallelism Each REDUCE worker reads one section of input from all MAP workers, performs the GROUPing, performs the reducing, and stores the output

MapReduce: Fault Tolerance More machines = more chance of failure. Probability of one machine failing: p (small). If each machine fails independently, the chance of at least one out of k machines failing is 1 - (1 - p)^k (larger than p). If p = 0.01, the probability of at least one failure: 1 machine 0.01; 2 machines 0.0199; 10 machines 0.096; 100 machines 0.63; 1000 machines > 0.99
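Checking the table with this formula (the 1,000-machine entry is in fact about 0.99996, i.e., failure is near-certain):

    p = 0.01
    for k in (1, 2, 10, 100, 1000):
        print(k, round(1 - (1 - p) ** k, 4))
    # 1 0.01   2 0.0199   10 0.0956   100 0.634   1000 1.0 (0.99996 unrounded)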

MapReduce: Fault Tolerance If one machine fails, the program should not fail If many machines fail, the program should not fail

MapReduce: Fault Tolerance Worker fails (happens often): the Master checks on the workers; if a worker reports a failure or does not respond, the Master restarts its task on a different machine. Bad input: the Master keeps track of bad input and asks workers to skip it. Master fails (happens rarely): can abort the operation, or restart the master on a different machine

MapReduce: Optimizations No REDUCE can start until all MAPs are complete (a bottleneck). Sometimes one machine is really slow (bad RAM, corrupted disk, etc.): the Master notices slow machines, executes the same input on another machine, and uses the output of whichever machine finishes first. Other optimizations: locality optimization, intermediate data compression

Task Granularity and Pipelining Fine-granularity tasks: many more map tasks than machines, e.g., 200,000 map / 5,000 reduce tasks with 2,000 machines. Minimizes time for fault recovery; can pipeline grouping with map execution; better dynamic load balancing

MapReduce: Performance Using 1,800 machines (two 2 GHz Intel Xeons, 4 GB RAM, two 160 GB disks, gigabit Ethernet): MR_Grep scanned 1 terabyte in 100 seconds; MR_Sort sorted 1 terabyte of 100-byte records in 14 minutes

Conclusion The internet is composed of billions of documents. To process that information, we need to use distributed algorithms and run processes in parallel. MapReduce has proven to be a useful abstraction: it greatly simplifies large-scale computations at Google, and it is fun to use: focus on the problem, and let the library deal with the messy details

MapReduce: Beyond Google Hadoop: Open-source implementation of MapReduce MapReduce implementations in C++, Java, Python, Perl, Ruby, Erlang... Most companies that run distributed algorithms have an implementation of MapReduce (Facebook, Nokia, MySpace,...)

For Next Class: Read the Google Labs MapReduce paper (by Dean and Ghemawat)