Open source large scale distributed data management with Google's MapReduce and Bigtable


Open source large scale distributed data management with Google's MapReduce and Bigtable
Ioannis Konstantinou
Email: ikons@cslab.ece.ntua.gr
Web: http://www.cslab.ntua.gr/~ikons
Computing Systems Laboratory, School of Electrical and Computer Engineering, National Technical University of Athens

Big Data
- Facebook: 20 TB/day compressed
- CERN/LHC: 40 TB/day (15 PB/year)
- NYSE: 1 TB/day
- 2009 IDC Digital Universe estimate: 800,000 petabytes, or 0.8 zettabytes
- Moore's-Law-style growth: data doubles every 18 months
- 2020 prediction: 35 zettabytes (44 times more than 2009)
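The "44 times" figure follows directly from the two IDC estimates quoted above, as a quick check shows:

```python
# Sanity check of the growth claim: 2009 vs. predicted 2020 data volume.
zb_2009 = 0.8   # zettabytes in 2009 (IDC Digital Universe)
zb_2020 = 35.0  # zettabytes predicted for 2020

growth = zb_2020 / zb_2009
print(round(growth))  # 44
```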

What is Hadoop?
A distributed framework for large-scale data processing:
- Inspired by Google's architecture: MapReduce and the Google File System
- Can scale to thousands of nodes and petabytes of data
- A top-level Apache project (since 2008)
- Open source, written in Java plus a few shell scripts

Why Hadoop?
- Designed to run on cheap commodity hardware (fault-tolerant hardware is expensive)
- Automatically handles data replication and node failure
- It does the hard work, so you can focus on processing data

When to use Hadoop?
- You have access to lots of commodity hardware
- The processing can be easily parallelized
- You need to process lots of unstructured data (data-intensive applications)
- Batch jobs are acceptable (no need for interactive results)

Architecture
- HDFS: distributed file system (it is hard to store a PB). Based on the Google File System. Fault-tolerant: handles replication, node failure, etc.
- MapReduce: data-aware parallel computation framework (it is even harder to process a PB). Based on Google's MapReduce research paper.

Hadoop Distributed File System 1/3
Master/slave architecture: files are split into one or more blocks, and these blocks are stored in a set of DataNodes.
- One master NameNode: a master server that manages the file system namespace, regulates access to files by clients, and determines the mapping of blocks to DataNodes
- Many DataNodes: serve client read/write requests; create, delete and replicate blocks
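The block-splitting and block-to-DataNode mapping can be illustrated with a toy sketch (plain Python, not HDFS code; the node names, tiny block size, and round-robin placement are simplifications for illustration):

```python
# Toy model of a NameNode's bookkeeping: files are cut into fixed-size
# blocks, and each block is replicated on several DataNodes.
BLOCK_SIZE = 4    # bytes per block here; real HDFS defaults to 64/128 MB
REPLICATION = 3   # HDFS's default replication factor
datanodes = ["dn1", "dn2", "dn3", "dn4"]

def split_into_blocks(data, block_size=BLOCK_SIZE):
    """Cut a file's contents into fixed-size blocks."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]

def place_blocks(blocks, nodes, replication=REPLICATION):
    """Round-robin placement: each block goes to `replication` distinct nodes."""
    namespace = {}
    for idx, _ in enumerate(blocks):
        namespace[idx] = [nodes[(idx + r) % len(nodes)] for r in range(replication)]
    return namespace

blocks = split_into_blocks(b"hello hdfs world")
mapping = place_blocks(blocks, datanodes)
print(len(blocks), mapping[0])  # 4 ['dn1', 'dn2', 'dn3']
```

If any one DataNode fails, every block it held still exists on two other nodes, which is what lets the real NameNode re-replicate transparently.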

Hadoop Distributed File System 2/3

Hadoop Distributed File System 3/3
HDFS is good for storing large amounts of data, but what about:
- Transactional data? (e.g. concurrent reads and writes to the same data)
- Structured data? (e.g. record-oriented views, columns)
- Relational data? (e.g. indexes)
HDFS does not support these features.

What is HBase?
- An open source implementation of Google's Bigtable
- A distributed storage system for structured data
- Scales to petabytes of data across thousands of commodity servers
- Primitive relational and transactional operations (NoSQL)
- Built on top of Hadoop's HDFS: the HMaster co-exists with the NameNode and knows table locations; RegionServers co-exist with DataNodes and are responsible for table regions

Data model - Conceptual view
- A sparse, distributed, multidimensional sorted map
- The map is indexed by row key, column key and timestamp
- Rows are sorted lexicographically by row key
- A column key consists of <column family>:<column label>
- Billions of rows, billions of column labels, hundreds of column families
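The "sparse, multidimensional sorted map" can be modeled in a few lines of Python (an illustrative sketch of the data model, not HBase's API; the row and column values come from the webtable example on the next slide):

```python
# Minimal model of the Bigtable/HBase data model: a map indexed by
# (row key, column key, timestamp), where column key = <family>:<label>.
table = {}  # row key -> {column key -> {timestamp -> value}}

def put(row, family, label, ts, value):
    col = f"{family}:{label}"
    table.setdefault(row, {}).setdefault(col, {})[ts] = value

put("com.cnn.www", "contents", "", 6, "<html>...")
put("com.cnn.www", "anchor", "cnnsi.com", 9, "CNN")

# Rows iterate in lexicographic order of the row key;
# a row is sparse: only columns that actually hold a value exist for it.
print(sorted(table), list(table["com.cnn.www"]))
```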

KeyValue Physical Storage View

KeyValue 1 (column family "contents:"):
  Row Key         Time Stamp   Column "contents:"
  "com.cnn.www"   t6           "<html>..."
                  t5           "<html>..."
                  t3           "<html>..."

KeyValue 2 (column family "anchor:"):
  Row Key         Time Stamp   Column "anchor:"
  "com.cnn.www"   t9           "anchor:cnnsi.com" = "CNN"
                  t8           "anchor:my.look.ca" = "CNN.com"

One KeyValue per column family (contents and anchor). Sparse: only columns with values are included per key.

From KeyValues to HFiles
- An HFile contains many sorted KeyValues
- It has a fixed upper size in MB
- An index is located at the end of the HFile; it is used to quickly locate a single KeyValue
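The point of the trailing index is that a reader can find one KeyValue without scanning the whole file. A toy sketch of the idea (real HFiles index data blocks rather than individual keys; the keys below are made up):

```python
# Sorted KeyValues plus an index of their keys: lookup becomes a
# binary search (O(log n)) instead of a full scan.
import bisect

keyvalues = sorted([("rowB", "v2"), ("rowA", "v1"), ("rowC", "v3")])
index = [kv[0] for kv in keyvalues]  # stands in for the index at the file's end

def lookup(key):
    i = bisect.bisect_left(index, key)
    if i < len(index) and index[i] == key:
        return keyvalues[i][1]
    return None  # key not present in this HFile

print(lookup("rowB"))  # v2
```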

From HFiles to HTables
- Many HFiles make up an HRegion; a region is identified by its start and end key
- When an HRegion gets too large, it is split and two new regions are created
- Many HRegions make up an HTable; the HMaster knows the locations of the HRegions
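The split step can be sketched as follows (a toy model: the threshold, the tuple representation, and splitting at the median key are simplifications, not HBase's actual split policy):

```python
# A region covers the key range [start_key, end_key); when it holds too
# many keys it is cut at a middle key into two new regions.
MAX_KEYS = 4

def maybe_split(region):
    """region = (start_key, end_key, sorted list of keys it holds)."""
    start, end, keys = region
    if len(keys) <= MAX_KEYS:
        return [region]
    mid = keys[len(keys) // 2]  # split point: the median key
    return [(start, mid, [k for k in keys if k < mid]),
            (mid, end, [k for k in keys if k >= mid])]

# "" as start/end key marks the table's first/last region.
regions = maybe_split(("", "", ["a", "b", "c", "d", "e", "f"]))
print([(r[0], r[1]) for r in regions])  # [('', 'd'), ('d', '')]
```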

HBase Architecture
HBase uses HDFS for data access.
(Figure taken from http://www.larsgeorge.com/2009/10/hbase-architecture-101-storage.html)

HBase Operations
Supports basic DBMS operations:
- Put(row_key, column_key, timestamp, value)
- Get(row_key), optionally restricted by column_key, timestamp or value
- Scan(start_row_key, end_row_key)
Limitations:
- No table joins!!!
- No multi-row transactions
- Atomic single-row writes, optionally atomic single-row reads
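The semantics of the three operations can be sketched in Python (a toy in-memory model, not the HBase client API; the `cf:a` column names and row keys are made up):

```python
# Put/Get/Scan over a sorted row-keyed store, mirroring the operations above.
from collections import defaultdict

store = defaultdict(dict)  # row_key -> {(column_key, timestamp): value}

def put(row_key, column_key, timestamp, value):
    # A Put touches exactly one row, which is why writes are atomic per row.
    store[row_key][(column_key, timestamp)] = value

def get(row_key, column_key=None):
    cells = store.get(row_key, {})
    if column_key is None:
        return dict(cells)
    return {k: v for k, v in cells.items() if k[0] == column_key}

def scan(start_row_key, end_row_key):
    # Range scan over lexicographically sorted row keys; no joins, no
    # multi-row transactions -- just this ordered traversal.
    return {r: dict(c) for r, c in sorted(store.items())
            if start_row_key <= r < end_row_key}

put("row1", "cf:a", 1, "x")
put("row2", "cf:a", 1, "y")
print(get("row1"), list(scan("row1", "row2")))
```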

Other NoSQL alternatives
Cassandra, Voldemort, Dynamo, CouchDB, MongoDB, SimpleDB, Hypertable... and many more: check http://en.wikipedia.org/wiki/nosql

MapReduce 1/3
- A programming model
- A software framework for writing applications that rapidly process vast amounts of data in parallel on large clusters of compute nodes
- Utilizes HDFS for input/output: HDFS stores, MapReduce processes

MapReduce 2/3
The problem is separated into two phases, Map and Reduce:
- Map: non-overlapping chunks of input data (<key, value> records) are assigned to separate processes (mappers), each of which emits a set of intermediate <key, value> results
- Reduce: the map results are fed to a (usually smaller) number of processes called reducers, which summarize their input into a smaller set of <key, value> results
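The two phases can be sketched as a toy single-process runner in plain Python (this mimics the programming model, not Hadoop itself; the function names are illustrative):

```python
# Map phase -> shuffle (group intermediate pairs by key) -> Reduce phase.
from collections import defaultdict

def run_mapreduce(inputs, map_fn, reduce_fn):
    intermediate = defaultdict(list)
    # Map: each (key, value) input chunk is processed independently,
    # so on a real cluster these calls run in parallel on many nodes.
    for key, value in inputs:
        for k, v in map_fn(key, value):
            intermediate[k].append(v)  # the shuffle: group values by key
    # Reduce: one call per distinct intermediate key.
    return {k: reduce_fn(k, vs) for k, vs in intermediate.items()}
```

For example, `run_mapreduce([("d", "a b a")], lambda k, v: [(w, 1) for w in v.split()], lambda k, vs: sum(vs))` returns `{"a": 2, "b": 1}`.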

MapReduce 3/3

Example: Word Count 1/3
Count the number of times each word appears in a large set of documents. Possible usage: finding popular URLs in log files.
Work plan:
- Upload the documents to HDFS
- Write a map and a reduce function
- Execute the MapReduce job in Hadoop
- Get the job output from HDFS

Example: Word Count 2/3

map(key, value):
  // key: document name; value: text of document
  for each word w in value:
    emit(w, 1)

reduce(key, values):
  // key: a word; values: an iterator over counts
  result = 0
  for each count v in values:
    result += v
  emit(key, result)
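The pseudocode translates almost line for line into runnable Python (a self-contained single-process simulation, not a Hadoop job; the two tiny documents are taken from the example on the next slide):

```python
# Word count: map emits (word, 1); the shuffle groups the 1s per word;
# reduce sums each group.
from collections import defaultdict

def map_fn(doc_name, text):
    # key: document name; value: text of document
    for word in text.split():
        yield (word, 1)

def reduce_fn(word, counts):
    # key: a word; values: the counts emitted for it
    return (word, sum(counts))

documents = [("d1", "w1 w2 w4"), ("d2", "w1 w2 w3 w4")]

grouped = defaultdict(list)  # the shuffle: group emitted counts by word
for name, text in documents:
    for word, one in map_fn(name, text):
        grouped[word].append(one)

result = dict(reduce_fn(w, cs) for w, cs in grouped.items())
print(result)  # {'w1': 2, 'w2': 2, 'w4': 2, 'w3': 1}
```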

Example: Word Count 3/3
[Figure: word-count data flow with M=3 mappers and R=2 reducers. Ten small documents (d1-d10) over the words w1-w4 are split among the three mappers; each mapper emits partial per-word counts; the shuffle routes all counts for w1 and w2 to one reducer and all counts for w3 and w4 to the other; each reducer sums its inputs into the final per-word totals.]

When should I use it?
Good choice for:
- Indexing log files
- Sorting vast amounts of data
- Image analysis
Bad choice for:
- Computing π to 1,000,000 digits
- Calculating Fibonacci sequences
- A MySQL replacement

Typical problems
- Log and/or clickstream analysis of various kinds
- Marketing analytics
- Machine learning and/or sophisticated data mining
- Image processing
- Processing of XML messages
- Web crawling and/or text processing
- General archiving, including of relational/tabular data, e.g. for compliance

Hadoop MapReduce
Master/slave architecture:
- One JobTracker master: runs together with the NameNode; receives client job requests; schedules and monitors MR jobs; moves computation near the data; performs speculative execution
- Many TaskTrackers: run together with the DataNodes; perform I/O operations with the DataNodes

Use cases 1/3
- Large-scale image conversions: 100 Amazon EC2 instances, 4 TB of raw TIFF data converted to 11 million PDFs in 24 hours for $240
- Internal log processing: reporting, analytics and machine learning on a cluster of 1110 machines, 8800 cores and 12 PB of raw storage; open source contributors (Hive)
- Storing and processing tweets, logs, etc.; open source contributors (hadoop-lzo)

Use cases 2/3
- 100,000 CPUs in 25,000 computers: content/ads optimization, search index, machine learning (e.g. spam filtering); open source contributors (Pig)
- Natural language search (through Powerset): 400 nodes in EC2, storage in S3; open source contributors (!) to HBase
- ElasticMapReduce service: on-demand elastic Hadoop clusters for the cloud

Use cases 3/3
- ETL processing, statistics generation; advanced algorithms for behavioral analysis and targeting
- Used for discovering "People You May Know" and for other apps; 3x30-node clusters with 16 GB RAM and 8 TB storage
- Leading Chinese-language search engine: search log analysis and data mining over 300 TB per week on clusters of 10 to 500 nodes

Questions