Open source large-scale distributed data management with Google's MapReduce and Bigtable Ioannis Konstantinou Email: ikons@cslab.ece.ntua.gr Web: http://www.cslab.ntua.gr/~ikons Computing Systems Laboratory School of Electrical and Computer Engineering National Technical University of Athens
Big Data Facebook: 20TB/day compressed CERN/LHC: 40TB/day (15PB/year) NYSE: 1TB/day 2009 IDC Digital Universe study: 800,000 petabytes, or 0.8 zettabytes Data volume doubles roughly every 18 months (a Moore's-Law-like growth rate) 2020 prediction: 35 zettabytes (44 times the 2009 figure)
What is Hadoop? It's a distributed framework for large-scale data processing: Inspired by Google's architecture: MapReduce and the Google File System Can scale to thousands of nodes and petabytes of data A top-level Apache project (since 2008) Hadoop is open source Written in Java, plus a few shell scripts
Why Hadoop? Hadoop is designed to run on cheap commodity hardware Fault-tolerant hardware is expensive It automatically handles data replication and node failure It does the hard work, so you can focus on processing data
When to use Hadoop? There is access to lots of commodity hardware The processing can be easily parallelized Need to process lots of unstructured data Data intensive applications It is ok to run batch jobs (no need for interactive results)
Architecture HDFS: Distributed file system It is hard to store a PB Based on the Google File System Fault-tolerant: handles replication, node failure, etc. MapReduce: Data-aware parallel computation framework It is even harder to process a PB Based on a research paper by Google
Hadoop Distributed File System 1/3 Master/Slave architecture Files are split into one or more blocks, and these blocks are stored in a set of DataNodes A master NameNode: a server that manages the file system namespace, regulates access to files by clients, and determines the mapping of blocks to DataNodes Many DataNodes: serve client read/write requests and create/delete/replicate blocks
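The block-splitting and block-to-DataNode mapping above can be sketched in miniature. This is a toy model, not the real HDFS API; the block size, replication factor, and round-robin placement are illustrative (real HDFS placement is rack-aware):

```python
# Toy model of HDFS block placement (illustrative only, not the real API).
# A file is split into fixed-size blocks; the "NameNode" assigns each block
# to `replication` distinct "DataNodes" round-robin.

BLOCK_SIZE = 64 * 1024 * 1024  # 64 MB, the classic HDFS default block size

def split_into_blocks(file_size, block_size=BLOCK_SIZE):
    """Return the number of blocks needed for a file of `file_size` bytes."""
    return (file_size + block_size - 1) // block_size

def place_blocks(num_blocks, datanodes, replication=3):
    """Map each block index to `replication` distinct DataNodes (round-robin)."""
    placement = {}
    for b in range(num_blocks):
        placement[b] = [datanodes[(b + r) % len(datanodes)]
                        for r in range(replication)]
    return placement

blocks = split_into_blocks(200 * 1024 * 1024)   # a 200 MB file -> 4 blocks
mapping = place_blocks(blocks, ["dn1", "dn2", "dn3", "dn4"])
```

If one DataNode fails, every block it held still has two other replicas, and the NameNode can schedule re-replication to restore the target factor.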
Hadoop Distributed File System 2/3
Hadoop Distributed File System 3/3 HDFS is good for storing large amounts of data, but what about: Transactional data? (e.g. concurrent reads and writes to the same data) Structured data? (e.g. record oriented views, columns) Relational data? (e.g. indexes) HDFS does not support these features
What is HBase? Open source implementation of Google's BigTable Distributed storage system for structured data Scales to petabytes of data across thousands of commodity servers Primitive relational and transactional ops (NoSQL) Built on top of Hadoop's HDFS HMaster co-exists with the NameNode and knows table locations RegionServers co-exist with DataNodes and are responsible for table regions
Data model - Conceptual view A sparse, distributed, multidimensional sorted map. Map is indexed by row key, column key and timestamp Lexicographic sorting per row key Column key consists of <column family>:<column label> Billions of rows, billions of column labels, hundreds of column families
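The conceptual model above can be shown as a plain map indexed by (row key, column key, timestamp). A minimal sketch, reusing the cnn.com example from the BigTable paper; only cells that actually exist are stored, which is what makes the map sparse:

```python
# Toy sketch of the conceptual data model: a sparse map indexed by
# (row key, "family:label" column key, timestamp). Illustrative only.
table = {}

table[("com.cnn.www", "contents:", 6)] = "<html>..."
table[("com.cnn.www", "anchor:cnnsi.com", 9)] = "CNN"
table[("com.cnn.www", "anchor:my.look.ca", 8)] = "CNN.com"

# Rows are kept lexicographically sorted by row key:
rows = sorted({row for (row, _, _) in table})

# Sparse: a column with no value simply has no entry and costs no space.
missing = ("com.cnn.www", "anchor:example.org", 9) in table  # -> False
```

Note how the timestamp is part of the key, so several versions of the same cell (e.g. "contents:" at t6, t5, t3) coexist and can be read back by version.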
KeyValue Physical Storage View
KeyValue 1, column family "contents:" — Row Key "com.cnn.www": t6 -> "<html>...", t5 -> "<html>...", t3 -> "<html>..."
KeyValue 2, column family "anchor:" — Row Key "com.cnn.www": t9, "anchor:cnnsi.com" -> "CNN"; t8, "anchor:my.look.ca" -> "CNN.com"
One KeyValue per column family (contents and anchor)
Sparse: only columns with values are included per key
From KeyValues to HFiles An HFile contains many sorted KeyValues and has a fixed upper size in MB An index located at the end of the HFile is used to quickly locate a single KeyValue
From HFiles to HTables Many HFiles make up an HRegion A region is identified by its start and end key When an HRegion grows too large, it is split and two new regions are created Many HRegions make up an HTable The HMaster knows the locations of the HRegions
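Because regions partition the sorted row-key space into contiguous ranges, finding the region that serves a key is a binary search over the sorted start keys. A toy sketch (the region boundaries "g" and "p" are made up for illustration):

```python
# Toy model of region lookup: regions cover contiguous [start, end) row-key
# ranges; the region serving a key is found by binary search on start keys.
import bisect

# "" as start = beginning of keyspace; "" as end = end of keyspace.
regions = [("", "g"), ("g", "p"), ("p", "")]
start_keys = [start for start, _ in regions]

def find_region(row_key):
    """Return the (start, end) range of the region holding `row_key`."""
    idx = bisect.bisect_right(start_keys, row_key) - 1
    return regions[idx]

find_region("com.cnn.www")  # -> ("", "g")
```

When a region splits, its (start, end) pair is replaced by two pairs sharing the split key as the first region's end and the second region's start, so lookup logic is unchanged.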
HBase Architecture HBase uses HDFS for data access Taken from http://www.larsgeorge.com/2009/10/hbase-architecture-101-storage.html
HBase Operations Supports basic DBMS operations Put(row_key, column_key, timestamp, value) Get(row_key), optionally restricted by column_key, timestamp, value Scan(start_row_key, end_row_key) No table joins!!! No multi-row transactions Atomic single-row writes Optional atomic single-row reads
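The three operations can be sketched against an in-memory model. This mirrors the slide's signatures, not the real HBase Java client API; note how Scan exploits the lexicographic row ordering to return a contiguous key range:

```python
# Toy in-memory sketch of HBase's three basic operations (illustrative only).
class MiniHBase:
    def __init__(self):
        # row_key -> {column_key: {timestamp: value}}
        self.rows = {}

    def put(self, row_key, column_key, timestamp, value):
        cols = self.rows.setdefault(row_key, {})
        cols.setdefault(column_key, {})[timestamp] = value

    def get(self, row_key, column_key=None):
        row = self.rows.get(row_key, {})
        if column_key is None:
            return row                       # whole row
        versions = row.get(column_key, {})
        if not versions:
            return None
        return versions[max(versions)]       # newest timestamp wins

    def scan(self, start_row_key, end_row_key):
        # Rows are sorted lexicographically; a scan is the range [start, end).
        return [k for k in sorted(self.rows) if start_row_key <= k < end_row_key]

db = MiniHBase()
db.put("row1", "cf:a", 1, "x")
db.put("row2", "cf:a", 1, "y")
db.put("row3", "cf:a", 1, "z")
db.scan("row1", "row3")  # -> ["row1", "row2"]
```

There is deliberately no join or multi-row operation here: each Put touches exactly one row, which is the unit of atomicity.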
Other NoSQL alternatives Cassandra Voldemort Dynamo CouchDB, MongoDB, SimpleDB, Hypertable. And many more: check http://en.wikipedia.org/wiki/nosql
MapReduce 1/3 A programming model A software framework for writing applications that rapidly process vast amounts of data in parallel on large clusters of compute nodes Utilizes HDFS for input/output HDFS stores and MapReduce processes.
MapReduce 2/3 The problem is separated into two phases, Map and Reduce Map: non-overlapping chunks of input data (<key,value> records) are assigned to separate processes (mappers) that emit a set of intermediate <key,value> results Reduce: map results are fed to a (usually smaller) number of processes called reducers that summarize their input into a smaller number of <key,value> results
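The two phases above can be sketched as a single-process driver (no Hadoop, no distribution; just enough to show map -> shuffle/group -> reduce on <key,value> records). The character-counting map/reduce pair at the bottom is a made-up example:

```python
# Minimal single-process sketch of the two MapReduce phases.
from collections import defaultdict

def run_mapreduce(records, map_fn, reduce_fn):
    # Map phase: each input record yields intermediate <key, value> pairs,
    # which the framework groups by key (the "shuffle").
    intermediate = defaultdict(list)
    for key, value in records:
        for out_key, out_value in map_fn(key, value):
            intermediate[out_key].append(out_value)
    # Reduce phase: all values for one key are summarized into one result.
    return {k: reduce_fn(k, vs) for k, vs in intermediate.items()}

# Illustrative use: total characters per file extension.
records = [("a.txt", "hello"), ("b.txt", "world!"), ("c.log", "err")]
out = run_mapreduce(
    records,
    map_fn=lambda name, text: [(name.rsplit(".", 1)[1], len(text))],
    reduce_fn=lambda ext, lengths: sum(lengths),
)
# out == {"txt": 11, "log": 3}
```

In real Hadoop the intermediate pairs are partitioned across many reducer processes on different nodes, but the contract is the same: a reducer sees every value emitted for its keys.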
MapReduce 3/3
Example: Word Count 1/3 Count the number of times each word appears in a large set of documents Possible usage: find popular URLs in log files Work Plan: Upload documents to HDFS Write a map and a reduce function Execute the MapReduce job in Hadoop Get the job output from HDFS
Example: Word Count 2/3 map(key, value): // key: document name; value: text of document for each word w in value: emit(w, 1) reduce(key, values): // key: a word; values: an iterator over counts result = 0 for each count v in values: result += v emit(key, result)
Example: Word Count 3/3 Input, 10 documents: (d1, w1 w2 w4) (d2, w1 w2 w3 w4) (d3, w2 w3 w4) (d4, w1 w2 w3) (d5, w1 w3 w4) (d6, w1 w4 w2 w2) (d7, w4 w2 w1) (d8, w2 w2 w3) (d9, w1 w1 w3 w3) (d10, w2 w1 w4 w3) M=3 mappers, each pre-aggregating its local counts: Mapper 1 (d1-d3) emits (w1,2) (w2,3) (w3,2) (w4,3); Mapper 2 (d4-d7) emits (w1,4) (w2,4) (w3,2) (w4,3); Mapper 3 (d8-d10) emits (w1,3) (w2,3) (w3,4) (w4,1) Shuffle to R=2 reducers: Reducer 1 receives all (w1,*) and (w2,*) pairs, Reducer 2 all (w3,*) and (w4,*) pairs Final output: Reducer 1 emits (w1,9) (w2,10); Reducer 2 emits (w3,8) (w4,7)
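The word-count pseudocode can be run directly over the ten documents, single-process here where Hadoop would distribute the same logic; the totals below come from actually counting the inputs:

```python
# Runnable version of the word-count example over the ten documents.
from collections import defaultdict

docs = {
    "d1": "w1 w2 w4",  "d2": "w1 w2 w3 w4", "d3": "w2 w3 w4",
    "d4": "w1 w2 w3",  "d5": "w1 w3 w4",    "d6": "w1 w4 w2 w2",
    "d7": "w4 w2 w1",  "d8": "w2 w2 w3",    "d9": "w1 w1 w3 w3",
    "d10": "w2 w1 w4 w3",
}

def map_fn(name, text):
    return [(w, 1) for w in text.split()]   # emit(w, 1) per occurrence

def reduce_fn(word, counts):
    return sum(counts)                      # emit(word, sum of counts)

intermediate = defaultdict(list)            # shuffle: group 1s by word
for name, text in docs.items():
    for w, one in map_fn(name, text):
        intermediate[w].append(one)

result = {w: reduce_fn(w, counts) for w, counts in intermediate.items()}
# result == {"w1": 9, "w2": 10, "w3": 8, "w4": 7}
```

A combiner would pre-sum each mapper's local pairs before the shuffle (as the per-mapper counts on the slide show), which changes the traffic but not the final result.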
When should I use it? Good choice for Indexing log files Sorting vast amounts of data Image analysis Bad choice for Figuring π to 1,000,000 digits Calculating Fibonacci sequences MySQL replacement
Typical problems Log and/or clickstream analysis of various kinds Marketing analytics Machine learning and/or sophisticated data mining Image processing Processing of XML messages Web crawling and/or text processing General archiving, including of relational/tabular data, e.g. for compliance
Hadoop MapReduce Master/Slave architecture A JobTracker master: runs together with the NameNode; receives client job requests; schedules and monitors MR jobs; moves computation near the data; performs speculative execution of slow tasks Many TaskTrackers: run together with DataNodes; execute map and reduce tasks, performing I/O against the local DataNodes
Use cases 1/3 Large-scale image conversions: 100 Amazon EC2 instances, 4TB raw TIFF data, 11 million PDFs in 24 hours for $240 Internal log processing: reporting, analytics and machine learning on a cluster of 1110 machines, 8800 cores and 12PB raw storage; open source contributors (Hive) Store and process tweets, logs, etc.; open source contributors (hadoop-lzo)
Use cases 2/3 100,000 CPUs in 25,000 computers: content/ads optimization, search index, machine learning (e.g. spam filtering); open source contributors (Pig) Natural language search (through Powerset): 400 nodes in EC2, storage in S3; open source contributors (!) to HBase ElasticMapReduce service: on-demand elastic Hadoop clusters for the Cloud
Use cases 3/3 ETL processing, statistics generation Advanced algorithms for behavioral analysis and targeting Used for discovering People You May Know, and for other apps: 3x30-node clusters, 16GB RAM and 8TB storage Leading Chinese-language search engine: search log analysis and data mining, 300TB per week, on 10- to 500-node clusters
Questions?