Distributed Computing with Hadoop




Distributed Computing with Hadoop
Stefan Theußl, Albert Weichselbraun
WU Vienna University of Economics and Business, Augasse 2-6, 1090 Vienna
Stefan.Theussl@wu.ac.at, Albert.Weichselbraun@wu.ac.at
15 May 2009

Agenda
Problem & Motivation
The MapReduce Paradigm
Distributed Text Mining in R
Distributed Text Mining in Java
Distributed Text Mining in Python
Implementation Details: Hadoop API, Hadoop Streaming, ewrt
Discussion

Motivation
Main motivation: large-scale data processing
Many tasks produce their output by processing lots of input data
We want to make use of many CPUs
Typically this is not easy (parallelization, synchronization, I/O, debugging, etc.)
Hence the need for an integrated framework

The MapReduce Paradigm

The MapReduce Paradigm
Programming model inspired by functional language primitives
Automatic parallelization and distribution
Fault tolerance
I/O scheduling
Examples: document clustering, web access log analysis, search index construction, ...
Jeffrey Dean and Sanjay Ghemawat. MapReduce: Simplified data processing on large clusters. In OSDI '04, 6th Symposium on Operating Systems Design and Implementation, pages 137-150, 2004.
Hadoop (http://hadoop.apache.org/core/), developed by the Apache project, is an open source implementation of MapReduce.

The MapReduce Paradigm
Figure: Conceptual flow. Distributed data is split into local partitions; a Map task runs on each partition and produces intermediate data; Reduce tasks combine the partial results into an aggregated result.

The MapReduce Paradigm
A MapReduce implementation like Hadoop typically provides a distributed file system (DFS):
Master/worker architecture (Namenode/Datanodes)
Data locality: Map tasks are applied to partitioned data and scheduled so that their input blocks reside on the same machine, letting Datanodes read input at local disk speed
Data replication provides fault tolerance: the application does not need to care whether individual nodes fail

Hadoop Streaming
A utility for creating and running MapReduce jobs with any executable or script as the mapper and/or the reducer:

    $HADOOP_HOME/bin/hadoop jar $HADOOP_HOME/hadoop-streaming.jar \
        -input inputdir -output outputdir \
        -mapper ./mapper -reducer ./reducer
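To make the line-oriented protocol concrete, here is a minimal word-count mapper and reducer in the Hadoop Streaming style. This is a sketch of ours, not from the slides: in a real job each function would read sys.stdin, and the framework sorts the mapper output by key before the reducer sees it.

```python
import sys
from itertools import groupby

def mapper(lines):
    # Emit one tab-separated "word<TAB>1" record per word.
    for line in lines:
        for word in line.split():
            yield "%s\t1" % word

def reducer(lines):
    # Streaming delivers reducer input sorted by key, so records
    # with the same word are consecutive and can be summed via groupby().
    pairs = (line.split("\t") for line in lines)
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        yield "%s\t%d" % (word, sum(int(count) for _, count in group))

if __name__ == "__main__":
    # Run the same script as "mapper" or "reducer" via its first argument.
    step = mapper if sys.argv[1:2] == ["mapper"] else reducer
    sys.stdout.writelines(line + "\n" for line in step(sys.stdin))
```

Locally, the same pipeline can be simulated with `./script mapper < input | sort | ./script reducer`.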

Application: Text Mining in R

Why Distributed Text Mining?
Highly interdisciplinary research field utilizing techniques from computer science, linguistics, and statistics
Vast amounts of textual data available in machine-readable format: scientific articles, abstracts, books, ...; memos, letters, ...; online forums, mailing lists, blogs, ...
Data volumes (corpora) become bigger and bigger
Steady increase of text mining methods (both in academia and in industry) within the last decade
Text mining methods are becoming more complex and hence more compute-intensive
Thus, the demand for computing power steadily increases

Why Distributed Text Mining?
High Performance Computing (HPC) servers are available at a reasonable price
Integrated frameworks for parallel/distributed computing are available (e.g., Hadoop)
Thus, parallel/distributed computing is now easier than ever
Standard data processing software already offers extensions to make use of it

Text Mining in R - the tm Package
Tailored for: plain texts, articles and papers; web documents (XML, SGML, ...); surveys
Methods for: clustering; classification; visualization

Text Mining in R
I. Feinerer. tm: Text Mining Package, 2009. URL http://cran.r-project.org/package=tm. R package version 0.3-3.
I. Feinerer, K. Hornik, and D. Meyer. Text mining infrastructure in R. Journal of Statistical Software, 25(5):1-54, March 2008. ISSN 1548-7660. URL http://www.jstatsoft.org/v25/i05.

Distributed Text Mining in R
Motivation: large data sets; the corpus is typically loaded into memory; operations run on all elements of the corpus (so-called transformations)
Available transformations: stemDoc(), stripWhitespace(), tmTolower(), ...

Distributed Text Mining Strategies in R
Possibilities:
Text mining using tm and MapReduce (via the Hadoop framework)
Text mining using tm and MPI/snow (snow by Luke Tierney, version 0.3-3 on CRAN)

Distributed Text Mining in R
Solution (Hadoop):
Data set copied to the DFS (DistributedCorpus)
Only meta-information about the corpus is kept in memory
Computational operations (Map) run on all elements in parallel
Work horse: tmMap()
Processed documents (revisions) can be retrieved on demand

Distributed Text Mining in R

Distributed Text Mining in R - Listing
Mapper (called by tmMap()):

    hadoop_generate_tm_mapper <- function(script, FUN, ...) {
        writeLines(sprintf('#!/usr/bin/env Rscript
    ## load tm package
    require("tm")
    fun <- %s
    ## read from stdin
    input <- readLines(file("stdin"))
    ## create object of class PlainTextDocument
    doc <- new("PlainTextDocument", .Data = input[-1L],
               DateTimeStamp = Sys.time())
    ## apply function on document
    result <- fun(doc)
    ## write key
    writeLines(input[1L])
    ## write value
    writeLines(Content(result))', FUN), script)
    }

Distributed Text Mining in R
Example: Stemming
Erasing word suffixes to retrieve their radicals
Reduces complexity
Stemmers provided in the packages Rstem (Duncan Temple Lang, version 0.3-0 on Omegahat) and Snowball (Kurt Hornik, version 0.0-3 on CRAN)
Data: Wizard of Oz book series (http://www.gutenberg.org), 20 books, each containing 1529-9452 lines of text
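The idea of suffix stripping can be illustrated in a few lines of Python. This is a toy of our own for illustration only; the actual experiments use the stemmers from Rstem/Snowball, which apply far more careful, Porter-style rules.

```python
# Toy suffix-stripping stemmer: removes a few common English suffixes
# when enough of the word remains. Illustration only, not Porter's algorithm.
SUFFIXES = ("ing", "ed", "s")

def stem(word):
    for suffix in SUFFIXES:
        # Keep at least three characters of the radical.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[:-len(suffix)]
    return word
```

For example, stem("walking") yields "walk", while a too-short word such as "red" is left untouched.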

Distributed Text Mining in R
Workflow:
Start the Hadoop framework
Put files into the DFS
Apply Map functions (e.g., stemming of documents)
Retrieve the output data

Distributed Text Mining in R - Infrastructure
Worker nodes: computers of a PC lab; 8 PCs with an Intel Pentium 4 CPU @ 3.2 GHz and 1 GB of RAM; each PC has > 20 GB reserved for the DFS
Development platform: 8-core Power 6 shared-memory system (no cluster installation yet)
MapReduce framework: Hadoop (implements MapReduce + DFS)
Local R and tm installation
Code for R/Hadoop integration (upcoming package hive)

Benchmark
Figure: Runtime [s] (left panel) and speedup (right panel) as a function of the number of CPUs (1-8).

Lessons Learned
The problem size has to be sufficiently large
Location of texts in the DFS (currently: ID = file path); thus serialization is difficult (how to update text IDs?)
A remote file operation on the DFS takes around 2.5 seconds

Application: Text Mining in Java

Text Mining in Java - Frameworks
GATE - General Architecture for Text Engineering (Cunningham, 2002)
UIMA - Unstructured Information Management Architecture (IBM)
OpenNLP
WEKA (Witten & Frank, 2005)
weblyzard

Text Mining in Java - the weblyzard Framework
Tailored for: web documents (HTML); common text formats (PDF, ODT, DOC, ...)
Annotation components: geographic tagging; keywords and co-occurring terms; named entity tagging; sentiment tagging
Search and indexing: Lucene

Text Mining in Python - Frameworks
Natural Language Toolkit (NLTK)
easy Web Retrieval Toolkit (ewrt)
weblyzard

Text Mining in Python
Annotation components: keywords and co-occurring terms; language detection (short words, trigrams, Bayes); part-of-speech (POS) tagging; sentiment
Data sources:
Alexa web services (web page rating)
Amazon product reviews (sentiment)
DBpedia (ontology data)
Del.icio.us (social bookmarking)
Facebook (in development)
OpenCalais (named entity recognition)
Scarlet (relation discovery)
TripAdvisor (sentiment +)
Wikipedia (disambiguation, synonyms)
WordNet (disambiguation)

Implementation Details

Distributed Processing Approaches
Hadoop API: efficient built-in mechanism for processing arbitrary objects; requires writing specialized code
Hadoop Streaming: programming-language independent and easy to use; but text-only objects?
ewrt: transparent object caching with low overhead; currently Python-specific

Hadoop API - Mapper and Reducer
Mapper and Reducer: implement the Mapper and Reducer interfaces; inherit from MapReduceBase
Objects: implement the Writable interface, based on DataInput/DataOutput (serializable); readFields() initializes the object, write() serializes it

Hadoop API - Mapper and Reducer
Figure: Class diagram. MyMapper and MyReducer extend MapReduceBase; MyMapper implements Mapper with map(K1 key, V1 value, OutputCollector<K2, V2> output, Reporter reporter); MyReducer implements Reducer with reduce(K2 key, Iterator<V2> values, OutputCollector<K3, V3> output, Reporter reporter); both use JobConf and Writable (readFields(DataInput in), write(DataOutput out)). Data flow: {K1 key: V1 value} -> {K2 key: V2 values} -> {K3 key: V3 value}.

Hadoop API - Distributed File System - Listing

    Configuration conf = new Configuration();
    FileSystem hadoopFS = FileSystem.get(conf);

    // Input and output files
    Path inFile = new Path(argv[0]);
    Path outFile = new Path(argv[1]);

    // Check whether a file exists
    if (!hadoopFS.exists(inFile)) {
        printAndExit("Input file not found");
    }

Hadoop API - Distributed File System - Access

    FSDataInputStream in = hadoopFS.open(inFile);
    while ((bytesRead = in.read(buffer)) > 0) {
        process(buffer, bytesRead);
    }
    in.close();

Hadoop Streaming - How to Save Your Objects? (1/4)
Hadoop Streaming imposes no record structure beyond one record per text line
Approaches:
do not use objects at all
enclose meta data in the input lines
straightforward solution: use your programming language's serialization facilities (Java: the Serializable interface; Python: cPickle; R: serialize)

Hadoop Streaming - How to Save Your Objects? (2/4)
Pitfalls: binary serialization formats; serializations spanning more than one line (Python); performance
Suggestions:
Transparent compression (lightweight: .gz; compare Google's Bigtable implementation)
Encode the data stream, for instance with Base64 (compare XML-RPC, SOAP, ...)

Hadoop Streaming - How to Save Your Objects? (3/4)

    from cPickle import dumps

    class TestClass(object):
        def __init__(self, a, b):
            self.a = a
            self.b = b

    ti = TestClass("Tom", "Susan")
    print dumps(ti)

Output (a multi-line pickle string):

    ccopy_reg\n_reconstructor\np1\n(c__main__\nTestClass\np2\nc__builtin__\nobject\np3\nNtRp4\n(dp5\nS'a'\nS'Tom'\np6\nsS'b'\nS'Susan'\np7\nsb.

Hadoop Streaming - How to Save Your Objects? (4/4)

    from base64 import encodestring
    from zlib import compress

    serialize = lambda obj: encodestring(compress(obj))

Table: Serialization overhead

    Object           Payload     Serialized size
    String           8 bytes     33 bytes
    Tuple/List/Set   8 bytes     41/49/86 bytes
    TestClass        8 bytes     150 bytes
    Reuters article  3474 bytes  1699 bytes
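The same pickle-compress-encode pipeline, written as a round trip for present-day Python 3 (our sketch; cPickle and encodestring are Python 2 names):

```python
import base64
import pickle
import zlib

def serialize(obj):
    # pickle -> compress -> Base64 yields a single newline-free text line,
    # safe for Hadoop Streaming's line-oriented records.
    return base64.b64encode(zlib.compress(pickle.dumps(obj))).decode("ascii")

def deserialize(line):
    # Invert the pipeline: decode -> decompress -> unpickle.
    return pickle.loads(zlib.decompress(base64.b64decode(line)))
```

Because b64encode emits no newlines, the serialized object survives the line-based shuffle unchanged.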

ewrt - Transparent Object Caching
Transparent: minimal code modifications (adaptor pattern)
Disk or in-memory cache
Current uses: complex computations; access to remote resources; complex database queries

ewrt - Transparent Object Caching
Original code:

    def getCalaisTags(self, document_id):
        return self.calais.fetch_tags(document_id)

Refined code:

    def __init__(self):
        self.cache = Cache("./cache", cache_nesting_level=2)

    def getCalaisTags(self, document_id):
        return self.cache.fetch(document_id, self.calais.fetch_tags)
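A minimal in-memory version of such a cache shows the adaptor pattern at work (our sketch; ewrt's actual Cache additionally persists objects to nested directories on disk):

```python
class Cache:
    """Minimal in-memory stand-in for a transparent object cache."""

    def __init__(self):
        self._store = {}

    def fetch(self, key, fetch_function):
        # Call the expensive function only on a cache miss;
        # subsequent fetches for the same key are served from the cache.
        if key not in self._store:
            self._store[key] = fetch_function(key)
        return self._store[key]
```

With this interface, replacing a direct call such as self.calais.fetch_tags(document_id) by self.cache.fetch(document_id, self.calais.fetch_tags) requires no further changes to the calling code.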

Conclusion
MapReduce has proven to be a useful abstraction: it greatly simplifies distributed computing and lets developers focus on the problem, while implementations like Hadoop deal with the messy details
There are different approaches to making use of Hadoop's infrastructure; which one fits best is language- and use-case dependent

Thank You for Your Attention!
Stefan Theußl, Department of Statistics and Mathematics, email: Stefan.Theussl@wu.ac.at, URL: http://statmath.wu.ac.at/~theussl
Albert Weichselbraun, Department of Information Systems and Operations, email: Albert.Weichselbraun@wu.ac.at, URL: http://www.ai.wu.ac.at/~aweichse
WU Vienna University of Economics and Business, Augasse 2-6, A-1090 Wien