Distributed Computing with Hadoop

Transcription

1 Distributed Computing with Hadoop Stefan Theußl, Albert Weichselbraun WU Vienna University of Economics and Business Augasse 2 6, 1090 Vienna [email protected] [email protected] 15. May 2009

2 Agenda Problem & Motivation The MapReduce Paradigm Distributed Text Mining in R Distributed Text Mining in Java Distributed Text Mining in Python Implementation Details Hadoop API Hadoop Streaming ewrt Discussion

3 Motivation Main motivation: large scale data processing Many tasks, i.e. we produce output data via processing lots of input data Want to make use of many CPUs Typically this is not easy (parallelization, synchronization, I/O, debugging, etc.) Need for an integrated framework

4 The MapReduce Paradigm

5 The MapReduce Paradigm Programming model inspired by functional language primitives Automatic parallelization and distribution Fault tolerance I/O scheduling Examples: document clustering, web access log analysis, search index construction,... Jeffrey Dean and Sanjay Ghemawat. MapReduce: Simplified data processing on large clusters. In OSDI 04, 6th Symposium on Operating Systems Design and Implementation, pages , Hadoop ( developed by the Apache project is an open source implementation of MapReduce.

6 The MapReduce Paradigm Distributed Data Local Data Local Data Local Data Map Map Map Intermediate Data Partial Result Partial Result Partial Result Reduce Reduce Aggregated Result Figure: Conceptual Flow

7 The MapReduce Paradigm A MapReduce implementation like Hadoop typically provides a distributed file system (DFS): Master/worker architecture (Namenode/Datanodes) Data locality Map tasks are applied to partitioned data Map tasks scheduled so that input blocks are on same machine Datanodes read input at local disk speed Data replication leads to fault tolerance Application does not care whether nodes are OK or not

8 Hadoop Streaming Utility allowing to create and run MapReduce jobs with any executable or script as the mapper and/or the reducer $HADOOP HOME/bin/hadoop jar $HADOOP HOME/hadoop-streaming.jar -input inputdir -output outputdir -mapper./mapper -reducer./reducer

9 Application: Text Mining in R

10 Why Distributed Text Mining? Highly interdisciplinary research field utilizing techniques from computer science, linguistics, and statistics Vast amount of textual data available in machine readable format: scientific articles, abstracts, books,... memos, letters,... online forums, mailing lists, blogs,... Data volumes (corpora) become bigger and bigger Steady increase of text mining methods (both in academia as in industry) within the last decade Text mining methods are becoming more complex and hence computer intensive Thus, demand for computing power steadily increases

11 Why Distributed Text Mining? High Performance Computing (HPC) servers available for a reasonable price Integrated frameworks for parallel/distributed computing available (e.g., Hadoop) Thus, parallel/distributed computing is now easier than ever Standard software for data processing already offer extensions to use this software

12 Text Mining in R tm Package Tailored for Plain texts, articles and papers Web documents (XML, SGML,... ) Surveys Methods for Clustering Classification Visualization

13 Text Mining in R I. Feinerer tm: Text Mining Package, 2009 URL R package version I. Feinerer, K. Hornik, and D. Meyer Text mining infrastructure in R Journal of Statistical Software, 25(5):1 54, March 2008 ISSN URL

14 Distributed Text Mining in R Motivation: Large data sets Corpus typically loaded into memory Operations on all elements of the corpus (so-called transformations) Available transformations: stemdoc(), stripwhitespace(), tmtolower(),...

15 Distributed Text Mining Strategies in R Possibilities: Text mining using tm and MapReduce (via Hadoop framework) Text mining using tm and MPI/snow 1 1 Luke Tierney (version on CRAN)

16 Distributed Text Mining in R Solution (Hadoop): Data set copied to DFS (DistributedCorpus) Only meta information about the corpus in memory Computational operations (Map) on all elements in parallel Work horse tmmap() Processed documents (revisions) can be retrieved on demand

17 Distributed Text Mining in R

18 Distributed Text Mining in R - Listing Mapper (called by tmmap): 1. hadoop _ generate _tm_ mapper <- function ( script, FUN,...){ 2 writelines ( sprintf ( #!/ usr / bin / env Rscript 3 ## load tm package 4 require (" tm ") 5 fun <- %s 6 ## read from stdin 7 input <- readlines ( file (" stdin ") ) 8 ## create object of class PlainTextDocument 9 doc <- new ( " PlainTextDocument ",. Data = input [ -1 L], 10 DateTimeStamp = Sys. time () ) 11 ## apply function on document 12 result <- fun ( doc ) 13 ## write key 14 writelines ( input [1 L] ) 15 ## write value 16 writelines ( Content ( result ) ) 17, FUN ), script ) 18 }

19 Distributed Text Mining in R Example: Stemming Erasing word suffixes to retrieve their radicals Reduces complexity Stemmers provided in packages Rstem 1 and Snowball 2 Data: Wizard of Oz book series ( 20 books, each containing lines of text 1 Duncan Temple Lang (version on Omegahat) 2 Kurt Hornik (version on CRAN)

20 Distributed Text Mining in R Workflow: Start Hadoop framework Put files into DFS Apply map functions (e.g., stemming of document) Retrieve output data

21 Distributed Text Mining in R Infrastructure Computers of PC Lab used as worker nodes 8 PCs with an Intel Pentium GHz and 1 GB of RAM Each PC has > 20 GB reserved for DFS Development platform: 8-core Power 6 shared memory system No cluster installation (yet) MapReduce framework Hadoop Implements MapReduce + DFS Local R and tm installation Code for R/Hadoop integration (upcoming package hive)

22 Benchmark Runtime Speedup Runtime [s] Speedup # CPU # CPU

23 Lessons Learned Problem size has to be sufficiently large Location of texts in DFS (currently: ID = file path) Thus, serialization difficult (how to update text IDs?) Remote file operation on DFS around 2.5 sec.

24 Application: Text Mining in Java

25 Text Mining in Java Frameworks GATE - General Architecture for Text Engineering (Cunningham, 2002) UIMA - Unstructured Information Management Architecture (IBM) OpenNLP WEKA (Witten & Frank, 2005) weblyzard

26 Text Mining in Java weblyzard Framework Tailored for Web documents (HTML) Common Text Formats (PDF, ODT, DOC,... ) Annotation Components: geographic tagging Keywords and co-occurring terms Named entitiy tagging Sentiment tagging Search and Indexing: Lucene

27 Text Mining in Python Frameworks Natural Language Toolkit (NLTK) easy Web Retrieval Toolkit (ewrt) weblyzard

28 Text Mining in Python Annotation Components: Keywords and co-occurring terms Language detection (short words, trigrams, bayes) Part of speech (POS) tagging Sentiment Data Sources Alexa web services (Web page rating) Amazon product reviews (sentiment) DBpedia (ontology data) Del.icio.us (social bookmarking) Facebook (in development) OpenCalais (named entity recognition) Scarlet (relation discovery) TripAdvisor (sentiment +) Wikipedia (disambiguation, synonyms) WordNet (disambiguation)

29 Implementation Details

30 Distributed Processing Approaches Hadoop API efficient build in mechanism for processing arbitrary objects write specialized code Hadoop-streaming programming language independend easy to use text only objects? ewrt transparent caching transparent object caching low overhead currently Python specific

31 Hadoop API - Mapper and Reducer Mapper and Reducer Implement the Mapper, Reducer Interface Inherited from MapReduceBase Objects Implement the Writable interface based on the DataInput, DataOutput Serializable readfields() initializes object write() serializes object

32 Hadoop API - Mapper and Reducer Mapper +map(k1:key,v1:value,outputcollector<k2, V2>:output,Reporter:reporter) Reducer +reduce(k2:key,iterator<v2>:values,outputcollector<k3, V3>:output,Reporter:reporter) MapReduceBase MyMapper JobConf MyReducer use use Writable +readfileds(dataoutput:in) +write(dataoutput:out) {K1 key: V1 value} -> {K2 key: V2 values} -> {K3 key: V3 value}

33 Hadoop API - Distributed File System - Listing 1 Configuration conf = new Configuration (); 2 FileSystem hadoopfs = FileSystem. get ( conf ); 4 // Input and Output files 5 Path infile = new Path ( argv [0]); 6 Path outfile = new Path ( argv [1]); 8 // Check whether a file exists 9 if (! hadoopfs. exists ( infile )) { 10 printandexit (" Input file not found "); 11 }

34 Hadoop API - Distributed File System - Access 2 FSD atainp utstre am in = hadoopfs. open ( infile ); 3 while (( bytesread = in. read ( buffer )) > 0) { 4 process ( buffer, bytesread ); 5 } 6 in. close ();

35 Hadoop Streaming - How to save your Objects? (1/4) Hadoop streaming: no fixed order on a per text line base Approaches do not use objects enclose meta data into the input straight forward solution use your program languages serialization facilities Java: Serializable Interface Python: cpickle R: serialize

36 Hadoop Streaming - How to save your Objects? (2/4) Pitfalls: Binary serialization formats More than one line serializations (Python) Performance Suggestions: Transparent compression (lightweight:.gz; compare: Google s Bigtable implementation) Encode the data stream using for instance Base64 (Compare: XMLRPC, SOAP,...)

37 Hadoop Streaming - How to save your Objects? (3/4) 1 from cpickle import dumps 3 class TestClass ( object ): 4 def init (self,a,b): 5 self.a=a 6 self.b=b 8 ti = TestClass (" Tom ", " Susan ") 9 print dumps (ti) ccopy_reg\nreconstructor\np1\n(c main \ntestclass\np2 c builtin \nobject\np3\nntrp4 (dp5\ns a \ns Tom \np6\nss b \ns Susan \np7\nsb.

38 Hadoop Streaming - How to save your Objects? (4/4) 1 from base64 import encodestring 2 from zlib import compress 4 serialize = lambda obj : encodestring ( compress ( obj )) Object Payload Serialized Size String 8 Bytes 33 Bytes Tuple/List/Set 8 Bytes 41/49/86 Bytes TestClass 8 Bytes 150 Bytes Reuters Article 3474 Bytes 1699 Bytes Table: Serialization Overhead

39 ewrt - Transparent Object Caching transparent minimal code modifications adaptor pattern disk or in memory cache Current use: complex computations access to remote resources complex database queries

40 ewrt - Transparent Object Caching original code: 1 def getcalaistags ( self, document_ id ): 2 return self. calais. fetch_tags ( document_id ) refined code: 1 def init ( self ): 2 self. cache = \ 3 Cache ("./ cache ", cache_nesting_level =2) 5 def getcalaistags ( self, document_ id ): 6 return self. cache. fetch ( \ 7 document_id, self. calais. fetch_tags )

41 Conclusion MapReduce has proven to be a useful abstraction Greatly simplifies distributed computing Developer focus on problem Implementations like Hadoop deal with messy details different approaches to facilitate Hadoop s infrastructure language- and use case dependent

42 Thank You for Your Attention! Stefan Theußl Department of Statistics and Mathematics URL: Albert Weichselbraun Department of Information Systems and Operations URL: WU Vienna University of Economics and Business Augasse 2 6, A-1090 Wien