Pydoop: a Python MapReduce and HDFS API for Hadoop
Simone Leo, CRS4, Pula, CA, Italy ([email protected])
Gianluigi Zanetti, CRS4, Pula, CA, Italy ([email protected])

ABSTRACT

MapReduce has become increasingly popular as a simple and efficient paradigm for large-scale data processing. One of the main reasons for its popularity is the availability of a production-level open source implementation, Hadoop, written in Java. There is considerable interest, however, in tools that enable Python programmers to access the framework, due to the language's high popularity. Here we present a Python package that provides an API for both the MapReduce and the distributed file system sections of Hadoop, and show its advantages with respect to the other available solutions for Hadoop Python programming, Jython and Hadoop Streaming.

Categories and Subject Descriptors: D.3.3 [Programming Languages]: Language Constructs and Features (Modules, packages)

1. INTRODUCTION

In the past few years, MapReduce [16] has become increasingly popular, both commercially and academically, as a simple and efficient paradigm for large-scale data processing. One of the main reasons for its popularity is the availability of a production-level open source implementation, Hadoop [5], which also includes a distributed file system, HDFS, inspired by the Google File System [17]. Hadoop, a top-level Apache project, scales up to thousands of computing nodes and is able to store and process data on the order of petabytes. It is widely used across a large number of organizations [2], most notably Yahoo, which is also the largest contributor [7]. It is also included as a feature by cloud computing environments such as Amazon Web Services [1]. Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page.
To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. HPDC'10, June 20-25, 2010, Chicago, Illinois, USA. Copyright 2010 ACM. Hadoop is fully written in Java, and provides Java APIs to interact with both MapReduce and HDFS. However, programmers are not limited to Java for application writing. The Hadoop Streaming library (included in the Hadoop distribution), for instance, allows the user to provide the map and reduce logic through executable scripts, according to a fixed text protocol. This, in principle, allows an application to be written in any programming language, but with restrictions concerning both the set of available features and the format of the data to be processed. Hadoop also includes a C library to interact with HDFS (interfaced with the Java code through JNI) and a C++ MapReduce API which takes advantage of the Hadoop Pipes package. Pipes splits the application-specific C++ code into a separate process, exchanging serialized objects with the framework via a socket. As a consequence, a C++ application is able to process any type of input data and has access to a large subset of the MapReduce components. Hadoop's standard distribution also includes Python examples that are, however, meant to be run on Jython [13], a Java implementation of the language which has several limitations compared to the official C implementation (CPython). In this work, we focused on building Python APIs for both the MapReduce and the HDFS parts of Hadoop. Python is an extremely popular programming language characterized by very high level data structures and good performance. Its huge standard library, complemented by countless third-party packages, allows it to handle practically any application domain.
Being able to write MapReduce applications in Python therefore constitutes a major advantage for any organization that focuses on rapid application development, especially when there is a substantial amount of internally developed libraries that could be reused. The standard and probably most common way to develop Python Hadoop programs is to either take advantage of Hadoop Streaming or use Jython. In Hadoop Streaming, an executable Python script can act as the mapper or reducer, interacting with the framework through standard input and standard output. The communication protocol is a simple text-based one, with newline characters as record delimiters and tabs as key/value separators. Therefore, it cannot process arbitrary data streams, and the user directly controls only the map and reduce parts (i.e., one cannot write a RecordReader or Partitioner). Jython is a Java implementation of the Python language, which allows a Python programmer to import and use Java packages. This, of course, allows access to the same features available to Java. However, this comes at a cost to the Python programmer: Jython is typically, at any given time, one or more releases older than CPython; it does not implement the full range of standard library modules; and it does not allow the use of modules written as C/C++ extensions. Using existing Python libraries for Hadoop with Jython is therefore only possible if they meet the above restrictions. Moreover, the majority of publicly available third-party packages are not compatible with Jython, most notably the numerical computation libraries (e.g., numpy [10]) that constitute an indispensable complement to Python for scientific programming. Our main goal was to provide access to as many as possible of the Hadoop features available to Java, while allowing compact CPython code development with little impact on performance. Since the Pipes/C++ API seemed to meet these requirements well, we developed a Python package, Pydoop, by wrapping its C++ code with Boost.Python [15]. We also wrapped the C libhdfs code to make HDFS operations available to Python. To evaluate the package in terms of performance, we ran a series of tests, purposely running a very simple application in order to focus on interaction with the Java core framework (any application with a nontrivial amount of computation performed inside the mapper and/or reducer would likely run faster if written in C++, independently of how it interfaces with the framework). We compare Pydoop with the other two solutions for Python application writing in Hadoop, Jython and Hadoop Streaming (with Python scripts), and also with Java and C++, to give a general idea of how much performance loss is to be expected when choosing Python as the programming language for Hadoop. Pydoop is currently being used in production for bioinformatics applications and other purposes [18]. It currently supports Python 2.5 and Python 2.6. The rest of the paper is organized as follows. After discussing related work, we introduce Pydoop's architecture and features; we then present a performance-wise comparison of our implementation with the other ones; finally, we give our conclusions and plans for future work.
2. RELATED WORK

Efforts aimed at easing Python programming on Hadoop include Happy [6] and Dumbo [4]. These frameworks, rather than providing a Java-like Python API for Hadoop, focus on building high-level wrappers that hide all job creation and submission details from the user. As far as MapReduce programming is concerned, they are built upon, respectively, Jython and Hadoop Streaming, and thus suffer from the same limitations. Existing non-Java HDFS APIs use Thrift [14] to make HDFS calls available to other languages [8] (including Python) by instantiating a Thrift server that acts as a gateway to HDFS. In contrast, Pydoop's HDFS module, being built as a wrapper around the C libhdfs code, is specific to Python but does not require a server to communicate with HDFS. Finally, there are several non-Hadoop MapReduce implementations available to write applications in languages other than Java. These include Starfish [12] (Ruby), Octopy [11] (Python) and Disco [3] (Python, framework written in Erlang). Since Hadoop is still the most popular and widely deployed implementation, however, we expect our Python bindings to be useful for the majority of MapReduce programmers.

Figure 1: Hadoop Pipes data flows. Hadoop communicates with user-supplied executables by means of a specialized protocol. Almost all components can be delegated to the user-supplied executables, but, at a minimum, it is necessary to provide the Mapper and Reducer classes.

3. ARCHITECTURE

3.1 MapReduce Jobs in Hadoop

MapReduce [16] is a distributed computing paradigm tailored for the analysis of very large datasets. It operates on an input key/value pair stream that is converted into a set of intermediate key/value pairs by a user-defined map function; the user also provides a reduce function that merges together all intermediate values associated with a given key to yield the final result.
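The paradigm just described can be sketched in a few lines of plain Python, independent of Hadoop: a map function emits intermediate pairs, a shuffle step groups them by key, and a reduce function merges each group. The names below are illustrative, not part of any Hadoop API:

```python
from collections import defaultdict

def map_fn(key, value):
    # emit one (word, 1) pair per word; the input key is ignored
    for word in value.split():
        yield word, 1

def reduce_fn(key, values):
    # merge all intermediate values associated with one key
    yield key, sum(values)

def mapreduce(records, map_fn, reduce_fn):
    groups = defaultdict(list)
    for k, v in records:
        for ik, iv in map_fn(k, v):   # map phase
            groups[ik].append(iv)     # shuffle: group by intermediate key
    out = {}
    for ik in sorted(groups):         # reduce phase, one call per key
        for ok, ov in reduce_fn(ik, groups[ik]):
            out[ok] = ov
    return out

counts = mapreduce([(0, "a b a"), (6, "b a")], map_fn, reduce_fn)
# counts == {"a": 3, "b": 2}
```

In a real Hadoop job the shuffle is performed by the framework across machines; the point here is only the contract of the two user-defined functions.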
In Hadoop, MapReduce is implemented according to a master-slave model: the master, which performs task dispatching and overall job monitoring, is called the Job Tracker; the slaves, which perform the actual work, are called Task Trackers. A client launching a job first creates a JobConf object whose role is to set required (Mapper and Reducer classes, input and output paths) and optional job parameters (e.g., the number of mappers and/or reducers). The JobConf is passed on to the JobClient, which divides input data into InputSplits and sends job data to the Job Tracker. Task Trackers periodically contact the Job Tracker for work and launch tasks in separate Java processes. Feedback is provided in two ways: by incrementing a counter associated with some application-related variable or by sending a status report (an arbitrary text message). The RecordReader class is responsible for reading InputSplits and providing a record-oriented view to the Mapper. Users can optionally write their own RecordReader (the default one yields a stream of text lines). Other key components are the Partitioner, which assigns key/value pairs to reducers, and the RecordWriter, which writes output data to files. Again, these are optionally written by the users (the defaults are, respectively, to partition based on a hash value of the key and to write tab-separated output key/value pairs, one per line).
>>> import os
>>> from pydoop.hdfs import hdfs
>>> fs = hdfs("localhost", 9000)
>>> fs.open_file("f", os.O_WRONLY).write(open("f").read())

Figure 3: A compact HDFS usage example. In this case, a local file is copied to HDFS.

Figure 2: Integration of Pydoop with C++. In Pipes, method calls flow from the framework through the C++ and the Pydoop API, ultimately reaching user-defined classes; Python objects are wrapped by Boost.Python and returned to the framework. In the HDFS wrapper, instead, function calls are initiated by Pydoop.

3.2 Hadoop Pipes

Fig. 1 shows data flows in Hadoop Pipes. Hadoop uses a specialized class of tasks, Pipes tasks, to communicate with user-supplied executables by means of a protocol that uses persistent socket connections to exchange serialized objects. The C++ application provides a factory that is used by the framework to create the various components it needs (Mapper, Reducer, RecordReader, Partitioner...). Almost all Hadoop framework components can be overridden by C++ implementations, but, at a minimum, the factory should provide Mapper and Reducer object creation. Fig. 2 shows the integration of Pydoop with the C++ code. In the Pipes wrapper, method calls flow from the framework through the C++ code and the Pydoop API, ultimately reaching user-defined classes; Python objects resulting from these calls are returned to the framework wrapped in Boost.Python structures. In the HDFS wrapper, the control flow is inverted: function calls are initiated by Pydoop and translated into their C equivalents by the Boost.Python wrapper; resulting objects are wrapped back and presented as Python objects to the application level.

4. FEATURES

Pydoop allows the development of full-fledged MapReduce applications with HDFS access.
Its key features are:

- access to most MapReduce application components: Mapper, Reducer, RecordReader, RecordWriter, Partitioner;
- access to the context object passed by the framework, which allows the application to get JobConf parameters, set counters and report status;
- a programming style similar to that of the Java and C++ APIs: developers define classes that are instantiated by the framework, with methods also called by the framework (compare this to the Streaming approach, where the entire key/value stream must be handled manually);
- a CPython implementation: any Python module can be used, either pure Python or a C/C++ extension (this is not possible with Jython);
- HDFS access from Python.

from pydoop.hdfs import hdfs

MB = float(2**20)

def treewalker(fs, root_info):
    yield root_info
    if root_info["kind"] == "directory":
        for info in fs.list_directory(root_info["name"]):
            for item in treewalker(fs, info):
                yield item

def usage_by_bs(fs, root):
    stats = {}
    root_info = fs.get_path_info(root)
    for info in treewalker(fs, root_info):
        if info["kind"] == "directory":
            continue
        bs = int(info["block_size"])
        size = int(info["size"])
        stats[bs] = stats.get(bs, 0) + size
    return stats

def main(argv):
    fs = hdfs("localhost", 9000)
    root = fs.working_directory()
    for k, v in usage_by_bs(fs, root).iteritems():
        print "%.1f %d" % (k / MB, v)
    fs.close()

Figure 4: A more complex, although somewhat contrived, HDFS example. Here, a directory tree is walked recursively and statistics on file system usage by block size are built.

4.1 A Simple HDFS Example

One of the strengths of Python is interactivity. Fig. 3 shows an almost one-liner that copies a file from the local file system to HDFS. Of course, in this case the Hadoop HDFS shell equivalent would be more compact, but a full API provides more flexibility. The Pydoop HDFS interface is written as a Boost.Python wrapper around the C libhdfs, itself a JNI wrapping of the Java code, so it essentially supports the same array of features.
We also added a few extensions, such as a readline method for HDFS files, in order to provide the Python user with an interface that is reasonably close to that of standard Python file objects. Fig. 4 shows a more detailed, even though somewhat contrived, example: a script that walks through a directory tree and builds statistics on HDFS usage by block size. This is an example of a useful operation that cannot be performed with the Hadoop HDFS shell, yet requires writing only a small amount of Python code.
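Note that the aggregation logic of Fig. 4 does not actually depend on HDFS: it only needs an object exposing `get_path_info` and `list_directory`. It can therefore be exercised against a hypothetical in-memory stand-in (the `FakeFs` class below is ours, not part of Pydoop):

```python
class FakeFs(object):
    """In-memory stand-in implementing only the two methods Fig. 4 uses."""

    def __init__(self, infos):
        # infos maps path -> {"kind", "name", and, for files, "size", "block_size"}
        self.infos = infos

    def get_path_info(self, path):
        return self.infos[path]

    def list_directory(self, path):
        # direct children of `path`
        return [i for p, i in self.infos.items()
                if p != path and p.rsplit("/", 1)[0] == path]

def treewalker(fs, root_info):
    yield root_info
    if root_info["kind"] == "directory":
        for info in fs.list_directory(root_info["name"]):
            for item in treewalker(fs, info):
                yield item

def usage_by_bs(fs, root):
    # tally total file size per block size, as in Fig. 4
    stats = {}
    for info in treewalker(fs, fs.get_path_info(root)):
        if info["kind"] == "directory":
            continue
        stats[info["block_size"]] = stats.get(info["block_size"], 0) + info["size"]
    return stats

infos = {
    "/d": {"kind": "directory", "name": "/d"},
    "/d/a": {"kind": "file", "name": "/d/a", "size": 10, "block_size": 64},
    "/d/b": {"kind": "file", "name": "/d/b", "size": 30, "block_size": 64},
    "/d/c": {"kind": "file", "name": "/d/c", "size": 5, "block_size": 128},
}
stats = usage_by_bs(FakeFs(infos), "/d")
# stats == {64: 40, 128: 5}
```

The same `usage_by_bs` function then works unchanged when handed a real Pydoop `hdfs` instance.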
from pydoop.pipes import Mapper, Reducer, Factory, runTask

class WordCountMapper(Mapper):

    def map(self, context):
        words = context.getInputValue().split()
        for w in words:
            context.emit(w, "1")

class WordCountReducer(Reducer):

    def reduce(self, context):
        s = 0
        while context.nextValue():
            s += int(context.getInputValue())
        context.emit(context.getInputKey(), str(s))

runTask(Factory(WordCountMapper, WordCountReducer))

Figure 5: The simplest implementation of the classic word count example in Pydoop.

4.2 A Simple MapReduce Example

Fig. 5 shows the simplest implementation of the classic word count example in Pydoop. All communication with the framework is handled through the context object. Specifically, through the context, Mapper objects get input key/value pairs (in this case the key, equal to the byte offset within the input file, is not needed) and emit intermediate key/value pairs; reducers get intermediate keys along with their associated sets of values and emit output key/value pairs. Fig. 6 shows how to include counter and status updates in the Mapper and Reducer. Counters are defined by the user: they are usually associated with relevant application parameters (in this case, the mapper counts input words and the reducer counts output words). Status updates are simply arbitrary text messages that the application reports back to the framework. As shown in the code snippet, communication of counter and status updates happens through the context object. Fig. 7 shows how to implement a RecordReader for the word count application. The RecordReader processes the InputSplit, a raw byte chunk from an input file, and divides it into key/value pairs to be fed to the Mapper. In this example, we show a Python reimplementation of Hadoop's default RecordReader, where keys are byte offsets with respect to the whole file and values are text lines: this RecordReader is therefore not specific to word count (although it is the one that the word count Mapper expects).
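The key trick such a line-oriented RecordReader must implement is split-boundary handling: a split usually starts mid-line, so every reader whose split offset is nonzero discards the first, partial line, which is instead consumed by the reader of the previous split. A minimal pure-Python sketch over an in-memory buffer (not the Pydoop API) illustrates why every line is read exactly once:

```python
import io

def read_split(data, offset, length):
    """Yield the full lines belonging to the split [offset, offset + length)."""
    f = io.BytesIO(data)
    f.seek(offset)
    bytes_read = 0
    if offset > 0:
        # partial first line: it belongs to the previous split's reader
        bytes_read += len(f.readline())
    # keep reading while the split still has unconsumed bytes; a line that
    # starts inside the split is read to completion even if it overruns it
    while bytes_read <= length:
        record = f.readline()
        if not record:
            break
        bytes_read += len(record)
        yield record

data = b"one\ntwo\nthree\nfour\n"  # 19 bytes
first = list(read_split(data, 0, 10))    # covers bytes 0-9
second = list(read_split(data, 10, 9))   # covers bytes 10-18
# together the two splits yield every line exactly once
```

The reader of the first split keeps "three\n" because that line starts inside its split; the reader of the second split discards the "ree\n" tail for the same reason.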
Note that we are using the readline method we added to Pydoop HDFS file objects. Finally, Fig. 8 shows how to implement the RecordWriter and Partitioner for the word count application. Again, these are Python versions of the corresponding standard general-purpose Hadoop components, so they are actually not specific to word count. The RecordWriter is responsible for writing key/value pairs to output files: the standard behavior, which is replicated here, is to write one key/value pair per line, separated by a configurable separator that defaults to the tab character. This example also shows how to use the JobConf object to retrieve configuration parameters: in this case we are reading standard Hadoop parameters, but an application is free to define any number of arbitrary options whose values are read from the application's configuration file (exactly as in C++ Pipes applications). The Partitioner decides how to assign keys to reducers: again, in this example we show the standard way of doing this, by means of a hash function of the key itself. The framework passes the total number of reduce tasks to the partition method as the second argument.

class WordCountMapper(Mapper):

    def __init__(self, context):
        super(WordCountMapper, self).__init__(context)
        context.setStatus("initializing")
        self.inputWords = context.getCounter("WC", "INPUT_WORDS")

    def map(self, context):
        k = context.getInputKey()
        words = context.getInputValue().split()
        for w in words:
            context.emit(w, "1")
        context.incrementCounter(self.inputWords, len(words))

class WordCountReducer(Reducer):

    def __init__(self, context):
        super(WordCountReducer, self).__init__(context)
        context.setStatus("initializing")
        self.outputWords = context.getCounter("WC", "OUTPUT_WORDS")

    def reduce(self, context):
        s = 0
        while context.nextValue():
            s += int(context.getInputValue())
        context.emit(context.getInputKey(), str(s))
        context.incrementCounter(self.outputWords, 1)

Figure 6: A word count implementation that includes counters and status updates: through these, the application can report information on its progress to the framework.

class WordCountReader(RecordReader):

    def __init__(self, context):
        super(WordCountReader, self).__init__()
        self.isplit = InputSplit(context.getInputSplit())
        self.host, self.port, self.fpath = split_hdfs_path(
            self.isplit.filename)
        self.fs = hdfs(self.host, self.port)
        self.file = self.fs.open_file(self.fpath, os.O_RDONLY)
        self.file.seek(self.isplit.offset)
        self.bytes_read = 0
        if self.isplit.offset > 0:
            # read by the reader of the previous split
            discarded = self.file.readline()
            self.bytes_read += len(discarded)

    def next(self):
        if self.bytes_read > self.isplit.length:
            return (False, "", "")
        key = struct.pack(">q", self.isplit.offset + self.bytes_read)
        record = self.file.readline()
        if record == "":
            return (False, "", "")
        self.bytes_read += len(record)
        return (True, key, record)

    def getProgress(self):
        return min(float(self.bytes_read) / self.isplit.length, 1.0)

Figure 7: Word count RecordReader example. The RecordReader converts the byte-oriented view of the InputSplit to the record-oriented view needed by the Mapper. Here, we show some code snippets from a plug-in replacement of Hadoop's standard Java LineRecordReader, where keys are byte offsets with respect to the whole file and values (records) are text lines.

class WordCountWriter(RecordWriter):

    def __init__(self, context):
        super(WordCountWriter, self).__init__(context)
        jc = context.getJobConf()
        jc_configure_int(self, jc, "mapred.task.partition", "part")
        jc_configure(self, jc, "mapred.work.output.dir", "outdir")
        jc_configure(self, jc, "mapred.textoutputformat.separator",
                     "sep", "\t")
        outfn = "%s/part-%05d" % (self.outdir, self.part)
        host, port, fpath = split_hdfs_path(outfn)
        self.fs = hdfs(host, port)
        self.file = self.fs.open_file(fpath, os.O_WRONLY)

    def emit(self, key, value):
        self.file.write("%s%s%s\n" % (key, self.sep, value))

class WordCountPartitioner(Partitioner):

    def partition(self, key, numOfReduces):
        reducer_id = (hash(key) & sys.maxint) % numOfReduces
        return reducer_id

Figure 8: RecordWriter and Partitioner examples for word count. The former is responsible for writing key/value pairs to output files, while the latter decides how to assign keys to reducers. Again, these are Python implementations of the corresponding standard components. Note how the JobConf is used to retrieve application parameters.

5. COMPARISON

In the previous sections, we compared Pydoop with the other solutions for Hadoop application writing in terms of characteristics such as convenience, development speed and flexibility. In this section we present a performance comparison, obtained by running the classic word count example in the different implementations. We chose word count for two main reasons: it is a well-known, representative MapReduce example, and it is simple enough to make comparisons between different languages sufficiently fair.
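Since this section compares against Hadoop Streaming, it is worth recalling what a Streaming word count looks like: two stand-alone Python scripts exchanging tab-separated lines over standard input/output, with the framework sorting the mapper output by key between the two phases. The following is a hedged sketch, combined into one file for illustration; in a real Streaming job the two functions would be separate executables passed via -mapper and -reducer:

```python
import sys

def run_mapper(lines):
    # Streaming mapper: one "word\t1" line per input word
    for line in lines:
        for word in line.split():
            yield "%s\t1" % word

def run_reducer(lines):
    # Streaming reducer: input arrives sorted by key, so counting
    # reduces to detecting key changes
    current, count = None, 0
    for line in lines:
        word, n = line.rstrip("\n").split("\t")
        if word != current:
            if current is not None:
                yield "%s\t%d" % (current, count)
            current, count = word, 0
        count += int(n)
    if current is not None:
        yield "%s\t%d" % (current, count)

if __name__ == "__main__":
    # locally, `sorted` plays the role of the framework's shuffle/sort
    for out in run_reducer(sorted(run_mapper(sys.stdin))):
        print(out)
```

Note how much bookkeeping the reducer needs compared to Fig. 5: with Streaming, grouping values by key is the programmer's responsibility.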
We ran our tests on a cluster of 48 identical machines, each one equipped with two dual-core 1.8GHz AMD Opterons, 4GB of RAM and two 160GB hard disks (one dedicated to the operating system and the other to HDFS and MapReduce temporary local directories), connected through Gb Ethernet. To run the word count example, we generated 20GB of random text data by sampling from a list of English words for spell checkers [9]. Specifically, we merged the english-words.* lists from the SCOWL package, including levels 10 through 70, for a total of 144,577 words (each word in the final 20 GB database appeared about 15 thousand times). The data generation itself was developed as a Pydoop map-only MapReduce application. The input stream consisted of N lines of text, each containing a single integer that represents the amount of data to be generated. We made the word list available to all machines via the Hadoop Distributed Cache. Word sampling from the list was uniform, while line lengths followed a Gaussian distribution with a mean of 120 characters and unit variance. Since each mapper processes a subset of the input records, N was set equal to a multiple of the maximum map task capacity (that is, the number of nodes multiplied by the number of concurrent map tasks per node). Due to the relatively low amount of memory available on the test cluster, and to the fact that only one disk was available, we set the maximum number of concurrent tasks (both map and reduce) to two per node, obtaining a total capacity of 96. Consequently, we configured the data generation application to distribute the random text over 96 files, with an HDFS block size of 128 MB. We ran the actual word count application, in all cases, with 192 mappers and 90 reducers. All tests described here were run with Hadoop, Java, Python and Jython (we did not use Jython 2.5 because it no longer supports jythonc).
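The sampling scheme described above (uniform word choice, Gaussian line length) is easy to reproduce outside Hadoop; here is a minimal sketch, where the word list, the helper name and the RNG seed are illustrative stand-ins, not the actual generator used in the paper:

```python
import random

def random_line(words, rng, mean_len=120, sigma=1.0):
    # uniform word sampling; line length targets a Gaussian distribution
    target = rng.gauss(mean_len, sigma)
    line = []
    length = 0
    while length < target:
        w = rng.choice(words)
        line.append(w)
        length += len(w) + 1  # word plus separating space
    return " ".join(line)

words = ["apple", "banana", "cherry", "date"]  # stand-in word list
rng = random.Random(42)  # seeded for reproducibility
lines = [random_line(words, rng) for _ in range(5)]
```

In the actual map-only job, each mapper would loop over calls like this until it has emitted its share of the requested output size.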
Our main goal was to compare Pydoop with the two other main options for writing Python Hadoop applications: Jython and Hadoop Streaming (in the latter case, we used executable Python scripts for the mapper and the reducer). In order to see how Pydoop compares with the two main other-language frameworks, we also compared the Pydoop implementation of word count with the Java and the C++ ones. Since there is no way to add a combiner executable to Streaming (although you can provide a Java one), we ran two separate test sessions: in the first one, we ran the official word count example (with the combiner class set equal to the reducer class) in the Java, C++, Pydoop and Jython versions; in the second one, we ran a word count without a combiner for all implementations included in the previous session, plus Streaming (with the mapper and reducer scripts implemented in Python). Fig. 9 shows results for the first test session (timings averaged over five iterations of the same run): Java is the fastest, C++ is second best, and Pydoop and Jython are the slowest (Pydoop yielded slightly better results, but the two are comparable within the error range). The fact that Java has the best performance is not surprising: this is the standard way of writing Hadoop applications, where everything runs inside the framework. The C++ application, on the other hand, communicates with the Java framework through an external socket-based protocol, adding overhead. Pydoop, being in turn built upon the C++ code, obviously adds more overhead. Moreover, in general, C++ code is expected to be much faster than its Python equivalent, even for simple tasks such as this one. The better integration of Jython with the framework is probably counterbalanced by the fact that it is generally slower than CPython. Fig. 10 shows results for the second test session (again, timings are averaged over five iterations of the same run).
For all implementations except Streaming, performance ranks are similar to those of the previous session: again, Jython's and Pydoop's performances are equal within the error range, this time slightly better for the former. The Hadoop Streaming implementation is considerably slower, mostly because of the plain text serialization of each key/value pair.

class WordCountReducer(Reducer):

    def reduce(self, context):
        context.emit(context.input_key, str(sum(context.itervalues())))

Figure 11: Pydoop word count reducer code written according to the new planned interface.

Figure 9: Total elapsed times, scaled by the average Java elapsed time (238s), for a word count on 20GB of data with Java, C++, Pydoop and Jython implementations. In this case we ran the official example, where the combiner class is set equal to the reducer class. We used 96 CPU cores distributed over 48 computing nodes. Timings are averaged over five iterations of the same run.

6. CONCLUSIONS AND FUTURE WORK

Pydoop is a Python MapReduce and HDFS API for Hadoop that allows object-oriented Java-style MapReduce programming in CPython. It constitutes a valid alternative to the other two main solutions for Python Hadoop programming, Jython and Hadoop Streaming-driven Python scripts. With respect to Jython, Pydoop has the advantage of being a CPython package, which means application writers have access to all Python libraries, either built-in or third-party, including any C/C++ extension module. Performance-wise, there is no significant difference between Pydoop and Jython.
With respect to Streaming, there are several advantages: application writing is done through a consistent API that handles communication with the framework through the context object, while in Streaming key/value passing must be handled manually through the standard input and output of the executable mapper and reducer scripts; almost all MapReduce components are accessible, while in Streaming only the map and reduce functions can be written; the text protocol used by Streaming imposes limits on data types; finally, performance with Streaming is considerably worse. Moreover, Pydoop also provides access to HDFS, allowing nontrivial tasks to be performed with the compactness and speed of development of Python. Although it is already being used internally in production, Pydoop is still under development. One of the most relevant enhancements we plan to add in the near future is a more Pythonic interface to objects and methods, in order to help Python programmers get familiar with it more easily. The features we plan to add include property access to keys and values and Python-style iterators for traversing the set of values associated with a given key. Fig. 11 shows how the word count reducer code will change after the aforementioned features are added.

Figure 10: Total elapsed times, scaled by the average Java elapsed time (338s), for a word count on 20GB of data with Java, C++, Jython, Pydoop and Hadoop Streaming (Python scripts) implementations. In this case we did not set a combiner class. We used 96 CPU cores distributed over 48 computing nodes. Timings are averaged over five iterations of the same run.

7. AVAILABILITY

Pydoop is available at

8. ACKNOWLEDGMENTS

The work described here was partially supported by the Italian Ministry of Research under the CYBERSAR project.

9. REFERENCES

[1] Amazon Elastic MapReduce.
[2] Applications and organizations using Hadoop.
[3] Disco.
[4] Dumbo.
[5] Hadoop.
[6] Hadoop + Python = Happy.
[7] Hadoop Common Credits.
[8] Hadoop Distributed File System (HDFS) APIs in Perl, Python, Ruby and PHP.
[9] Kevin's Word List Page.
[10] NumPy.
[11] Octopy: Easy MapReduce for Python.
[12] Starfish.
[13] The Jython Project.
[14] Thrift.
[15] D. Abrahams and R. Grosse-Kunstleve. Building hybrid systems with Boost.Python. C/C++ Users Journal, 21(7):29-36, 2003.
[16] J. Dean and S. Ghemawat. MapReduce: simplified data processing on large clusters. In OSDI '04: 6th Symposium on Operating Systems Design and Implementation, 2004.
[17] S. Ghemawat, H. Gobioff, and S. Leung. The Google file system. ACM SIGOPS Operating Systems Review, 37(5), 2003.
[18] S. Leo, P. Anedda, M. Gaggero, and G. Zanetti. Using virtual clusters to decouple computation and data management in high throughput analysis applications. In Proceedings of the 18th Euromicro Conference on Parallel, Distributed and Network-based Processing, Pisa, Italy, February 2010.
Chapter 7. Using Hadoop Cluster and MapReduce
Chapter 7 Using Hadoop Cluster and MapReduce Modeling and Prototyping of RMS for QoS Oriented Grid Page 152 7. Using Hadoop Cluster and MapReduce for Big Data Problems The size of the databases used in
Mobile Cloud Computing for Data-Intensive Applications
Mobile Cloud Computing for Data-Intensive Applications Senior Thesis Final Report Vincent Teo, [email protected] Advisor: Professor Priya Narasimhan, [email protected] Abstract The computational and storage
CSE 590: Special Topics Course ( Supercomputing ) Lecture 10 ( MapReduce& Hadoop)
CSE 590: Special Topics Course ( Supercomputing ) Lecture 10 ( MapReduce& Hadoop) Rezaul A. Chowdhury Department of Computer Science SUNY Stony Brook Spring 2016 MapReduce MapReduce is a programming model
Evaluating HDFS I/O Performance on Virtualized Systems
Evaluating HDFS I/O Performance on Virtualized Systems Xin Tang [email protected] University of Wisconsin-Madison Department of Computer Sciences Abstract Hadoop as a Service (HaaS) has received increasing
Cloudera Certified Developer for Apache Hadoop
Cloudera CCD-333 Cloudera Certified Developer for Apache Hadoop Version: 5.6 QUESTION NO: 1 Cloudera CCD-333 Exam What is a SequenceFile? A. A SequenceFile contains a binary encoding of an arbitrary number
Chapter 11 Map-Reduce, Hadoop, HDFS, Hbase, MongoDB, Apache HIVE, and Related
Chapter 11 Map-Reduce, Hadoop, HDFS, Hbase, MongoDB, Apache HIVE, and Related Summary Xiangzhe Li Nowadays, there are more and more data everyday about everything. For instance, here are some of the astonishing
Comparison of Different Implementation of Inverted Indexes in Hadoop
Comparison of Different Implementation of Inverted Indexes in Hadoop Hediyeh Baban, S. Kami Makki, and Stefan Andrei Department of Computer Science Lamar University Beaumont, Texas (hbaban, kami.makki,
CSE-E5430 Scalable Cloud Computing Lecture 2
CSE-E5430 Scalable Cloud Computing Lecture 2 Keijo Heljanko Department of Computer Science School of Science Aalto University [email protected] 14.9-2015 1/36 Google MapReduce A scalable batch processing
MapReduce and Hadoop. Aaron Birkland Cornell Center for Advanced Computing. January 2012
MapReduce and Hadoop Aaron Birkland Cornell Center for Advanced Computing January 2012 Motivation Simple programming model for Big Data Distributed, parallel but hides this Established success at petabyte
A programming model in Cloud: MapReduce
A programming model in Cloud: MapReduce Programming model and implementation developed by Google for processing large data sets Users specify a map function to generate a set of intermediate key/value
A Study on Workload Imbalance Issues in Data Intensive Distributed Computing
A Study on Workload Imbalance Issues in Data Intensive Distributed Computing Sven Groot 1, Kazuo Goda 1, and Masaru Kitsuregawa 1 University of Tokyo, 4-6-1 Komaba, Meguro-ku, Tokyo 153-8505, Japan Abstract.
Introduction to Hadoop
1 What is Hadoop? Introduction to Hadoop We are living in an era where large volumes of data are available and the problem is to extract meaning from the data avalanche. The goal of the software tools
HADOOP PERFORMANCE TUNING
PERFORMANCE TUNING Abstract This paper explains tuning of Hadoop configuration parameters which directly affects Map-Reduce job performance under various conditions, to achieve maximum performance. The
Open source Google-style large scale data analysis with Hadoop
Open source Google-style large scale data analysis with Hadoop Ioannis Konstantinou Email: [email protected] Web: http://www.cslab.ntua.gr/~ikons Computing Systems Laboratory School of Electrical
Hadoop Distributed File System Propagation Adapter for Nimbus
University of Victoria Faculty of Engineering Coop Workterm Report Hadoop Distributed File System Propagation Adapter for Nimbus Department of Physics University of Victoria Victoria, BC Matthew Vliet
Distributed File System. MCSN N. Tonellotto Complements of Distributed Enabling Platforms
Distributed File System 1 How do we get data to the workers? NAS Compute Nodes SAN 2 Distributed File System Don t move data to workers move workers to the data! Store data on the local disks of nodes
Parallel Processing of cluster by Map Reduce
Parallel Processing of cluster by Map Reduce Abstract Madhavi Vaidya, Department of Computer Science Vivekanand College, Chembur, Mumbai [email protected] MapReduce is a parallel programming model
Weekly Report. Hadoop Introduction. submitted By Anurag Sharma. Department of Computer Science and Engineering. Indian Institute of Technology Bombay
Weekly Report Hadoop Introduction submitted By Anurag Sharma Department of Computer Science and Engineering Indian Institute of Technology Bombay Chapter 1 What is Hadoop? Apache Hadoop (High-availability
Apache Hadoop. Alexandru Costan
1 Apache Hadoop Alexandru Costan Big Data Landscape No one-size-fits-all solution: SQL, NoSQL, MapReduce, No standard, except Hadoop 2 Outline What is Hadoop? Who uses it? Architecture HDFS MapReduce Open
Tutorial: Big Data Algorithms and Applications Under Hadoop KUNPENG ZHANG SIDDHARTHA BHATTACHARYYA
Tutorial: Big Data Algorithms and Applications Under Hadoop KUNPENG ZHANG SIDDHARTHA BHATTACHARYYA http://kzhang6.people.uic.edu/tutorial/amcis2014.html August 7, 2014 Schedule I. Introduction to big data
International Journal of Advance Research in Computer Science and Management Studies
Volume 2, Issue 8, August 2014 ISSN: 2321 7782 (Online) International Journal of Advance Research in Computer Science and Management Studies Research Article / Survey Paper / Case Study Available online
Hadoop and Map-reduce computing
Hadoop and Map-reduce computing 1 Introduction This activity contains a great deal of background information and detailed instructions so that you can refer to it later for further activities and homework.
Cloud Computing. Chapter 8. 8.1 Hadoop
Chapter 8 Cloud Computing In cloud computing, the idea is that a large corporation that has many computers could sell time on them, for example to make profitable use of excess capacity. The typical customer
Distributed Image Processing using Hadoop MapReduce framework. Binoy A Fernandez (200950006) Sameer Kumar (200950031)
using Hadoop MapReduce framework Binoy A Fernandez (200950006) Sameer Kumar (200950031) Objective To demonstrate how the hadoop mapreduce framework can be extended to work with image data for distributed
Lecture 3 Hadoop Technical Introduction CSE 490H
Lecture 3 Hadoop Technical Introduction CSE 490H Announcements My office hours: M 2:30 3:30 in CSE 212 Cluster is operational; instructions in assignment 1 heavily rewritten Eclipse plugin is deprecated
Use of Hadoop File System for Nuclear Physics Analyses in STAR
1 Use of Hadoop File System for Nuclear Physics Analyses in STAR EVAN SANGALINE UC DAVIS Motivations 2 Data storage a key component of analysis requirements Transmission and storage across diverse resources
Keywords: Big Data, HDFS, Map Reduce, Hadoop
Volume 5, Issue 7, July 2015 ISSN: 2277 128X International Journal of Advanced Research in Computer Science and Software Engineering Research Paper Available online at: www.ijarcsse.com Configuration Tuning
HDFS. Hadoop Distributed File System
HDFS Kevin Swingler Hadoop Distributed File System File system designed to store VERY large files Streaming data access Running across clusters of commodity hardware Resilient to node failure 1 Large files
Facilitating Consistency Check between Specification and Implementation with MapReduce Framework
Facilitating Consistency Check between Specification and Implementation with MapReduce Framework Shigeru KUSAKABE, Yoichi OMORI, and Keijiro ARAKI Grad. School of Information Science and Electrical Engineering,
16.1 MAPREDUCE. For personal use only, not for distribution. 333
For personal use only, not for distribution. 333 16.1 MAPREDUCE Initially designed by the Google labs and used internally by Google, the MAPREDUCE distributed programming model is now promoted by several
Introduction to Cloud Computing
Introduction to Cloud Computing MapReduce and Hadoop 15 319, spring 2010 17 th Lecture, Mar 16 th Majd F. Sakr Lecture Goals Transition to MapReduce from Functional Programming Understand the origins of
An Oracle White Paper June 2012. High Performance Connectors for Load and Access of Data from Hadoop to Oracle Database
An Oracle White Paper June 2012 High Performance Connectors for Load and Access of Data from Hadoop to Oracle Database Executive Overview... 1 Introduction... 1 Oracle Loader for Hadoop... 2 Oracle Direct
Big Data Storage, Management and challenges. Ahmed Ali-Eldin
Big Data Storage, Management and challenges Ahmed Ali-Eldin (Ambitious) Plan What is Big Data? And Why talk about Big Data? How to store Big Data? BigTables (Google) Dynamo (Amazon) How to process Big
Enhancing Dataset Processing in Hadoop YARN Performance for Big Data Applications
Enhancing Dataset Processing in Hadoop YARN Performance for Big Data Applications Ahmed Abdulhakim Al-Absi, Dae-Ki Kang and Myong-Jong Kim Abstract In Hadoop MapReduce distributed file system, as the input
Department of Computer Science University of Cyprus EPL646 Advanced Topics in Databases. Lecture 14
Department of Computer Science University of Cyprus EPL646 Advanced Topics in Databases Lecture 14 Big Data Management IV: Big-data Infrastructures (Background, IO, From NFS to HFDS) Chapter 14-15: Abideboul
Data-intensive computing systems
Data-intensive computing systems Hadoop Universtity of Verona Computer Science Department Damiano Carra Acknowledgements! Credits Part of the course material is based on slides provided by the following
Hadoop/MapReduce. Object-oriented framework presentation CSCI 5448 Casey McTaggart
Hadoop/MapReduce Object-oriented framework presentation CSCI 5448 Casey McTaggart What is Apache Hadoop? Large scale, open source software framework Yahoo! has been the largest contributor to date Dedicated
Processing Large Amounts of Images on Hadoop with OpenCV
Processing Large Amounts of Images on Hadoop with OpenCV Timofei Epanchintsev 1,2 and Andrey Sozykin 1,2 1 IMM UB RAS, Yekaterinburg, Russia, 2 Ural Federal University, Yekaterinburg, Russia {eti,avs}@imm.uran.ru
Big Data and Apache Hadoop s MapReduce
Big Data and Apache Hadoop s MapReduce Michael Hahsler Computer Science and Engineering Southern Methodist University January 23, 2012 Michael Hahsler (SMU/CSE) Hadoop/MapReduce January 23, 2012 1 / 23
Big Data With Hadoop
With Saurabh Singh [email protected] The Ohio State University February 11, 2016 Overview 1 2 3 Requirements Ecosystem Resilient Distributed Datasets (RDDs) Example Code vs Mapreduce 4 5 Source: [Tutorials
Hadoop 只 支 援 用 Java 開 發 嘛? Is Hadoop only support Java? 總 不 能 全 部 都 重 新 設 計 吧? 如 何 與 舊 系 統 相 容? Can Hadoop work with existing software?
Hadoop 只 支 援 用 Java 開 發 嘛? Is Hadoop only support Java? 總 不 能 全 部 都 重 新 設 計 吧? 如 何 與 舊 系 統 相 容? Can Hadoop work with existing software? 可 以 跟 資 料 庫 結 合 嘛? Can Hadoop work with Databases? 開 發 者 們 有 聽 到
A Cost-Benefit Analysis of Indexing Big Data with Map-Reduce
A Cost-Benefit Analysis of Indexing Big Data with Map-Reduce Dimitrios Siafarikas Argyrios Samourkasidis Avi Arampatzis Department of Electrical and Computer Engineering Democritus University of Thrace
Hadoop IST 734 SS CHUNG
Hadoop IST 734 SS CHUNG Introduction What is Big Data?? Bulk Amount Unstructured Lots of Applications which need to handle huge amount of data (in terms of 500+ TB per day) If a regular machine need to
Lecture 5: GFS & HDFS! Claudia Hauff (Web Information Systems)! [email protected]
Big Data Processing, 2014/15 Lecture 5: GFS & HDFS!! Claudia Hauff (Web Information Systems)! [email protected] 1 Course content Introduction Data streams 1 & 2 The MapReduce paradigm Looking behind
Big Data Storage Options for Hadoop Sam Fineberg, HP Storage
Sam Fineberg, HP Storage SNIA Legal Notice The material contained in this tutorial is copyrighted by the SNIA unless otherwise noted. Member companies and individual members may use this material in presentations
Hadoop Certification (Developer, Administrator HBase & Data Science) CCD-410, CCA-410 and CCB-400 and DS-200
Hadoop Learning Resources 1 Hadoop Certification (Developer, Administrator HBase & Data Science) CCD-410, CCA-410 and CCB-400 and DS-200 Author: Hadoop Learning Resource Hadoop Training in Just $60/3000INR
Open source software framework designed for storage and processing of large scale data on clusters of commodity hardware
Open source software framework designed for storage and processing of large scale data on clusters of commodity hardware Created by Doug Cutting and Mike Carafella in 2005. Cutting named the program after
http://www.wordle.net/
Hadoop & MapReduce http://www.wordle.net/ http://www.wordle.net/ Hadoop is an open-source software framework (or platform) for Reliable + Scalable + Distributed Storage/Computational unit Failures completely
Open source large scale distributed data management with Google s MapReduce and Bigtable
Open source large scale distributed data management with Google s MapReduce and Bigtable Ioannis Konstantinou Email: [email protected] Web: http://www.cslab.ntua.gr/~ikons Computing Systems Laboratory
International Journal of Advancements in Research & Technology, Volume 3, Issue 2, February-2014 10 ISSN 2278-7763
International Journal of Advancements in Research & Technology, Volume 3, Issue 2, February-2014 10 A Discussion on Testing Hadoop Applications Sevuga Perumal Chidambaram ABSTRACT The purpose of analysing
Large-Scale Data Sets Clustering Based on MapReduce and Hadoop
Journal of Computational Information Systems 7: 16 (2011) 5956-5963 Available at http://www.jofcis.com Large-Scale Data Sets Clustering Based on MapReduce and Hadoop Ping ZHOU, Jingsheng LEI, Wenjun YE
A Brief Outline on Bigdata Hadoop
A Brief Outline on Bigdata Hadoop Twinkle Gupta 1, Shruti Dixit 2 RGPV, Department of Computer Science and Engineering, Acropolis Institute of Technology and Research, Indore, India Abstract- Bigdata is
Storage Architectures for Big Data in the Cloud
Storage Architectures for Big Data in the Cloud Sam Fineberg HP Storage CT Office/ May 2013 Overview Introduction What is big data? Big Data I/O Hadoop/HDFS SAN Distributed FS Cloud Summary Research Areas
MapReduce (in the cloud)
MapReduce (in the cloud) How to painlessly process terabytes of data by Irina Gordei MapReduce Presentation Outline What is MapReduce? Example How it works MapReduce in the cloud Conclusion Demo Motivation:
What is Analytic Infrastructure and Why Should You Care?
What is Analytic Infrastructure and Why Should You Care? Robert L Grossman University of Illinois at Chicago and Open Data Group [email protected] ABSTRACT We define analytic infrastructure to be the services,
DATA MINING WITH HADOOP AND HIVE Introduction to Architecture
DATA MINING WITH HADOOP AND HIVE Introduction to Architecture Dr. Wlodek Zadrozny (Most slides come from Prof. Akella s class in 2014) 2015-2025. Reproduction or usage prohibited without permission of
Hadoop Parallel Data Processing
MapReduce and Implementation Hadoop Parallel Data Processing Kai Shen A programming interface (two stage Map and Reduce) and system support such that: the interface is easy to program, and suitable for
Hadoop Architecture. Part 1
Hadoop Architecture Part 1 Node, Rack and Cluster: A node is simply a computer, typically non-enterprise, commodity hardware for nodes that contain data. Consider we have Node 1.Then we can add more nodes,
Reducer Load Balancing and Lazy Initialization in Map Reduce Environment S.Mohanapriya, P.Natesan
Reducer Load Balancing and Lazy Initialization in Map Reduce Environment S.Mohanapriya, P.Natesan Abstract Big Data is revolutionizing 21st-century with increasingly huge amounts of data to store and be
Distributed File Systems
Distributed File Systems Paul Krzyzanowski Rutgers University October 28, 2012 1 Introduction The classic network file systems we examined, NFS, CIFS, AFS, Coda, were designed as client-server applications.
Accelerating and Simplifying Apache
Accelerating and Simplifying Apache Hadoop with Panasas ActiveStor White paper NOvember 2012 1.888.PANASAS www.panasas.com Executive Overview The technology requirements for big data vary significantly
MASSIVE DATA PROCESSING (THE GOOGLE WAY ) 27/04/2015. Fundamentals of Distributed Systems. Inside Google circa 2015
7/04/05 Fundamentals of Distributed Systems CC5- PROCESAMIENTO MASIVO DE DATOS OTOÑO 05 Lecture 4: DFS & MapReduce I Aidan Hogan [email protected] Inside Google circa 997/98 MASSIVE DATA PROCESSING (THE
How To Write A Map Reduce In Hadoop Hadooper 2.5.2.2 (Ahemos)
Processing Data with Map Reduce Allahbaksh Mohammedali Asadullah Infosys Labs, Infosys Technologies 1 Content Map Function Reduce Function Why Hadoop HDFS Map Reduce Hadoop Some Questions 2 What is Map
Big Data and Scripting map/reduce in Hadoop
Big Data and Scripting map/reduce in Hadoop 1, 2, parts of a Hadoop map/reduce implementation core framework provides customization via indivudual map and reduce functions e.g. implementation in mongodb
Big Data Technology Map-Reduce Motivation: Indexing in Search Engines
Big Data Technology Map-Reduce Motivation: Indexing in Search Engines Edward Bortnikov & Ronny Lempel Yahoo Labs, Haifa Indexing in Search Engines Information Retrieval s two main stages: Indexing process
Reduction of Data at Namenode in HDFS using harballing Technique
Reduction of Data at Namenode in HDFS using harballing Technique Vaibhav Gopal Korat, Kumar Swamy Pamu [email protected] [email protected] Abstract HDFS stands for the Hadoop Distributed File System.
Map Reduce & Hadoop Recommended Text:
Big Data Map Reduce & Hadoop Recommended Text:! Large datasets are becoming more common The New York Stock Exchange generates about one terabyte of new trade data per day. Facebook hosts approximately
Big Data Management and NoSQL Databases
NDBI040 Big Data Management and NoSQL Databases Lecture 3. Apache Hadoop Doc. RNDr. Irena Holubova, Ph.D. [email protected] http://www.ksi.mff.cuni.cz/~holubova/ndbi040/ Apache Hadoop Open-source
MapReduce. MapReduce and SQL Injections. CS 3200 Final Lecture. Introduction. MapReduce. Programming Model. Example
MapReduce MapReduce and SQL Injections CS 3200 Final Lecture Jeffrey Dean and Sanjay Ghemawat. MapReduce: Simplified Data Processing on Large Clusters. OSDI'04: Sixth Symposium on Operating System Design
Introduction to Hadoop HDFS and Ecosystems. Slides credits: Cloudera Academic Partners Program & Prof. De Liu, MSBA 6330 Harvesting Big Data
Introduction to Hadoop HDFS and Ecosystems ANSHUL MITTAL Slides credits: Cloudera Academic Partners Program & Prof. De Liu, MSBA 6330 Harvesting Big Data Topics The goal of this presentation is to give
Prepared By : Manoj Kumar Joshi & Vikas Sawhney
Prepared By : Manoj Kumar Joshi & Vikas Sawhney General Agenda Introduction to Hadoop Architecture Acknowledgement Thanks to all the authors who left their selfexplanatory images on the internet. Thanks
The Performance Characteristics of MapReduce Applications on Scalable Clusters
The Performance Characteristics of MapReduce Applications on Scalable Clusters Kenneth Wottrich Denison University Granville, OH 43023 [email protected] ABSTRACT Many cluster owners and operators have
A STUDY ON HADOOP ARCHITECTURE FOR BIG DATA ANALYTICS
A STUDY ON HADOOP ARCHITECTURE FOR BIG DATA ANALYTICS Dr. Ananthi Sheshasayee 1, J V N Lakshmi 2 1 Head Department of Computer Science & Research, Quaid-E-Millath Govt College for Women, Chennai, (India)
Getting to know Apache Hadoop
Getting to know Apache Hadoop Oana Denisa Balalau Télécom ParisTech October 13, 2015 1 / 32 Table of Contents 1 Apache Hadoop 2 The Hadoop Distributed File System(HDFS) 3 Application management in the
R.K.Uskenbayeva 1, А.А. Kuandykov 2, Zh.B.Kalpeyeva 3, D.K.Kozhamzharova 4, N.K.Mukhazhanov 5
Distributed data processing in heterogeneous cloud environments R.K.Uskenbayeva 1, А.А. Kuandykov 2, Zh.B.Kalpeyeva 3, D.K.Kozhamzharova 4, N.K.Mukhazhanov 5 1 [email protected], 2 [email protected],
Hadoop at Yahoo! Owen O Malley Yahoo!, Grid Team [email protected]
Hadoop at Yahoo! Owen O Malley Yahoo!, Grid Team [email protected] Who Am I? Yahoo! Architect on Hadoop Map/Reduce Design, review, and implement features in Hadoop Working on Hadoop full time since Feb
What We Can Do in the Cloud (2) -Tutorial for Cloud Computing Course- Mikael Fernandus Simalango WISE Research Lab Ajou University, South Korea
What We Can Do in the Cloud (2) -Tutorial for Cloud Computing Course- Mikael Fernandus Simalango WISE Research Lab Ajou University, South Korea Overview Riding Google App Engine Taming Hadoop Summary Riding
Hadoop and Map-Reduce. Swati Gore
Hadoop and Map-Reduce Swati Gore Contents Why Hadoop? Hadoop Overview Hadoop Architecture Working Description Fault Tolerance Limitations Why Map-Reduce not MPI Distributed sort Why Hadoop? Existing Data
PLATFORM AND SOFTWARE AS A SERVICE THE MAPREDUCE PROGRAMMING MODEL AND IMPLEMENTATIONS
PLATFORM AND SOFTWARE AS A SERVICE THE MAPREDUCE PROGRAMMING MODEL AND IMPLEMENTATIONS By HAI JIN, SHADI IBRAHIM, LI QI, HAIJUN CAO, SONG WU and XUANHUA SHI Prepared by: Dr. Faramarz Safi Islamic Azad
