PARALLEL IMAGE DATABASE PROCESSING WITH MAPREDUCE AND PERFORMANCE EVALUATION IN PSEUDO DISTRIBUTED MODE
International Journal of Electronic Commerce Studies Vol.3, No.2, pp , 2012 doi: /ijecs.1092

PARALLEL IMAGE DATABASE PROCESSING WITH MAPREDUCE AND PERFORMANCE EVALUATION IN PSEUDO DISTRIBUTED MODE

Muneto Yamamoto, Kyushu University, Motooka 744, Nishi-Ku, Fukuoka-Shi, Japan
Kunihiko Kaneko, Kyushu University, Motooka 744, Nishi-Ku, Fukuoka-Shi, Japan

ABSTRACT
With recent improvements in camera performance and the spread of low-priced, lightweight video cameras, large amounts of video data are generated and stored in database form. At the same time, there are limits to how far the performance of a single computer can be improved to handle large-scale workloads such as video analysis. An important research topic, therefore, is how to perform parallel distributed processing of a video database using the computational resources of a cloud environment. At present, Apache Hadoop, an open-source platform for cloud computing that implements MapReduce [1], is freely available. In the present study, we report our results on an evaluation of performance, which remains a problem for video processing in distributed environments, and on parallel experiments using MapReduce on Hadoop [2].

Keywords: Hadoop, MapReduce, Image Processing, Sequential Image Database

1. INTRODUCTION
With the spread of cloud computing and of network techniques and equipment in recent years, a large amount of data collected in a variety of social situations has been accumulated, and so the need for analysis
techniques to take advantage of useful information that can be extracted from such data sets is increasing. This is also true for video data, in which images are sequential and each frame carries associated time and frame information. Today, because video cameras are set up to perform surveillance of moving objects such as pedestrians and vehicles, a large amount of video data is generated and stored in database form. To improve these video database systems, which hold a large amount of video and related data, parallel processing using CPUs and disks at multiple sites is an important area of research. Also, when considering operations such as search and other types of analysis of video images recorded by video camera and stored in a database, there are limits to how far the performance of a single computer can be improved to process large-scale information. Therefore, the advantages of parallel distributed processing of a video database using the computational resources of a cloud computing environment should be considered. In addition, if computational resources can be secured easily and relatively inexpensively, then cloud computing is suitable for handling large video databases at low cost. Hadoop, as a mechanism for processing large databases by parallel and distributed computing, has been recognized as promising. Nowadays, for reasons such as ease of programming, open-source cloud-based systems that use MapReduce on Hadoop to process data across multiple machines in a distributed environment have been studied for their application to various database operations. In fact, Hadoop is in use all over the world [3].
Studies using Hadoop have treated one file as a text data file, or multiple files as a single file unit, for tasks such as the analysis of large volumes of DNA sequence data, the conversion of large numbers of still images to PDF format, and feature selection/extraction in astronomy [4]. These examples demonstrate the usefulness of the system, which stems from its ability to run multiple processes in parallel with load balancing and task management. The rest of this paper is organized as follows. Section 2 provides the background knowledge related to this work, including an overview of the MapReduce programming model and why MapReduce is deployed in a distributed environment. In Section 3, we discuss the architecture of our system and the problems that arise when processing video databases using MapReduce on Hadoop in a distributed environment. We then give an overview of our experimental methodology and present the results of our experiments in Section 4. Finally, we conclude the paper and propose future work in Section 5.
2. HADOOP STREAMING FOR DATABASE PROCESSING

2.1 Parallel and Distributed Processing on Hadoop
Hadoop consists of two components, the Hadoop Distributed File System (HDFS) and MapReduce, which perform distributed processing with a single master server and multiple slave servers. MapReduce has two elements, JobTracker and TaskTracker, and HDFS has two elements, NameNode and DataNode. Figure 1 shows the configuration of these elements of MapReduce and HDFS on Hadoop. There is also a mechanism that checkpoints the metadata of NameNode.

(A) JobTracker
JobTracker manages cluster resources, job scheduling, and job monitoring.

Figure 1. Structure with elements of MapReduce and HDFS

(B) TaskTracker
TaskTracker is a slave-node daemon in the cluster that accepts tasks from JobTracker and returns the results after executing them.

(C) NameNode
An HDFS cluster consists of a single NameNode, a master server that manages the file system namespace and regulates access to files by clients. NameNode executes file system namespace operations, such as opening, closing, and renaming files and directories. It also determines the mapping of blocks to DataNodes.
(D) DataNode
The cluster also has a number of DataNodes, usually one per node in the cluster. DataNodes manage the storage attached to the nodes on which they run. DataNodes also perform block creation, deletion, and replication in response to instructions from NameNode.

(E) SecondaryNameNode
SecondaryNameNode is a helper to the primary NameNode and is responsible for periodic checkpoints of the HDFS metadata.

2.2 Hadoop Distributed File System (HDFS)
HDFS is designed to reliably store very large files across machines in a large cluster. It is inspired by the Google File System. HDFS is composed of NameNode and DataNode. HDFS stores each file as a sequence of blocks (currently 64 MB by default), with all blocks in a file the same size except for the last block. Blocks belonging to a file are replicated for fault tolerance. The block size and replication factor are configurable per file. Files in HDFS are write-once and can have only one writer at any given time.

2.3 MapReduce
MapReduce (implemented on Hadoop) is a framework for parallel distributed processing of large volumes of data. When programming with MapReduce, parallel distributed processing can be performed by writing programs for the following three steps: Map, Shuffle, and Reduce. Figure 2 shows an example of the flow when the Map and Reduce processes are performed. MapReduce automatically performs inter-process communication between the Map and Reduce processes and maintains load balancing among them.
Figure 2. Processes performing the map and reduce phases

1. Map concept of data processing
The Map function takes a key-value pair <K, V> as the input and generates one or more pairs <K', V'> as the intermediate output.

2. Shuffle concept of data processing
After the Map phase produces the intermediate key-value pairs, they are efficiently and automatically grouped by key by the Hadoop system in preparation for the Reduce phase.

3. Reduce concept of data processing
The Reduce function takes as the input a <K', LIST V'> pair, where LIST V' is a list of all V' values that are associated with a given key K'. The Reduce function produces an additional key-value pair as the output.

By combining multiple Map and Reduce processes, we can accomplish complex tasks which cannot be done via a single Map and Reduce execution. Figures 3 and 4 respectively show the key-value data model and a Wordcount example of MapReduce.

    map:     <key, value>           -> list<key', value'>
    shuffle: list<key', value'>     -> {<key', list(value')>}
    reduce:  {<key', list(value')>} -> list(value')

Figure 3. Key-value data model of MapReduce
Figure 4. Wordcount example of MapReduce

2.4 Hadoop Streaming
Hadoop is an open-source implementation of the MapReduce platform and distributed file system. The Hadoop system is written in Java. Hadoop Streaming is a utility that comes with the Hadoop distribution and that can be used to invoke streaming programs that are not written in Java (for example, Ruby, Perl, Python, PHP, R, or C++). Using this utility, we can execute database programs written in the Ruby programming language. The utility also allows the user to create and run Map and Reduce jobs with any executable programs or scripts as the mapper and the reducer. Figure 5 shows an example of executing a program with Hadoop Streaming and a description of the Map and Reduce functions in the Ruby programming language. Key-value pairs can be specified depending on the input and output formats.

    ./bin/hadoop jar contrib/hadoop-streaming.jar
      -input myinputdirs        [HDFS path]
      -output myoutputdir       [HDFS path]
      -mapper map.rb            [map program file path]
      -reducer reduce.rb        [reduce program file path]
      -file filename            [local file system path]
      -cacheFile filenameuri    [URI to a file already uploaded to HDFS]
      -inputformat JavaClassName  [input format should return key/value pairs of Text class]
      -outputformat JavaClassName [output format should return key/value pairs of Text class]

    //map.rb
    def map(key, value, output, reporter)
      # Mapper code
    end

    //reduce.rb
    def reduce(key, values, output, reporter)
      # Reducer code
    end

Figure 5. Usage example of the options for Hadoop Streaming using the Ruby programming language on Hadoop
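The Wordcount flow of Figure 4 can be sketched in Ruby as plain functions over key-value pairs (a minimal illustration, not the paper's code; in a real Streaming job the Shuffle step is performed by Hadoop itself between the mapper and reducer scripts):

```ruby
# Map: one line of text -> list of <word, 1> pairs
def word_map(line)
  line.split.map { |word| [word, 1] }
end

# Shuffle: group the intermediate pairs by key
# (done automatically by Hadoop between the Map and Reduce phases)
def shuffle(pairs)
  pairs.group_by { |key, _| key }
       .map { |key, kvs| [key, kvs.map { |_, value| value }] }
end

# Reduce: <word, list(counts)> -> <word, total>
def word_reduce(key, values)
  [key, values.inject(0) { |sum, v| sum + v }]
end

pairs = word_map("deer bear river") + word_map("car car river")
shuffle(pairs).each { |key, values| puts word_reduce(key, values).join("\t") }
```

Running this prints each word with its total count as a tab-separated line, which is exactly the "key\tvalue" text format that Hadoop Streaming exchanges with mapper and reducer processes.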
3. VIDEO DATABASE PROCESSING ON HADOOP

3.1 MapReduce for Video Database Processing
The Map and Reduce functions of MapReduce are both defined with respect to data structured in key-value pairs. In short, we can perform distributed processing by casting the data into key-value pairs in MapReduce form. However, for unstructured data such as video, creating key-value pairs and performing the processing can be assumed to be more difficult than for structured data. In this experiment, in order to use the Ruby programming language, we utilize an extension of Hadoop, namely Hadoop Streaming. To perform parallel distributed processing in MapReduce form, we need to write the programs used as the Map and Reduce functions in the Ruby programming language. Video database processing is performed by splitting the data in a video database and creating key-value pairs. For example, the frame number can be used as the key for a video frame. For parallel processing within a video frame, the frame is divided into multiple parts, and the part numbers can be the keys (identifiers) of these parts. Sorting is carried out using the key number, and joining the separated frames or parts is performed by the Reduce function. Figure 6 shows an example of the processing flow using MapReduce. In this figure, each video frame is divided into four parts, and each part has a unique key number.

Figure 6. Image processing flow using MapReduce

3.2 MapReduce Processing of Video Database
We implemented the following video database processing steps using Hadoop Streaming. Multiple sequential video frames are input, and image processing is performed for each video frame. We implemented a Map function for image processing. The input of the Map function is a single video frame, and the Map function produces one video frame as output.
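The key scheme of Section 3.1 can be sketched as follows, with a frame represented as an array of rows. The composite key format "<frame_no>-<part_no>" is an assumption for illustration; the paper does not specify its exact key encoding:

```ruby
# Map side: divide one frame into equal horizontal parts, keyed by
# "<frame_no>-<part_no>" so the Reduce step can sort and rejoin them.
def split_frame(frame_no, rows, parts)
  slice = rows.length / parts
  rows.each_slice(slice).each_with_index.map do |part, i|
    ["#{frame_no}-#{i}", part]
  end
end

# Reduce side: sort the parts by their part number and concatenate them
# back into the full frame.
def rejoin(pairs)
  pairs.sort_by { |key, _| key.split("-").last.to_i }
       .flat_map { |_, part| part }
end
```

Because the parts carry their position in the key, the Reduce function can reassemble the frame correctly no matter which Map task finished first.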
We process video frames in parallel with the slave servers, which use HDFS. The video database is stored in HDFS, and each Map process also writes its output to HDFS. Figure 7 shows an example of the video database processing flow using the Map function and HDFS. Multiple sequential video frames are input, and image processing is performed for each video frame. The results of the image processing are output to HDFS.

Figure 7. Sequential video frame processing flow using Map and HDFS

4. EXPERIMENT
In the experiment, we use a video database that contains a set of video data collected in the area of Nishi-ku in Fukuoka City and Usuki in Oita Prefecture. The experimental environment is as follows: OS: Ubuntu 11.04, CPU: Intel Core i5 2.8 GHz, Memory: 4 GB.
The software versions are as follows: Hadoop, Ruby 1.8.7, and Octave. The image size is as follows: pixels. The experiment is described as follows. We first create a grayscale image of the original image in parallel by using MapReduce. Then, features of the grayscale image are extracted in parallel by using MapReduce. The image processing is performed by calling a sub-program written in the Octave language (a high-level interpreted language used for mathematical model fitting, signal processing, and image processing) from a Ruby program. We configured the MapReduce system in pseudo-distributed mode. Hadoop is composed of a master server which manages slave servers, and the slave servers perform the actual image processing. In this configuration (pseudo-distributed mode), the master and slave servers actually run on the same machine. The number of copies of the video data is set to 1. The parallel processing of the video database using Hadoop Streaming is distributed over all of the cores of the CPU of a single machine.

4.1 Grayscale Images Created by MapReduce
In this experiment, we create a grayscale image of the original image in parallel by using MapReduce. First, we divide the image in the Map tasks. Next, the Reduce task accepts the processed images from the Map tasks, combines the divided image, and outputs the resulting image to HDFS. We used grayscale creation as an example of image processing using MapReduce. In making the grayscale image from the original, we convert the RGB values to the NTSC color space as a YIQ map that contains the equivalent NTSC luminance (Y) and chrominance (I and Q) color components as columns, where Y, I, and Q are defined as follows:

    [Y]   [0.299  0.587  0.114] [R]
    [I] = [0.596 -0.274 -0.322] [G]    (1)
    [Q]   [0.211 -0.523  0.312] [B]

Here, we specify the number of Map tasks as 4. Figure 8 shows the results of JobTracker when processing the video database, and shows that the number of Map tasks is 4 and the number of Reduce tasks is 1.
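The Y row of Equation (1) is what yields the grayscale value, and per pixel it can be computed directly. The following is a sketch using the standard NTSC coefficients; in the paper the actual conversion is performed by an Octave sub-program:

```ruby
# Grayscale value of one RGB pixel via the NTSC luminance Y of Equation (1).
def ntsc_luminance(r, g, b)
  0.299 * r + 0.587 * g + 0.114 * b
end

# Convert a whole image (array of rows of [R, G, B] pixels) to grayscale.
def to_grayscale(image)
  image.map { |row| row.map { |r, g, b| ntsc_luminance(r, g, b).round } }
end
```

Since the three coefficients sum to 1, pure white (255, 255, 255) maps to 255 and pure black to 0, so the grayscale output stays within the 8-bit range of the input.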
Figure 9 shows the results given by a Hadoop Web interface program when
processing image files with MapReduce. Here, we specify the output file on HDFS when we perform the re-combination of the 4-divided image from usuki avi _divide1.jpg to usuki avi _divide4.jpg. Figure 10 shows the Task ID, the processing time from start to end, and the status of each TaskTracker on Hadoop in the experiment. For performance considerations, the maximum number of slave processes allowed to execute is set to 2, and the tasks are then performed in pairs, for example, task_ _0004_m_ and task_ _0004_m_000001, followed by task_ _0004_m_ and task_ _0004_m_ As a result, we confirmed a processing time of 4 sec and that two tasks can be performed at the same time in parallel for this experimental setup.

Figure 8. Results of JobTracker when processing an image

Figure 9. Results shown by a Hadoop Web interface program during grayscale video image creation using MapReduce
Figure 10. Status of each TaskTracker on Hadoop when processing image files with MapReduce

4.2 Feature Extraction of Video Images by Using MapReduce
In this experiment, we create video images that depict extracted features in parallel by using MapReduce. First, we input multiple video frames and perform the feature extraction processes in parallel using the Map function. In the parallel processing, multiple slave servers use a video database stored in HDFS and output the processing results to HDFS. For extracting features from sequential video frames after multiple video frames are input to slave servers that execute a Map function, only the Map function is used, and it outputs the features for each image. When a pixel of a video frame contains a significant amount of features, such as corners and edges, a new pixel is generated in the output image, and its value is set to a feature value. Otherwise, a black pixel is generated in the output image. Let F and L be the original video frame and a smoothed image of F by a Gaussian distribution, respectively. We denote the individual pixels of L as l(i, j). Then, in order to perform feature extraction, we use the Prewitt filter, which detects edges and lines in an image. In this process, the filter obtains the horizontal and vertical contours. The filter coefficient is assumed to be a 3x3 matrix, from which an auto-correlation matrix is created. The image L is differentiated once in each of the horizontal and vertical directions, and the notation l_i and l_j is used for the respective derivatives. In addition, A, B, and C are defined as follows:

    A = Σ l_i^2,  B = Σ l_j^2,  C = Σ l_i l_j    (2)

where each sum is taken over the pixels of the 3x3 filter window. The auto-correlation matrix, denoted by M, is given in terms of A, B, and C as follows:
    M = | A  C |
        | C  B |    (3)

Then, the feature points of the Harris corner detection are extracted using the value c(i, j) of each pixel [5, 6]:

    c(i, j) = det(M) - k (tr(M))^2    (4)

Here, det and tr represent the determinant and the sum of the diagonal elements (the trace), respectively, and k is an adjustable parameter, generally taken to be within the range 0.04 to . In our experiment, k is taken to be . In addition, the eigenvalues λ1 and λ2 satisfy:

    det(M) = λ1 λ2 = AB - C^2,  tr(M) = λ1 + λ2 = A + B    (5)

A threshold T is used to decide which c(i, j) are used in the output image. A pixel value in the output image is set to 0 if c(i, j) < T. The absolute value of c(i, j), as calculated from the two eigenvalues λ1 and λ2, is denoted by R(i, j). In addition, the minimum and maximum values min(R(i, j)) and max(R(i, j)) are normalized to 0 and 255, respectively, in the output image. The pixel values in the output image thus have a 256-level range from 0 to 255. Using this method, we create a mask image by applying the threshold to the feature extraction image. Figure 11 shows examples of the original image (left), the feature extraction image (center), and the mask image (right). The feature extraction image is the result of processing the original image with the Harris corner detector, and the mask image is the result of thresholding the feature extraction image with the threshold T.

Figure 11. Examples of an input image, feature extraction image, and output image. The output image is used as the mask image
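Equations (2)-(5) can be checked with a small sketch: the Prewitt kernels give the derivatives l_i and l_j, and the Harris response combines A, B, and C. This is an illustration, not the paper's Octave code, and the default k below is only a commonly used value, since the paper's exact setting is not shown:

```ruby
# 3x3 Prewitt kernels for the horizontal (l_i) and vertical (l_j) derivatives.
PREWITT_X = [[-1, 0, 1], [-1, 0, 1], [-1, 0, 1]]
PREWITT_Y = [[-1, -1, -1], [0, 0, 0], [1, 1, 1]]

# Apply one 3x3 kernel centered at pixel (y, x) of a grayscale image.
def convolve3(img, kernel, y, x)
  sum = 0
  3.times { |dy| 3.times { |dx| sum += kernel[dy][dx] * img[y + dy - 1][x + dx - 1] } }
  sum
end

# Harris response of Equation (4) for M = [[A, C], [C, B]] (Equation 3),
# using det(M) = A*B - C^2 and tr(M) = A + B from Equation (5).
def harris_response(a, b, c, k = 0.04)
  (a * b - c**2) - k * (a + b)**2
end
```

At a vertical step edge the horizontal Prewitt response is large while the vertical one is zero, so only one eigenvalue is large and the Harris response is negative; at a corner both A and B are large and the response is positive, which is why thresholding c(i, j) isolates corner points.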
For the Map processing, we input a sequence of eight images and perform the process. As a result, we output eight corresponding image files. The output image files are used for masking the input images. Here, we specify the output files on HDFS when we perform image processing of the eight images usuki avi jpg through usuki avi jpg. Figure 12 shows the MapReduce status halfway through the execution; at this point, six of the eight tasks have been performed and 25% of the processing has been completed. Figure 13 shows the Task ID, the processing time from start to end, and the status of each TaskTracker on Hadoop. For performance considerations, the maximum number of slave processes executed is set to 2, so the eight tasks are performed in pairs in parallel, for example, task_ _0007_m_ and task_ _0007_m_000001, followed by task_ _0007_m_ and task_ _0007_m_ As a result, the processing time was 2 min 27 sec. Figure 14 shows a MapReduce status report used to confirm the successful termination of the MapReduce tasks. Here, we specify that the number of Map tasks is 8 and the number of Reduce tasks is 0.

Figure 12. MapReduce status in the middle of execution
Figure 13. Status of each TaskTracker on Hadoop

Figure 14. MapReduce status after termination of execution

The total number of Map tasks is set to 8, and the maximum number of slave processes executed is set to 2, 4, 6, or 8 in our experimental tests. The creation time of the mask image was 2 minutes and 27 seconds when we performed the slave processes in parallel using 2 processor cores. Without MapReduce, the processing time of the same task averages 2 minutes and 6 seconds, so the MapReduce version was about 21 seconds slower. We assume that this difference is due to the processing time for the input/output of the video database on HDFS, the access time of the external programs written in the Octave and Ruby programming languages, and the latency of starting the Map processing. Figures 15 and 16 show, respectively, the total processing time and the processing time of each TaskTracker when a sequence of eight images is processed using MapReduce in pseudo-distributed mode. Figure 17 shows an example of part of the program code for feature extraction of video images using MapReduce.
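The 21-second difference quoted above follows directly from the two reported timings, and is an overhead of the distributed setup rather than a speedup:

```ruby
# Timings reported in Section 4.2, converted to seconds: the MapReduce run
# with 2 slave processes versus the same task run without MapReduce.
with_mapreduce    = 2 * 60 + 27   # 147 s
without_mapreduce = 2 * 60 + 6    # 126 s

# Positive value = extra time spent on HDFS I/O, external program startup
# (Octave/Ruby), and Map task launch latency.
overhead = with_mapreduce - without_mapreduce
puts overhead   # prints 21
```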
Figure 15. Total processing time versus the maximum number of slave processes

Figure 16. Processing time for each TaskTracker for different numbers of Map tasks performed simultaneously

5. SUMMARY AND FUTURE WORK
In this paper, we described using a video database of video collected with a video camera. In an experiment, we processed sequences of video frames with MapReduce to create grayscale images and to extract features of the video images. In the process of creating the grayscale images, each video frame was divided into multiple parts. In the feature extraction, frame numbers were used as the key numbers. In future work, we need to build a distributed environment that combines multiple machines, conduct large-scale experiments involving sequential video images, evaluate the
processing speed of the implementation with MapReduce, and verify efficient designs of key-value pairs.

Figure 17. Example program for feature extraction of video images using MapReduce

6. ACKNOWLEDGMENT
This work was supported by a Grant-in-Aid for Scientific Research (C).

REFERENCES
[1] J. Dean and S. Ghemawat, MapReduce: Simplified data processing on large clusters. Paper presented at OSDI'04: Sixth Symposium on Operating System Design and Implementation, San Francisco, USA, December 6-8, 2004.
[2] S. Das, Y. Sismanis, K.S. Beyer, R. Gemulla, P.J. Haas, and J. McPherson, Ricardo: Integrating R and Hadoop. Paper presented at the 2010 International Conference on Management of Data, Indianapolis, USA, June 6-11, 2010.
[3] C. Sweeney, L. Liu, S. Arietta, and J. Lawrence, HIPI: A Hadoop image processing interface for image-based MapReduce tasks, B.S. Thesis. University of Virginia, Department of Computer Science, 2011.
[4] K. Wiley, A. Connolly, S. Krughoff, J. Gardner, M. Balazinska, B. Howe, Y. Kwon, and Y. Bu, Astronomical image processing with Hadoop. Paper presented at Astronomical Data Analysis Software and Systems XX, Boston, USA, November 7-11, 2010.
[5] C. Harris and M. Stephens, A combined corner and edge detector. Paper presented at the 4th Alvey Vision Conference, Manchester, England, August 31 - September 2, 1988.
[6] Y. Kanazawa and K. Kanatani, Detection of feature points for computer vision. The Journal of the Institute of Electronics, Information, and Communication Engineers, 87(12), 2004.
Open source software framework designed for storage and processing of large scale data on clusters of commodity hardware
Open source software framework designed for storage and processing of large scale data on clusters of commodity hardware Created by Doug Cutting and Mike Carafella in 2005. Cutting named the program after
Extreme Computing. Hadoop MapReduce in more detail. www.inf.ed.ac.uk
Extreme Computing Hadoop MapReduce in more detail How will I actually learn Hadoop? This class session Hadoop: The Definitive Guide RTFM There is a lot of material out there There is also a lot of useless
Overview. Big Data in Apache Hadoop. - HDFS - MapReduce in Hadoop - YARN. https://hadoop.apache.org. Big Data Management and Analytics
Overview Big Data in Apache Hadoop - HDFS - MapReduce in Hadoop - YARN https://hadoop.apache.org 138 Apache Hadoop - Historical Background - 2003: Google publishes its cluster architecture & DFS (GFS)
LARGE-SCALE DATA PROCESSING USING MAPREDUCE IN CLOUD COMPUTING ENVIRONMENT
LARGE-SCALE DATA PROCESSING USING MAPREDUCE IN CLOUD COMPUTING ENVIRONMENT Samira Daneshyar 1 and Majid Razmjoo 2 1,2 School of Computer Science, Centre of Software Technology and Management (SOFTEM),
Performance and Energy Efficiency of. Hadoop deployment models
Performance and Energy Efficiency of Hadoop deployment models Contents Review: What is MapReduce Review: What is Hadoop Hadoop Deployment Models Metrics Experiment Results Summary MapReduce Introduced
Open source Google-style large scale data analysis with Hadoop
Open source Google-style large scale data analysis with Hadoop Ioannis Konstantinou Email: [email protected] Web: http://www.cslab.ntua.gr/~ikons Computing Systems Laboratory School of Electrical
The Comprehensive Performance Rating for Hadoop Clusters on Cloud Computing Platform
The Comprehensive Performance Rating for Hadoop Clusters on Cloud Computing Platform Fong-Hao Liu, Ya-Ruei Liou, Hsiang-Fu Lo, Ko-Chin Chang, and Wei-Tsong Lee Abstract Virtualization platform solutions
HADOOP MOCK TEST HADOOP MOCK TEST I
http://www.tutorialspoint.com HADOOP MOCK TEST Copyright tutorialspoint.com This section presents you various set of Mock Tests related to Hadoop Framework. You can download these sample mock tests at
MapReduce and Hadoop. Aaron Birkland Cornell Center for Advanced Computing. January 2012
MapReduce and Hadoop Aaron Birkland Cornell Center for Advanced Computing January 2012 Motivation Simple programming model for Big Data Distributed, parallel but hides this Established success at petabyte
Jeffrey D. Ullman slides. MapReduce for data intensive computing
Jeffrey D. Ullman slides MapReduce for data intensive computing Single-node architecture CPU Machine Learning, Statistics Memory Classical Data Mining Disk Commodity Clusters Web data sets can be very
Getting to know Apache Hadoop
Getting to know Apache Hadoop Oana Denisa Balalau Télécom ParisTech October 13, 2015 1 / 32 Table of Contents 1 Apache Hadoop 2 The Hadoop Distributed File System(HDFS) 3 Application management in the
International Journal of Advance Research in Computer Science and Management Studies
Volume 2, Issue 8, August 2014 ISSN: 2321 7782 (Online) International Journal of Advance Research in Computer Science and Management Studies Research Article / Survey Paper / Case Study Available online
Weekly Report. Hadoop Introduction. submitted By Anurag Sharma. Department of Computer Science and Engineering. Indian Institute of Technology Bombay
Weekly Report Hadoop Introduction submitted By Anurag Sharma Department of Computer Science and Engineering Indian Institute of Technology Bombay Chapter 1 What is Hadoop? Apache Hadoop (High-availability
Processing Large Amounts of Images on Hadoop with OpenCV
Processing Large Amounts of Images on Hadoop with OpenCV Timofei Epanchintsev 1,2 and Andrey Sozykin 1,2 1 IMM UB RAS, Yekaterinburg, Russia, 2 Ural Federal University, Yekaterinburg, Russia {eti,avs}@imm.uran.ru
Hadoop Scheduler w i t h Deadline Constraint
Hadoop Scheduler w i t h Deadline Constraint Geetha J 1, N UdayBhaskar 2, P ChennaReddy 3,Neha Sniha 4 1,4 Department of Computer Science and Engineering, M S Ramaiah Institute of Technology, Bangalore,
Recognization of Satellite Images of Large Scale Data Based On Map- Reduce Framework
Recognization of Satellite Images of Large Scale Data Based On Map- Reduce Framework Vidya Dhondiba Jadhav, Harshada Jayant Nazirkar, Sneha Manik Idekar Dept. of Information Technology, JSPM s BSIOTR (W),
Parallel Processing of cluster by Map Reduce
Parallel Processing of cluster by Map Reduce Abstract Madhavi Vaidya, Department of Computer Science Vivekanand College, Chembur, Mumbai [email protected] MapReduce is a parallel programming model
Big Data Analytics(Hadoop) Prepared By : Manoj Kumar Joshi & Vikas Sawhney
Big Data Analytics(Hadoop) Prepared By : Manoj Kumar Joshi & Vikas Sawhney General Agenda Understanding Big Data and Big Data Analytics Getting familiar with Hadoop Technology Hadoop release and upgrades
DATA MINING WITH HADOOP AND HIVE Introduction to Architecture
DATA MINING WITH HADOOP AND HIVE Introduction to Architecture Dr. Wlodek Zadrozny (Most slides come from Prof. Akella s class in 2014) 2015-2025. Reproduction or usage prohibited without permission of
Introduction to Hadoop
1 What is Hadoop? Introduction to Hadoop We are living in an era where large volumes of data are available and the problem is to extract meaning from the data avalanche. The goal of the software tools
What We Can Do in the Cloud (2) -Tutorial for Cloud Computing Course- Mikael Fernandus Simalango WISE Research Lab Ajou University, South Korea
What We Can Do in the Cloud (2) -Tutorial for Cloud Computing Course- Mikael Fernandus Simalango WISE Research Lab Ajou University, South Korea Overview Riding Google App Engine Taming Hadoop Summary Riding
Mobile Cloud Computing for Data-Intensive Applications
Mobile Cloud Computing for Data-Intensive Applications Senior Thesis Final Report Vincent Teo, [email protected] Advisor: Professor Priya Narasimhan, [email protected] Abstract The computational and storage
HDFS Users Guide. Table of contents
Table of contents 1 Purpose...2 2 Overview...2 3 Prerequisites...3 4 Web Interface...3 5 Shell Commands... 3 5.1 DFSAdmin Command...4 6 Secondary NameNode...4 7 Checkpoint Node...5 8 Backup Node...6 9
Apache Hadoop. Alexandru Costan
1 Apache Hadoop Alexandru Costan Big Data Landscape No one-size-fits-all solution: SQL, NoSQL, MapReduce, No standard, except Hadoop 2 Outline What is Hadoop? Who uses it? Architecture HDFS MapReduce Open
The Performance Characteristics of MapReduce Applications on Scalable Clusters
The Performance Characteristics of MapReduce Applications on Scalable Clusters Kenneth Wottrich Denison University Granville, OH 43023 [email protected] ABSTRACT Many cluster owners and operators have
Take An Internal Look at Hadoop. Hairong Kuang Grid Team, Yahoo! Inc [email protected]
Take An Internal Look at Hadoop Hairong Kuang Grid Team, Yahoo! Inc [email protected] What s Hadoop Framework for running applications on large clusters of commodity hardware Scale: petabytes of data
Hadoop Parallel Data Processing
MapReduce and Implementation Hadoop Parallel Data Processing Kai Shen A programming interface (two stage Map and Reduce) and system support such that: the interface is easy to program, and suitable for
Open source large scale distributed data management with Google s MapReduce and Bigtable
Open source large scale distributed data management with Google s MapReduce and Bigtable Ioannis Konstantinou Email: [email protected] Web: http://www.cslab.ntua.gr/~ikons Computing Systems Laboratory
HADOOP MOCK TEST HADOOP MOCK TEST II
http://www.tutorialspoint.com HADOOP MOCK TEST Copyright tutorialspoint.com This section presents you various set of Mock Tests related to Hadoop Framework. You can download these sample mock tests at
Reduction of Data at Namenode in HDFS using harballing Technique
Reduction of Data at Namenode in HDFS using harballing Technique Vaibhav Gopal Korat, Kumar Swamy Pamu [email protected] [email protected] Abstract HDFS stands for the Hadoop Distributed File System.
This exam contains 13 pages (including this cover page) and 18 questions. Check to see if any pages are missing.
Big Data Processing 2013-2014 Q2 April 7, 2014 (Resit) Lecturer: Claudia Hauff Time Limit: 180 Minutes Name: Answer the questions in the spaces provided on this exam. If you run out of room for an answer,
An Efficient Analysis of Web Server Log Files for Session Identification using Hadoop Mapreduce
Proc. of Int. Conf. on Advances in Communication, Network, and Computing, CNC An Efficient Analysis of Web Server Log Files for Session Identification using Hadoop Mapreduce Savitha K 1 and Vijaya MS 2
Distributed Image Processing using Hadoop MapReduce framework. Binoy A Fernandez (200950006) Sameer Kumar (200950031)
using Hadoop MapReduce framework Binoy A Fernandez (200950006) Sameer Kumar (200950031) Objective To demonstrate how the hadoop mapreduce framework can be extended to work with image data for distributed
Internals of Hadoop Application Framework and Distributed File System
International Journal of Scientific and Research Publications, Volume 5, Issue 7, July 2015 1 Internals of Hadoop Application Framework and Distributed File System Saminath.V, Sangeetha.M.S Abstract- Hadoop
Introduc)on to the MapReduce Paradigm and Apache Hadoop. Sriram Krishnan [email protected]
Introduc)on to the MapReduce Paradigm and Apache Hadoop Sriram Krishnan [email protected] Programming Model The computa)on takes a set of input key/ value pairs, and Produces a set of output key/value pairs.
A very short Intro to Hadoop
4 Overview A very short Intro to Hadoop photo by: exfordy, flickr 5 How to Crunch a Petabyte? Lots of disks, spinning all the time Redundancy, since disks die Lots of CPU cores, working all the time Retry,
Hadoop 只 支 援 用 Java 開 發 嘛? Is Hadoop only support Java? 總 不 能 全 部 都 重 新 設 計 吧? 如 何 與 舊 系 統 相 容? Can Hadoop work with existing software?
Hadoop 只 支 援 用 Java 開 發 嘛? Is Hadoop only support Java? 總 不 能 全 部 都 重 新 設 計 吧? 如 何 與 舊 系 統 相 容? Can Hadoop work with existing software? 可 以 跟 資 料 庫 結 合 嘛? Can Hadoop work with Databases? 開 發 者 們 有 聽 到
How to properly misuse Hadoop. Marcel Huntemann NERSC tutorial session 2/12/13
How to properly misuse Hadoop Marcel Huntemann NERSC tutorial session 2/12/13 History Created by Doug Cutting (also creator of Apache Lucene). 2002 Origin in Apache Nutch (open source web search engine).
CS380 Final Project Evaluating the Scalability of Hadoop in a Real and Virtual Environment
CS380 Final Project Evaluating the Scalability of Hadoop in a Real and Virtual Environment James Devine December 15, 2008 Abstract Mapreduce has been a very successful computational technique that has
A Cost-Evaluation of MapReduce Applications in the Cloud
1/23 A Cost-Evaluation of MapReduce Applications in the Cloud Diana Moise, Alexandra Carpen-Amarie Gabriel Antoniu, Luc Bougé KerData team 2/23 1 MapReduce applications - case study 2 3 4 5 3/23 MapReduce
INTERNATIONAL JOURNAL OF PURE AND APPLIED RESEARCH IN ENGINEERING AND TECHNOLOGY
INTERNATIONAL JOURNAL OF PURE AND APPLIED RESEARCH IN ENGINEERING AND TECHNOLOGY A PATH FOR HORIZING YOUR INNOVATIVE WORK A REVIEW ON HIGH PERFORMANCE DATA STORAGE ARCHITECTURE OF BIGDATA USING HDFS MS.
Lecture 2 (08/31, 09/02, 09/09): Hadoop. Decisions, Operations & Information Technologies Robert H. Smith School of Business Fall, 2015
Lecture 2 (08/31, 09/02, 09/09): Hadoop Decisions, Operations & Information Technologies Robert H. Smith School of Business Fall, 2015 K. Zhang BUDT 758 What we ll cover Overview Architecture o Hadoop
The Hadoop Framework
The Hadoop Framework Nils Braden University of Applied Sciences Gießen-Friedberg Wiesenstraße 14 35390 Gießen [email protected] Abstract. The Hadoop Framework offers an approach to large-scale
Distributed Filesystems
Distributed Filesystems Amir H. Payberah Swedish Institute of Computer Science [email protected] April 8, 2014 Amir H. Payberah (SICS) Distributed Filesystems April 8, 2014 1 / 32 What is Filesystem? Controls
TP1: Getting Started with Hadoop
TP1: Getting Started with Hadoop Alexandru Costan MapReduce has emerged as a leading programming model for data-intensive computing. It was originally proposed by Google to simplify development of web
Telecom Data processing and analysis based on Hadoop
COMPUTER MODELLING & NEW TECHNOLOGIES 214 18(12B) 658-664 Abstract Telecom Data processing and analysis based on Hadoop Guofan Lu, Qingnian Zhang *, Zhao Chen Wuhan University of Technology, Wuhan 4363,China
IMPLEMENTING PREDICTIVE ANALYTICS USING HADOOP FOR DOCUMENT CLASSIFICATION ON CRM SYSTEM
IMPLEMENTING PREDICTIVE ANALYTICS USING HADOOP FOR DOCUMENT CLASSIFICATION ON CRM SYSTEM Sugandha Agarwal 1, Pragya Jain 2 1,2 Department of Computer Science & Engineering ASET, Amity University, Noida,
Tutorial for Assignment 2.0
Tutorial for Assignment 2.0 Florian Klien & Christian Körner IMPORTANT The presented information has been tested on the following operating systems Mac OS X 10.6 Ubuntu Linux The installation on Windows
MAPREDUCE Programming Model
CS 2510 COMPUTER OPERATING SYSTEMS Cloud Computing MAPREDUCE Dr. Taieb Znati Computer Science Department University of Pittsburgh MAPREDUCE Programming Model Scaling Data Intensive Application MapReduce
Image Search by MapReduce
Image Search by MapReduce COEN 241 Cloud Computing Term Project Final Report Team #5 Submitted by: Lu Yu Zhe Xu Chengcheng Huang Submitted to: Prof. Ming Hwa Wang 09/01/2015 Preface Currently, there s
t] open source Hadoop Beginner's Guide ij$ data avalanche Garry Turkington Learn how to crunch big data to extract meaning from
Hadoop Beginner's Guide Learn how to crunch big data to extract meaning from data avalanche Garry Turkington [ PUBLISHING t] open source I I community experience distilled ftu\ ij$ BIRMINGHAMMUMBAI ')
Survey on Scheduling Algorithm in MapReduce Framework
Survey on Scheduling Algorithm in MapReduce Framework Pravin P. Nimbalkar 1, Devendra P.Gadekar 2 1,2 Department of Computer Engineering, JSPM s Imperial College of Engineering and Research, Pune, India
MapReduce, Hadoop and Amazon AWS
MapReduce, Hadoop and Amazon AWS Yasser Ganjisaffar http://www.ics.uci.edu/~yganjisa February 2011 What is Hadoop? A software framework that supports data-intensive distributed applications. It enables
marlabs driving digital agility WHITEPAPER Big Data and Hadoop
marlabs driving digital agility WHITEPAPER Big Data and Hadoop Abstract This paper explains the significance of Hadoop, an emerging yet rapidly growing technology. The prime goal of this paper is to unveil
An Experimental Approach Towards Big Data for Analyzing Memory Utilization on a Hadoop cluster using HDFS and MapReduce.
An Experimental Approach Towards Big Data for Analyzing Memory Utilization on a Hadoop cluster using HDFS and MapReduce. Amrit Pal Stdt, Dept of Computer Engineering and Application, National Institute
CS 378 Big Data Programming. Lecture 2 Map- Reduce
CS 378 Big Data Programming Lecture 2 Map- Reduce MapReduce Large data sets are not new What characterizes a problem suitable for MR? Most or all of the data is processed But viewed in small increments
Distributed File System. MCSN N. Tonellotto Complements of Distributed Enabling Platforms
Distributed File System 1 How do we get data to the workers? NAS Compute Nodes SAN 2 Distributed File System Don t move data to workers move workers to the data! Store data on the local disks of nodes
Session: Big Data get familiar with Hadoop to use your unstructured data Udo Brede Dell Software. 22 nd October 2013 10:00 Sesión B - DB2 LUW
Session: Big Data get familiar with Hadoop to use your unstructured data Udo Brede Dell Software 22 nd October 2013 10:00 Sesión B - DB2 LUW 1 Agenda Big Data The Technical Challenges Architecture of Hadoop
How To Install Hadoop 1.2.1.1 From Apa Hadoop 1.3.2 To 1.4.2 (Hadoop)
Contents Download and install Java JDK... 1 Download the Hadoop tar ball... 1 Update $HOME/.bashrc... 3 Configuration of Hadoop in Pseudo Distributed Mode... 4 Format the newly created cluster to create
THE HADOOP DISTRIBUTED FILE SYSTEM
THE HADOOP DISTRIBUTED FILE SYSTEM Konstantin Shvachko, Hairong Kuang, Sanjay Radia, Robert Chansler Presented by Alexander Pokluda October 7, 2013 Outline Motivation and Overview of Hadoop Architecture,
Journal of science STUDY ON REPLICA MANAGEMENT AND HIGH AVAILABILITY IN HADOOP DISTRIBUTED FILE SYSTEM (HDFS)
Journal of science e ISSN 2277-3290 Print ISSN 2277-3282 Information Technology www.journalofscience.net STUDY ON REPLICA MANAGEMENT AND HIGH AVAILABILITY IN HADOOP DISTRIBUTED FILE SYSTEM (HDFS) S. Chandra
