Performance Optimization of a Distributed Transcoding System based on Hadoop for Multimedia Streaming Services


RESEARCH ARTICLE
Advanced Science Letters, Vol. 4, 2011
Copyright 2011 American Scientific Publishers. All rights reserved. Printed in the United States of America.
Adv. Sci. Lett. Vol. 4, No. 2, /2011/4/400/008 doi: /asl

Myoungjin Kim 1, Yun Cui 1, Seungho Han 1, Hanku Lee 1,2,*
1 Department of Internet and Multimedia Engineering, Konkuk University, Seoul, Korea
2 Center for Social Media Cloud Computing, Konkuk University, Seoul, Korea
* Address: tough105@konkuk.ac.kr

In recent times, significant progress has been achieved in cost-effective and timely processing of large amounts of data through Hadoop, which is based on the emerging MapReduce framework. Building on these developments, we proposed a Hadoop-based Distributed Video Transcoding System that transcodes large video data sets into specific video formats depending on user-requested options. In order to reduce the transcoding time significantly, we apply the Hadoop Distributed File System and the MapReduce framework to our system. Hadoop and MapReduce are designed to process petabyte-scale text data in a parallel and distributed manner. However, our system processes multimedia data. In this study, we measure the total transcoding time for various values of five MapReduce tuning parameters: block replication factor, Hadoop Distributed File System block size, Java Virtual Machine reuse option, maximum number of map slots, and input/output buffer size. Based on the experimental results, we determine the optimal values of the parameters affecting transcoding in order to improve the performance of our Hadoop-based system, which processes a large amount of video data. The results show that the transcoding performance of our system differs notably depending on the values of the MapReduce tuning parameters.

Keywords: Performance Tuning, Distributed Transcoding, MapReduce, Hadoop Optimization, Cloud Computing.

1. INTRODUCTION

In recent times, Hadoop based on the MapReduce model has gained considerable attention because its data processing techniques are not time-consuming and are suitable for processing large-scale data. In particular, MapReduce is emerging as an important programming model for developing distributed data-processing applications such as web indexing, data mining, log file analysis, financial analysis, and scientific research. These applications process petabyte-scale or terabyte-scale text-based data rather than multimedia data (images, videos, audio).

In order to reduce transcoding time significantly, we proposed a Hadoop-based Distributed Video Transcoding System (HDVTS) based on MapReduce [2] running on a Hadoop Distributed File System (HDFS) [4]. The proposed system is able to transcode a variety of video coding formats into the MPEG-4 video format. Improvements in quality and speed are realized by adopting HDFS for storing large amounts of video data created by numerous users, MapReduce for distributed and parallel processing of video data, and the open-source Xuggler libraries for transcoding.

Performance optimization in distributed data-processing applications that use MapReduce is considered an important research topic. Many studies have focused on three approaches for tuning MapReduce programming: setting tuning parameters for job and task configuration in MapReduce [3,8], improving the existing scheduling strategies in Hadoop [5], and optimizing MapReduce in heterogeneous clusters [7].

MapReduce is widely utilized for large-scale text data analysis in the cloud computing [6,9,10] environment. Several MapReduce tuning parameters must be set by the users and administrators who operate MapReduce applications. Hence, in order to assist inexperienced administrators, Babu [8] presented techniques that automate the process of setting the tuning parameters for MapReduce programs. In addition, Wang et al. [3] presented MRPerf to facilitate the exploration of the MapReduce design space by analyzing various aspects of a MapReduce setup. Their experiments analyzed a TeraSort job, which is a standard benchmark for evaluating the sorting of terabyte text data. Thus, these performance tuning techniques are applicable only to MapReduce programs that handle terabyte-scale or petabyte-scale text data. However, the MapReduce framework applied to our system processes multimedia data. Hence, optimal tuning parameters for video transcoding in Hadoop must be considered.

The job configuration and the task tracker configuration parameters play a significant role in the performance of Hadoop applications. In this study, the optimal values of the parameters affecting the transcoding performance are determined by measuring the total transcoding time for various values of five parameters: dfs.replication, dfs.block.size, mapred.job.reuse.jvm.num.tasks, mapred.tasktracker.map.tasks.maximum, and io.file.buffer.size. These represent the block replication factor, block size, Java Virtual Machine (JVM) reuse option, maximum number of map slots, and buffer size, respectively.
The main contribution of our study is to present the out-of-the-box performance of a MapReduce application that processes very large video data sets on our cloud cluster, and to provide optimal values of the tuning parameters for media processing to MapReduce program users who are not familiar with the configuration of Hadoop options.

The remainder of this paper is organized as follows: Hadoop and performance tuning are described in Section 2. Section 3 gives an overview of the Hadoop-based Distributed Video Transcoding System (HDVTS). In Section 4, the Hadoop configuration parameters for performance tuning are presented, together with the hardware, the data sets, and the experimental methods. In Section 5, the results of several experiments conducted on our cloud cluster are discussed and analyzed. Section 6 contains the conclusion.

2. Hadoop Framework

Hadoop, inspired by Google's MapReduce and Google File System [1], is a software framework that supports data-intensive distributed applications handling thousands of nodes and petabytes of data. It can perform scalable and timely analytical processing of large data sets to extract useful information. Hadoop consists of two important frameworks: 1) the Hadoop Distributed File System (HDFS), a scalable and portable file system written in Java, and 2) MapReduce, a framework for processing large data sets.

HDFS is a distributed file system for supporting applications that process petabyte-scale or gigabyte-scale data sets across a cluster of commodity machines. HDFS has a master-slave architecture, with a master server called the NameNode and slaves called DataNodes. The NameNode controls file operations such as open, close, and rename, and the DataNodes are responsible for the file read and write operations requested by clients.
In order to ensure fault tolerance, HDFS splits large data sets into blocks (default block size: 64 MB) and stores them with replication across the data nodes.

The MapReduce framework provides a specific programming model and a run-time system for processing and creating large data sets amenable to various real-world tasks. This framework also handles automatic scheduling, communication, and synchronization for processing huge data sets, and it provides fault tolerance. The MapReduce programming model is executed in two main steps called mapping and reducing, which are defined by mapper and reducer functions. Each phase has a list of key and value pairs as input and output. In the mapping step, MapReduce receives input data sets and then feeds each data element to the mapper in the form of key and value pairs. In the reducing step, all the outputs from the mapper are processed, and the final result is generated by the reducer using the merging process.

3. Overview of HDVTS

In this section, we briefly describe our proposed system architecture.

1) Our system contains a codec transcoding function and a function for converting to a different display size, codec method, and container format.
2) Our system mainly focuses on the batch processing of large numbers of video files collected over a fixed period rather than the processing of small video files collected in real time.
3) HDFS is applied to our system in order to avoid the high communication cost of transferring video files for distributed processing. HDFS is also applied because its large chunk size (64 MB) policy is suitable for processing video files and because it is a user-level distributed system.
4) Our system follows the load balancing, fault tolerance, and merging and splitting policies provided by MapReduce for distributed processing.
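The mapping and reducing steps from Section 2, applied to the batch-of-video-files setting listed above, can be illustrated with a small single-process simulation. This is only a sketch of the programming model, not the HDVTS implementation and not the Hadoop API: the `mapper`, `reducer`, and `run_job` names, the file names, and the string placeholders standing in for video chunks are all hypothetical.

```python
from collections import defaultdict


def mapper(key, value):
    """Hypothetical mapper: stands in for transcoding one chunk of a video file."""
    yield key, f"mp4({value})"


def reducer(key, values):
    """Hypothetical reducer: merges all mapper outputs that share a key."""
    return key, "+".join(values)


def run_job(records, mapper, reducer):
    """Simulate the mapping and reducing steps on a list of (key, value) pairs."""
    grouped = defaultdict(list)
    # Mapping step: feed every input pair to the mapper and group its output by key.
    for key, value in records:
        for out_key, out_value in mapper(key, value):
            grouped[out_key].append(out_value)
    # Reducing step: merge the grouped values of each key into one result.
    return [reducer(key, values) for key, values in sorted(grouped.items())]


if __name__ == "__main__":
    # Placeholder input: (file name, chunk label) pairs instead of real video data.
    records = [("a.avi", "chunk0"), ("b.avi", "chunk0"), ("a.avi", "chunk1")]
    for name, merged in run_job(records, mapper, reducer):
        print(name, merged)
```

In a real Hadoop job the grouping, scheduling, and fault handling are done by the framework across many nodes; the sketch only shows how key and value pairs flow through the two steps.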
HDVTS is mainly divided into four domains: Video Data Collection Domain (VDCD), HDFS-based Splitting and Merging Domain (HbSMD), MapReduce-based Transcoding Domain (MbTD), and Cloud-based Infrastructure Service Domain (CbISD).

The core processing for video transcoding is briefly explained as follows. The proposed system uses HDFS as storage for distributed parallel processing. The extremely large amount of collected data is automatically distributed across the data nodes of HDFS. For distributed parallel processing, the proposed system exploits the Hadoop MapReduce framework. In addition, the Xuggler libraries for video resizing and encoding are utilized in the Mapper. The Map function processes each chunk of video data in a distributed and parallel manner. Figure 1 shows the diagram of an HDVTS. In this prototype, users and administrators can select video transcoding options such as format, codec, bitrate, width, and height, and audio transcoding options such as codec, bitrate, and sample rate. Further, summary information about the system is monitored, including the available storage capacity of HDFS, the activation state of the data nodes, and the progress status of the MapReduce job.

Fig. 1. Diagram of a Hadoop-based Distributed Video Transcoding System

4. Hadoop Configuration Parameters for Performance Tuning

The performance of a MapReduce job in Hadoop can be controlled by more than 190 Job, Jobtracker, and Tasktracker configuration parameters. We select five parameters that are expected to significantly affect the performance of a transcoding MapReduce job. Table 1 lists the selected subset of tuning parameters, which we use to provide empirical evidence of performance differences with respect to transcoding time. The configuration of these parameters controls job behavior. Therefore, the adjustment and combination of parameters must be determined appropriately based on the size and type of the data sets. The values of dfs.replication and dfs.block.size can be changed via the hdfs-site.xml file in Hadoop. The values of mapred.job.reuse.jvm.num.tasks and mapred.tasktracker.map.tasks.maximum can be adjusted via the mapred-site.xml file. The value of io.file.buffer.size can be modified in the core-site.xml file.
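As an illustration of where each parameter lives, the following sketch renders the three site files with the default values reported in this paper. It is a minimal example, not part of HDVTS: the `SETTINGS` dictionary and the `render_site_file` helper are hypothetical names, and the sizes are written in bytes on the assumption that dfs.block.size and io.file.buffer.size are given as byte counts.

```python
import xml.etree.ElementTree as ET

# Parameter-to-file mapping taken from the text above. The values are the
# defaults reported in the paper; sizes are assumed to be expressed in bytes.
SETTINGS = {
    "hdfs-site.xml": {
        "dfs.replication": "3",
        "dfs.block.size": str(64 * 1024 * 1024),  # 64 MB
    },
    "mapred-site.xml": {
        "mapred.job.reuse.jvm.num.tasks": "1",
        "mapred.tasktracker.map.tasks.maximum": "2",
    },
    "core-site.xml": {
        "io.file.buffer.size": str(4 * 1024),  # 4 KB
    },
}


def render_site_file(properties):
    """Return the XML text of a Hadoop site file holding the given properties."""
    root = ET.Element("configuration")
    for name, value in properties.items():
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = value
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    for filename, properties in SETTINGS.items():
        print(filename)
        print(render_site_file(properties))
```

Changing one entry in `SETTINGS` and regenerating the file corresponds to one step of the one-parameter-at-a-time experiments described in this section.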
Performance evaluation is conducted on a 28-node HDFS cluster consisting of 1 master node and 27 slave nodes (data nodes). Each node runs Linux (CentOS 5.5) and is equipped with two Intel Xeon 4-core 2.13 GHz processors, 4 GB of registered ECC DDR memory, and a 1 TB SATA-2 disk. All nodes are interconnected with 100 Mbps Ethernet adapters. We also use Java 1.6.0_23, Hadoop, and Xuggler 3.4 for video transcoding. To evaluate the performance of encoding very large amounts of video files into target files, we create and use three video data sets of different sizes (5, 10, and 20 GB). The total time to transcode the original video data sets (Xvid, AVI, 200 MB, ) into target files (MPEG4, MP4, 60 MB, ) is measured.
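To put the cluster description in terms of map capacity, the short sketch below computes how many map slots the cluster offers for each value of mapred.tasktracker.map.tasks.maximum considered in Table 1, and how that compares with the eight cores per node. It assumes that task trackers run only on the 27 slave nodes; the function names are hypothetical.

```python
SLAVE_NODES = 27        # data nodes; assumes the master node runs no map tasks
CORES_PER_NODE = 2 * 4  # two 4-core Xeon processors per node


def total_map_slots(slots_per_node, nodes=SLAVE_NODES):
    """Map tasks that can run at the same time across the whole cluster."""
    return slots_per_node * nodes


def slots_per_core(slots_per_node, cores=CORES_PER_NODE):
    """Ratio of concurrent map tasks to CPU cores on a single node."""
    return slots_per_node / cores


if __name__ == "__main__":
    # Values considered for mapred.tasktracker.map.tasks.maximum in Table 1.
    for value in (2, 4, 8, 12, 16):
        print(value, total_map_slots(value), slots_per_core(value))
```

A ratio below 1 leaves cores idle, while a ratio above 1 makes map tasks share cores on a node.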

Table 1. A subset of tuning configuration parameters in Hadoop

Parameter Name                          Default Value   Values Considered
dfs.replication                         3               1, 2, 3, 4, 5
dfs.block.size                          64 MB           32 MB, 64 MB, 128 MB, 256 MB, 512 MB
mapred.job.reuse.jvm.num.tasks          1               1, -1
io.file.buffer.size                     4 KB            4 KB, 128 KB, 256 KB, 512 KB, 1024 KB
mapred.tasktracker.map.tasks.maximum    2               2, 4, 8, 12, 16

The transcoding time obtained with the default values of the Hadoop options is compared with the transcoding time obtained with other values of these options. The following default Hadoop option values are used during the experiments: (1) JVM runs in server mode with 1024 MB heap memory for map tasks, (2) JVM reuse option is enabled, (3) HDFS block size is 64 MB, (4) block replication factor is three, and (5) I/O file buffer size is 4 KB.

The optimal values of the tuning parameters are determined by analyzing the transcoding performance for five sets of experiments: (1) block replication factor is varied (1, 2, 3, 4, 5), (2) block size is varied (32 MB, 64 MB, 128 MB, 256 MB, 512 MB), (3) buffer size is changed, (4) maximum number of map slots per task tracker is varied, and (5) JVM reuse option is enabled (value: -1) and disabled (value: 1).

5. Evaluation and Results

In this section, we demonstrate the differences in job running times for transcoding depending on appropriate and inappropriate parameter settings for the five sets of experiments. Even with the default values for the Hadoop options, our system transcodes very large video data sets quickly. For example, according to Figure 2, the transcoding process requires approximately 236 sec (about 4 min), 385 sec (about 6 min), and 696 sec (about 12 min) for the 5 GB, 10 GB, and 20 GB data sets, respectively.

Figures 2 and 3 show the encoding time required to complete the transcoding process for the three data sets. Two tuning parameters, dfs.replication representing the HDFS block replication factor and dfs.block.size representing the HDFS block size, are varied in Figures 2 and 3, respectively, while the other tuning parameters keep their default values. From these figures, a significant impact on transcoding performance is observed. First, when dfs.replication is set to 2 or 3, our system shows an improvement in the transcoding process. Between these two values, the value 2 is preferable because it has a lower storage space requirement for the replicas generated based on dfs.replication. Further, in our system, the performance is better for a dfs.block.size value of 256 MB or 512 MB than for other values. Thus, this parameter should be set to a value larger than or approximately equal to the original file size, which is 200 MB in our experiments.

Fig. 2. Total transcoding time in Hadoop for 5 GB, 10 GB, and 20 GB data sets, for various values of dfs.replication

Fig. 3. Total transcoding time in Hadoop for 5 GB, 10 GB, and 20 GB data sets, for various values of dfs.block.size
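The block size recommendation can be made concrete with a little arithmetic: the sketch below counts how many HDFS blocks a single 200 MB source file occupies for each block size considered. It only illustrates the relation between file size and block size and does not model transcoding time; the function name is hypothetical.

```python
import math

FILE_SIZE_MB = 200  # size of each original video file in the experiments


def blocks_per_file(file_size_mb, block_size_mb):
    """Number of HDFS blocks needed to store one file of the given size."""
    return math.ceil(file_size_mb / block_size_mb)


if __name__ == "__main__":
    # Block sizes considered for dfs.block.size in the experiments.
    for block_size_mb in (32, 64, 128, 256, 512):
        count = blocks_per_file(FILE_SIZE_MB, block_size_mb)
        print(f"{block_size_mb} MB blocks: {count} per file")
```

Only at 256 MB and 512 MB does a whole file fit in a single block, which matches the two settings that performed best.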

The main role of io.file.buffer.size is to set the buffer size that is used to read and write sequence files. In this set of experiments, when the buffer size is set to 128 KB or 256 KB, the performance shows an average improvement of 3% compared with the performance obtained using the default value. However, when the buffer size is set to 512 KB or 1024 KB, the performance improvement varies according to the size of the data sets. The results are shown in Figure 4.

Fig. 4. Total transcoding time in Hadoop for 5 GB, 10 GB, and 20 GB data sets, for various values of io.file.buffer.size

In the fourth set of experiments, we focus on exploring and analyzing different values for mapred.tasktracker.map.tasks.maximum. This option represents the maximum number of map tasks performed simultaneously on a single data node; i.e., if the value is set to 4, four map tasks of the MapReduce job are performed simultaneously on a single data node. Before performing this set of experiments, we expected that the transcoding job performance for a given maximum number of map slots would depend on the number of CPUs in a physical machine, so that a value of 8 would exhibit better performance than other values. In fact, from the experimental results shown in Figure 5, the best transcoding performance is achieved when the value of this option is set to 8 because our system has eight CPUs on each node.

Fig. 5. Total transcoding time in Hadoop for 5 GB, 10 GB, and 20 GB data sets, for various values of mapred.tasktracker.map.tasks.maximum

We exploit the inherent features of Hadoop to alleviate the bookkeeping overhead. In particular, we run multiple map tasks in one JVM by using the mapred.job.reuse.jvm.num.tasks parameter. If the JVM reuse option is enabled by setting the value to -1, there is no limit on the number of times that the same JVM can be reused for map tasks. JVM reuse is disabled by setting the value to 1; a map task can then use a JVM only once. From Figure 6, although better performance is observed with the value -1 than with the value 1, this difference is negligible. Hadoop runs each map task in its own JVM, even when the tasks belong to the same job. The overhead required by each map task to prepare its JVM is approximately 1 sec. Thus, when the JVM reuse option is enabled for many map tasks having a short life cycle, the performance improves. However, the number of map tasks in a MapReduce job that processes the 5 GB video data set is only 20. Hence, this set of experiments with the 5 GB video data set does not demonstrate a large difference in the transcoding performance.

Fig. 6. Total transcoding time in Hadoop for 5 GB, 10 GB, and 20 GB data sets, for various values of mapred.job.reuse.jvm.num.tasks

6. Conclusions

This study aims to determine the optimal values of the tuning parameters in a Hadoop-based distributed video transcoding system by measuring the total transcoding time for various values of five parameters: block size, block replication factor, JVM reuse factor, maximum number of map slots, and buffer size. From the experiments, it is observed that our system exhibits good performance for media transcoding when the block size is greater than or nearly equal to the original file size, and the block replication factor and JVM reuse factor are configured as 3 and -1, respectively. Furthermore, our system delivers good performance for media transcoding when the buffer size is set to 128 KB or 256 KB and the maximum number of map slots is set to a value approximately equal to the number of CPUs in a single data node.

ACKNOWLEDGMENTS

This research was supported by the MSIP (Ministry of Science, ICT & Future Planning), Korea, under the ITRC (Information Technology Research Center) support program (NIPA-2014-H ) supervised by the NIPA (National IT Industry Promotion Agency).

REFERENCES

[1] Ghemawat, S., Gobioff, H., Leung, S.-T. The Google file system, Operating Systems Review (ACM), 37 (2003).
[2] Dean, J., Ghemawat, S. MapReduce: Simplified data processing on large clusters, Communications of the ACM, 51 (2008).
[3] Wang, G., Butt, A.R., Pandey, P., Gupta, K. Using realistic simulation for performance analysis of MapReduce setups, Proceedings of 1st ACM Workshop on Large-scale System and Application Performance.
[4] Shafer, J., Rixner, S., Cox, A.L. The Hadoop Distributed Filesystem: Balancing Portability and Performance, Proceedings of the 2010 IEEE International Symposium on Performance Analysis of Systems and Software (2010).
[5] Polo, J., Carrera, D., Becerra, Y., Torres, J., Ayguade, E., Steinder, M., Whalley, I. Performance-Driven Task Co-Scheduling for MapReduce Environments, Proceedings of 12th IEEE/IFIP Network Operations and Management (2010).
[6] Armbrust, M., Fox, A., Griffith, R., Joseph, A.D., Katz, R., Konwinski, A., Lee, G., Patterson, D., Rabkin, A., Stoica, I., Zaharia, M. A view of cloud computing, Communications of the ACM, 53 (2010).
[7] Xie, J., Yin, S., Ruan, X., Ding, Z., Tian, Y., Majors, J., Manzanares, A., Qin, X. Improving MapReduce Performance through Data Placement in Heterogeneous Hadoop Clusters, Proceedings of 2010 IEEE International Symposium on Parallel and Distributed Processing (2010).
[8] Babu, S. Towards Automatic Optimization of MapReduce Programs, Proceedings of 1st ACM Symposium on Cloud Computing (2010).
[9] Zhang, Q., Cheng, L., Boutaba, R. Cloud Computing: State-of-the-art and research challenges, Journal of Internet Services and Applications, 1(1) (2010), 7-18.
[10] Buyya, R., Yeo, C.S., Venugopal, S., Broberg, J., Brandic, I. Cloud Computing and emerging IT platforms: Vision, hype, and reality for delivering computing as the 5th utility, Future Generation Computer Systems, 25(6) (2009).

METHOD OF A MULTIMEDIA TRANSCODING FOR MULTIPLE MAPREDUCE JOBS IN CLOUD COMPUTING ENVIRONMENT

METHOD OF A MULTIMEDIA TRANSCODING FOR MULTIPLE MAPREDUCE JOBS IN CLOUD COMPUTING ENVIRONMENT METHOD OF A MULTIMEDIA TRANSCODING FOR MULTIPLE MAPREDUCE JOBS IN CLOUD COMPUTING ENVIRONMENT 1 SEUNGHO HAN, 2 MYOUNGJIN KIM, 3 YUN CUI, 4 SEUNGHYUN SEO, 5 SEUNGBUM SEO, 6 HANKU LEE 1,2,3,4,5 Department

More information

Chapter 7. Using Hadoop Cluster and MapReduce

Chapter 7. Using Hadoop Cluster and MapReduce Chapter 7 Using Hadoop Cluster and MapReduce Modeling and Prototyping of RMS for QoS Oriented Grid Page 152 7. Using Hadoop Cluster and MapReduce for Big Data Problems The size of the databases used in

More information

CLOUDDMSS: CLOUD-BASED DISTRIBUTED MULTIMEDIA STREAMING SERVICE SYSTEM FOR HETEROGENEOUS DEVICES

CLOUDDMSS: CLOUD-BASED DISTRIBUTED MULTIMEDIA STREAMING SERVICE SYSTEM FOR HETEROGENEOUS DEVICES CLOUDDMSS: CLOUD-BASED DISTRIBUTED MULTIMEDIA STREAMING SERVICE SYSTEM FOR HETEROGENEOUS DEVICES 1 MYOUNGJIN KIM, 2 CUI YUN, 3 SEUNGHO HAN, 4 HANKU LEE 1,2,3,4 Department of Internet & Multimedia Engineering,

More information

A Hadoop-based Multimedia Transcoding System for Processing Social Media in the PaaS Platform of SMCCSE

A Hadoop-based Multimedia Transcoding System for Processing Social Media in the PaaS Platform of SMCCSE KSII TRANSACTIONS ON INTERNET AND INFORMATION SYSTEMS VOL. 6, NO. 11, Nov 2012 2827 Copyright 2012 KSII A Hadoop-based Multimedia Transcoding System for Processing Social Media in the PaaS Platform of

More information

Hadoop Architecture. Part 1

Hadoop Architecture. Part 1 Hadoop Architecture Part 1 Node, Rack and Cluster: A node is simply a computer, typically non-enterprise, commodity hardware for nodes that contain data. Consider we have Node 1.Then we can add more nodes,

More information

Keywords: Big Data, HDFS, Map Reduce, Hadoop

Keywords: Big Data, HDFS, Map Reduce, Hadoop Volume 5, Issue 7, July 2015 ISSN: 2277 128X International Journal of Advanced Research in Computer Science and Software Engineering Research Paper Available online at: www.ijarcsse.com Configuration Tuning

More information

CSE-E5430 Scalable Cloud Computing Lecture 2

CSE-E5430 Scalable Cloud Computing Lecture 2 CSE-E5430 Scalable Cloud Computing Lecture 2 Keijo Heljanko Department of Computer Science School of Science Aalto University keijo.heljanko@aalto.fi 14.9-2015 1/36 Google MapReduce A scalable batch processing

More information

GraySort and MinuteSort at Yahoo on Hadoop 0.23

GraySort and MinuteSort at Yahoo on Hadoop 0.23 GraySort and at Yahoo on Hadoop.23 Thomas Graves Yahoo! May, 213 The Apache Hadoop[1] software library is an open source framework that allows for the distributed processing of large data sets across clusters

More information

Fault Tolerance in Hadoop for Work Migration

Fault Tolerance in Hadoop for Work Migration 1 Fault Tolerance in Hadoop for Work Migration Shivaraman Janakiraman Indiana University Bloomington ABSTRACT Hadoop is a framework that runs applications on large clusters which are built on numerous

More information

Welcome to the unit of Hadoop Fundamentals on Hadoop architecture. I will begin with a terminology review and then cover the major components

Welcome to the unit of Hadoop Fundamentals on Hadoop architecture. I will begin with a terminology review and then cover the major components Welcome to the unit of Hadoop Fundamentals on Hadoop architecture. I will begin with a terminology review and then cover the major components of Hadoop. We will see what types of nodes can exist in a Hadoop

More information

LARGE-SCALE DATA PROCESSING USING MAPREDUCE IN CLOUD COMPUTING ENVIRONMENT

LARGE-SCALE DATA PROCESSING USING MAPREDUCE IN CLOUD COMPUTING ENVIRONMENT LARGE-SCALE DATA PROCESSING USING MAPREDUCE IN CLOUD COMPUTING ENVIRONMENT Samira Daneshyar 1 and Majid Razmjoo 2 1,2 School of Computer Science, Centre of Software Technology and Management (SOFTEM),

More information

Hadoop: A Framework for Data- Intensive Distributed Computing. CS561-Spring 2012 WPI, Mohamed Y. Eltabakh

Hadoop: A Framework for Data- Intensive Distributed Computing. CS561-Spring 2012 WPI, Mohamed Y. Eltabakh 1 Hadoop: A Framework for Data- Intensive Distributed Computing CS561-Spring 2012 WPI, Mohamed Y. Eltabakh 2 What is Hadoop? Hadoop is a software framework for distributed processing of large datasets

More information

Reduction of Data at Namenode in HDFS using harballing Technique

Reduction of Data at Namenode in HDFS using harballing Technique Reduction of Data at Namenode in HDFS using harballing Technique Vaibhav Gopal Korat, Kumar Swamy Pamu vgkorat@gmail.com swamy.uncis@gmail.com Abstract HDFS stands for the Hadoop Distributed File System.

More information

Adaptive Task Scheduling for Multi Job MapReduce

Adaptive Task Scheduling for Multi Job MapReduce Adaptive Task Scheduling for MultiJob MapReduce Environments Jordà Polo, David de Nadal, David Carrera, Yolanda Becerra, Vicenç Beltran, Jordi Torres and Eduard Ayguadé Barcelona Supercomputing Center

More information

Enhancing Dataset Processing in Hadoop YARN Performance for Big Data Applications

Enhancing Dataset Processing in Hadoop YARN Performance for Big Data Applications Enhancing Dataset Processing in Hadoop YARN Performance for Big Data Applications Ahmed Abdulhakim Al-Absi, Dae-Ki Kang and Myong-Jong Kim Abstract In Hadoop MapReduce distributed file system, as the input

More information

The Comprehensive Performance Rating for Hadoop Clusters on Cloud Computing Platform

The Comprehensive Performance Rating for Hadoop Clusters on Cloud Computing Platform The Comprehensive Performance Rating for Hadoop Clusters on Cloud Computing Platform Fong-Hao Liu, Ya-Ruei Liou, Hsiang-Fu Lo, Ko-Chin Chang, and Wei-Tsong Lee Abstract Virtualization platform solutions

More information

R.K.Uskenbayeva 1, А.А. Kuandykov 2, Zh.B.Kalpeyeva 3, D.K.Kozhamzharova 4, N.K.Mukhazhanov 5

R.K.Uskenbayeva 1, А.А. Kuandykov 2, Zh.B.Kalpeyeva 3, D.K.Kozhamzharova 4, N.K.Mukhazhanov 5 Distributed data processing in heterogeneous cloud environments R.K.Uskenbayeva 1, А.А. Kuandykov 2, Zh.B.Kalpeyeva 3, D.K.Kozhamzharova 4, N.K.Mukhazhanov 5 1 uskenbaevar@gmail.com, 2 abu.kuandykov@gmail.com,

More information

Data-Intensive Computing with Map-Reduce and Hadoop

Data-Intensive Computing with Map-Reduce and Hadoop Data-Intensive Computing with Map-Reduce and Hadoop Shamil Humbetov Department of Computer Engineering Qafqaz University Baku, Azerbaijan humbetov@gmail.com Abstract Every day, we create 2.5 quintillion

More information

Survey on Scheduling Algorithm in MapReduce Framework

Survey on Scheduling Algorithm in MapReduce Framework Survey on Scheduling Algorithm in MapReduce Framework Pravin P. Nimbalkar 1, Devendra P.Gadekar 2 1,2 Department of Computer Engineering, JSPM s Imperial College of Engineering and Research, Pune, India

More information

Parallel Processing of cluster by Map Reduce

Parallel Processing of cluster by Map Reduce Parallel Processing of cluster by Map Reduce Abstract Madhavi Vaidya, Department of Computer Science Vivekanand College, Chembur, Mumbai vamadhavi04@yahoo.co.in MapReduce is a parallel programming model

More information

Open source Google-style large scale data analysis with Hadoop

Open source Google-style large scale data analysis with Hadoop Open source Google-style large scale data analysis with Hadoop Ioannis Konstantinou Email: ikons@cslab.ece.ntua.gr Web: http://www.cslab.ntua.gr/~ikons Computing Systems Laboratory School of Electrical

More information

Efficient Data Replication Scheme based on Hadoop Distributed File System

Efficient Data Replication Scheme based on Hadoop Distributed File System , pp. 177-186 http://dx.doi.org/10.14257/ijseia.2015.9.12.16 Efficient Data Replication Scheme based on Hadoop Distributed File System Jungha Lee 1, Jaehwa Chung 2 and Daewon Lee 3* 1 Division of Supercomputing,

More information

Prepared By : Manoj Kumar Joshi & Vikas Sawhney

Prepared By : Manoj Kumar Joshi & Vikas Sawhney Prepared By : Manoj Kumar Joshi & Vikas Sawhney General Agenda Introduction to Hadoop Architecture Acknowledgement Thanks to all the authors who left their selfexplanatory images on the internet. Thanks

More information

Dynamic Resource Allocation And Distributed Video Transcoding Using Hadoop Cloud Computing

Dynamic Resource Allocation And Distributed Video Transcoding Using Hadoop Cloud Computing Dynamic Resource Allocation And Distributed Video Transcoding Using Hadoop Cloud Computing Shanthi.B.R 1, Prakash Narayanan.C 2 M.E, Department of CSE, P.S.V College of Engineering and Technology, Krishnagiri,

More information

Mobile Cloud Computing for Data-Intensive Applications
