Efficient Data Replication Scheme based on Hadoop Distributed File System

Jungha Lee 1, Jaehwa Chung 2 and Daewon Lee 3*
1 Division of Supercomputing, Korea Institute of Science and Technology Information, Korea
2 Dept. of Computer Science, Korea National Open University, Korea
3 Division of General Education, Seokyeong University, Korea
* Corresponding Author
ISSN: IJSEIA Copyright © 2015 SERSC

Abstract

The Hadoop distributed file system (HDFS), which is designed to store huge data sets reliably, has been widely used for processing massive-scale data in parallel. In HDFS, the data locality problem is one of the critical problems that degrade the performance of the file system. To solve the data locality problem, we propose an efficient data replication scheme based on access count prediction in a Hadoop framework. From the previous data access counts, the proposed data replication scheme predicts the next access count of data files using Lagrange's interpolation. The proposed scheme then determines the replication factor from the predicted access count, selectively choosing whether to generate a new replica or to use the loaded data as a cache. As a result, the proposed scheme improves data locality. In the performance evaluation, the proposed scheme is compared with the default data replication setting of Hadoop; it reduces the task completion time in the map phase by 8.9% on average. Regarding data locality, the proposed scheme increases node locality by 6.6% and decreases rack and rack-off locality by 38.9% and 56.5%, respectively.

Keywords: Hadoop, Data locality, Access Prediction, Data Replication, Data Placement

1. Introduction

Because large-scale data in a distributed computing environment require an efficient programming model, Google developed MapReduce [1]. Hadoop is an open-source implementation of MapReduce. The Hadoop Distributed File System (HDFS), developed by the Apache Foundation, takes advantage of high-speed computing clusters and storage and provides high performance for big data storage [1-5]. HDFS provides high availability and fault tolerance through mechanisms such as replication. To provide data locality, Hadoop tries to automatically collocate the data with the computing node: it schedules Map tasks on the node, or at least in the rack, where the data reside. This data locality is a principal factor of Hadoop's performance [1, 6]. Under the Hadoop scheduling policy, the data locality problem occurs when the assigned node must load the data block from another node. Data locality in Hadoop is determined by the distance between the data and the assigned node. There are three types of data locality in Hadoop:

(1) Node locality: when the data to be processed are stored in the local storage,

(2) Rack locality: when the data to be processed are not stored in the local storage but on another node within the same rack,
(3) Rack-off locality: when the data to be processed are stored neither in the local storage nor on nodes within the same rack, but on a node in a different rack.

In this paper, we propose an efficient data replication scheme in a Hadoop framework based on access count prediction. The proposed scheme focuses on improving data locality in the map phase and on reducing the data transfer overhead that increases total execution time. It determines efficiently when to increase the replication factor and avoids unnecessary data replication. The contributions of this paper are as follows.

- The proposed scheme optimizes and maintains the replication factor effectively by means of the proposed predictor.
- The proposed scheme minimizes the data transfer load between racks by means of the proposed data replication algorithm.
- The proposed scheme reduces the processing time of MapReduce jobs by improving data locality.

The rest of this paper is organized as follows. In Section 2, we discuss previous work on data locality in a MapReduce framework and introduce the problems of data locality. The proposed efficient data replication scheme based on access count prediction is presented in Section 3. Section 4 shows the performance evaluation of the proposed scheme. Finally, Section 5 gives our conclusions.

2. Related Works

2.1. Previous Works

There are several previous studies of data replication in HDFS. The works in [6-9] use data replication in HDFS to improve fault tolerance; they focus mainly on overcoming unexpected failures. More recently, some studies [10-12] have focused on improving data locality through data replication for efficient execution in Hadoop. In addition, some scheduling studies [13-14] have been proposed to improve data locality. Table 1 shows the features of these data replication schemes and scheduling methods.

Table 1. Previous Data Replication Schemes and Scheduling Methods

Data Replication Scheme
  Scheme                               Technique                          Weakness
  [10] dynamic data replication        access patterns                    removes replicated data
  [11] data placement                  balancing for requirement          depends on application
  [12] data prefetching                prediction by log                  depends on log data
Scheduling Methods
  [13] data locality aware scheduling  waiting time estimation / data transfer time
  [14] delay scheduling                based on delay

2.2. Data Locality Problem

This section describes the data locality problem and the types of data locality in Hadoop. Data locality is related to the distance between the data and the processing node: the closer the data are to the node, the better the data locality. Figure 1 shows the three types of data locality in Hadoop: node locality, rack locality, and rack-off locality.

Figure 1. Example of Data Locality

The data locality problem can be defined as the situation in which a task is scheduled with rack or rack-off locality, which can cause poor performance. The overhead of rack-off locality is greater than that of rack locality. To prevent the data locality problem, we propose an efficient data replication scheme that uses prediction based on the access counts of data files, together with a replica placement algorithm that reduces the cases of rack and rack-off locality.

3. Efficient Data Replication Scheme

Figure 2 shows a diagram of the MapReduce framework. The proposed modules are marked with a red dotted rectangle in HDFS.

Figure 2. Diagram of Efficient Data Replication

3.1. Access Count Prediction

The basic idea is to determine a different replication factor for each data file. More replication does not always guarantee better data locality. However, the probability of node locality increases when there are enough replicas to serve the accesses. To determine an efficient replication factor, a prediction method is needed to forecast the next access count of the data. To this end, the access count, which changes over time, can be expressed as a mathematical formula.

However, because data are accessed randomly, a constant function is not suitable. Therefore, we apply Lagrange's interpolation, which uses a polynomial expression, to extract the predicted access count of the data. The formula is given below:

$$ f(x) = \sum_{i=1}^{N} f_i \prod_{j=1,\, j \neq i}^{N} \frac{x - x_j}{x_i - x_j} \qquad (1) $$

In equation (1), N is the number of points, x_i is the i-th point, and f_i is the function value at x_i. To calculate the predicted access count, x is substituted by the time t at which an access occurred, and the function value by the access count at t. Table 2 shows the proposed access count prediction algorithm. In this algorithm, t_i is the time at which the i-th access is made, avg is the average time interval between accesses, and AccessList[i] is the access count at t_i.

Table 2. Access Count Prediction Algorithm

AccessPrediction(AccessList[ ])
    /* Step 1. Initialization of the variables */
    float sum = 0, Threshold = 0, temp = 0;
    float tempx = 1.0;

    /* Step 2. Calculation of the average time interval */
    for (i = 0; i < n - 1; i++) {    // n is the number of time stamps, so there are n - 1 intervals.
        temp = t_{i+1} - t_i;
        sum = sum + temp;
    }
    avg = sum / (n - 1);

    /* Step 3. Prediction of the next access */
    t_next = t_{n-1} + avg;

    /* Step 4. Calculation of the number of future accesses */
    for (i = 0; i < n; i++) {
        tempx = 1.0;                 // the product is restarted for each term of the sum
        for (j = 0; j < n; j++) {
            if (i != j) {
                temp = (t_next - t_j) / (t_i - t_j);
                tempx = tempx * temp;
            }
        }
        temp = tempx * AccessList[i];
        Threshold = Threshold + temp;
    }
    return Threshold;
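The prediction steps of Table 2 can be sketched in Python as follows. This is a minimal illustration rather than the authors' implementation: the function name `predict_access_count` and its argument names are hypothetical, and it assumes that the time stamps are distinct and that the average interval is taken over the n - 1 gaps between n time stamps.

```python
from typing import Sequence


def predict_access_count(times: Sequence[float], counts: Sequence[float]) -> float:
    """Extrapolate the access count at the predicted time of the next access.

    times[i] is the time of the i-th access and counts[i] the access count
    observed at that time (hypothetical argument names).
    """
    n = len(times)
    if n < 2 or n != len(counts):
        raise ValueError("need at least two (time, count) samples of equal length")

    # Step 2: average interval. The sum of consecutive gaps telescopes to
    # times[-1] - times[0], and n time stamps have n - 1 gaps.
    avg = (times[-1] - times[0]) / (n - 1)

    # Step 3: predicted time of the next access.
    t_next = times[-1] + avg

    # Step 4: evaluate the Lagrange polynomial through all samples at t_next.
    threshold = 0.0
    for i in range(n):
        basis = 1.0  # restart the product for every term of the sum
        for j in range(n):
            if i != j:
                basis *= (t_next - times[j]) / (times[i] - times[j])
        threshold += basis * counts[i]
    return threshold


# Counts growing linearly (1, 2, 3 at times 0, 1, 2) extrapolate to 4 at time 3.
print(predict_access_count([0, 1, 2], [1, 2, 3]))  # 4.0
```

Because the interpolating polynomial passes through every sample, a long or noisy access history can make the extrapolated value swing widely; a practical predictor would probably limit the number of samples it uses.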

3.2. Efficient Data Replication and Replica Placement

This subsection describes the efficient data replication algorithm based on access count prediction. The proposed algorithm compares the current replication factor with the access count obtained by prediction. Furthermore, we present the replica placement algorithm to effectively reduce the number of nodes with rack or rack-off locality. Table 3 shows the efficient data replication algorithm based on access count prediction. F_i means the i-th file, Demand_i means the demand count of the i-th file, and replica_i means the replication factor of the i-th file.

Table 3. Efficient Data Replication Algorithm

AdaptiveDataReplication( )
    /* Step 1. Requesting a task */
    if (# of TaskTracker's idle slots >= 1) {
        request a task to JobTracker;
    }

    /* Step 2. Checking the data locality of tasks */
    if (# of tasks with the node locality >= 1) {
        assign one of them;
    } else if (# of tasks with the rack locality >= 1) {
        assign one of them;
    } else {
        assign a task with the rack-off locality;
    }

    /* Step 3. Increasing the number of accesses */
    if (the assignment in Step 2 is the first assignment for the job) {
        for (i = 0; i < F; i++) {        // F is the number of job files.
            Access_i = Access_i + 1;     // Access_i is the access count of the i-th file.
        }
    }

    /* Step 4. Obtaining the value of Threshold */
    Threshold = AccessPrediction(AccessList);   // AccessList includes from Access_0 to Access_n.

    /* Step 5. Creating the caches or replicas */
    for (i = 0; i < F; i++) {
        if (replica_i >= Threshold) {
            create a cache when the task has no node locality;
        } else {
            create a replica of the corresponding file;
        }
    }

Table 4 shows the replica placement algorithm for data locality. In this algorithm, Rack_i means the i-th rack, Rack_selected means the currently selected rack, Node_inturn means the currently selected node, and Replica_n means the n-th replica. The proposed replica placement algorithm focuses on improving data locality, especially for rack-off locality.
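Steps 2 and 5 of Table 3 can be illustrated with a short Python sketch. All names here (`Locality`, `classify`, `pick_task`, `replication_action`, and the dictionary layout of a task) are hypothetical and chosen only for illustration; the sketch assumes that each pending task lists the nodes holding a replica of its input block and that a node-to-rack mapping is available.

```python
from enum import IntEnum


class Locality(IntEnum):
    NODE = 0      # data stored on the assigned node
    RACK = 1      # data stored on another node in the same rack
    RACK_OFF = 2  # data stored only in a different rack


def classify(replica_nodes, node, rack_of):
    """Return the locality a task would have if it ran on `node`."""
    if node in replica_nodes:
        return Locality.NODE
    if any(rack_of[r] == rack_of[node] for r in replica_nodes):
        return Locality.RACK
    return Locality.RACK_OFF


def pick_task(pending, node, rack_of):
    """Step 2: prefer node locality, then rack locality, then rack-off."""
    return min(pending, key=lambda task: classify(task["replicas"], node, rack_of))


def replication_action(replica_count, threshold, locality):
    """Step 5: cache if the replicas cover the predicted demand, else replicate."""
    if replica_count >= threshold:
        # Enough replicas: only cache the loaded data when the task ran remotely.
        return "cache" if locality != Locality.NODE else "none"
    return "replicate"


rack_of = {"n1": "r1", "n2": "r1", "n3": "r2"}
pending = [{"id": "a", "replicas": {"n3"}}, {"id": "b", "replicas": {"n2"}}]
task = pick_task(pending, "n1", rack_of)
print(task["id"], classify(task["replicas"], "n1", rack_of).name)  # b RACK
```

Ordering the locality levels as integers lets the task choice collapse into a single `min` call, which mirrors the if / else-if / else cascade of Step 2.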

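The placement policy of Table 4 (pick the first rack that does not yet hold the replica, then take nodes of that rack in turn) might look as follows in Python. This is a sketch under stated assumptions, not the authors' code: `place_replica`, `rack_ring`, `node_rings`, and `racks_with_replica` are hypothetical names, the rack search is assumed to always start from the head of the ring, and updating a set stands in for storing the block and registering it with the NameNode.

```python
from collections import deque
from typing import Deque, Dict, Set, Tuple


def place_replica(rack_ring: Deque[str],
                  node_rings: Dict[str, Deque[str]],
                  racks_with_replica: Set[str]) -> Tuple[str, str]:
    """Choose the (rack, node) that receives the next replica."""
    # Step 1: first rack in ring order without the replica; if every rack
    # already has one, fall back to the first rack searched.
    selected = next((r for r in rack_ring if r not in racks_with_replica),
                    rack_ring[0])

    # Step 2: take the node whose turn it is, then rotate the ring so that
    # the next placement on this rack uses the following node.
    nodes = node_rings[selected]
    node = nodes[0]
    nodes.rotate(-1)

    # Step 3: record the placement (stand-in for storing the replica and
    # registering its information with the NameNode).
    racks_with_replica.add(selected)
    return selected, node


racks = deque(["r1", "r2", "r3"])
nodes = {"r1": deque(["a", "b"]), "r2": deque(["c", "d"]), "r3": deque(["e"])}
held = {"r1"}
print(place_replica(racks, nodes, held))  # ('r2', 'c')
print(place_replica(racks, nodes, held))  # ('r3', 'e')
print(place_replica(racks, nodes, held))  # ('r1', 'a')
```

Spreading replicas over racks first is what reduces rack-off locality: once every rack holds a copy, any task can find its data at least within its own rack.
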
Table 4. Replica Placement Algorithm

ReplicaPlacement( )
    /* Step 1. Selection of a rack to store the replica */
    for (Rack_i belonging to the circular linked list of racks) {    // R is the number of racks.
        if (Replica_n does not exist in Rack_i) {
            Rack_selected = Rack_i;
            goto Step 2;
        }
    }
    if (all the racks have Replica_n) {
        Rack_selected = the first rack to be searched;
    }

    /* Step 2. Selection of a node to store the replica */
    select the Node_inturn belonging to the circular linked list of nodes on the Rack_selected;

    /* Step 3. Storing the replica */
    store Replica_n to Node_inturn on the Rack_selected;
    register the information of Replica_n to NameNode;

4. Performance Evaluation

4.1. Evaluation Environment

In the evaluation, the Hadoop cluster is composed of one master node and eight slave nodes running Hadoop. Each node has an Intel Core i5 CPU and 8 GB of RAM. Within a single rack, nodes are connected by Gigabit Ethernet switches, and between racks, Fast Ethernet routers are used. We run the wordcount application with various input data sizes: 1.3GB, 1.9GB, 2.5GB, 3.2GB, and 4.4GB. Based on the logs of a real job trace, we evaluate our efficient data replication scheme against the default setting of Hadoop replication.

Table 5. Configurations of Simulation

Cluster configuration
  Number of master nodes       1
  Number of slave nodes        100
  Number of racks              3
HDFS configuration
  Number of replicas           3
  Block size                   64MB
  File1 size                   3.2GB
  File2 size                   1.9GB
  File3 size                   1.3GB
  File4 size                   2.5GB
  File5 size                   4.4GB
Hadoop configuration
  Scheduler                    Fair scheduler
  Number of concurrent jobs    6
  Number of shared files

4.2. Performance Results

Figure 3 compares the completion time of the map phase between the proposed scheme and the Hadoop default. For 6 jobs, 216 map tasks are spawned. The average completion time of the map phase with the Hadoop default is 90.5 seconds, whereas that of the proposed scheme is 81 seconds, which is reported as a performance improvement of 8.9%.

Figure 3. Comparison of the Completion Time of Map Phase between Proposed Scheme and Hadoop Default

Figure 4 shows the number of map tasks with node, rack, and rack-off locality. Compared with the Hadoop default, the proposed scheme increases node locality by about 4.5% and decreases rack and rack-off locality by about 11.6% and 20.9%, respectively.

Figure 4. Comparison of Data Locality between Proposed Scheme and Hadoop Default

Figures 5 and 6 show the number of map tasks with each type of data locality as the number of slave nodes varies. The largest increase in node locality relative to the Hadoop default occurs when the number of slave nodes is 130. In this case, node locality increases by about 6.6%, and rack and rack-off locality decrease by about 38.9% and 56.5%, respectively.

Figure 5. Comparison of Node Locality with Variation of Nodes

Figure 6. Comparison of Rack/Rack-off Locality with Variation of Nodes

Figure 7 compares the completion time of the map phase as the number of shared files varies. The number of map tasks is 52 for 1 shared file, 83 for 2 shared files, 104 for 3 shared files, 144 for 4 shared files, and 216 for 5 shared files.

Figure 7. Comparison of the Completion Time of Map Phase with Variation of Files
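The map task counts above are consistent with one map task per 64MB block of each shared file. The following Python check is only a sketch under assumptions not stated in the paper: one map task per HDFS block, 1 GB = 1024 MB, and files added in the order File1 to File5 of the simulation configuration. The helper name `map_tasks` is hypothetical.

```python
import math
from itertools import accumulate

BLOCK_MB = 64                               # HDFS block size from the configuration
FILE_SIZES_GB = [3.2, 1.9, 1.3, 2.5, 4.4]   # File1 .. File5


def map_tasks(size_gb: float, block_mb: int = BLOCK_MB) -> int:
    """Number of blocks (assumed equal to map tasks) for a file of size_gb."""
    return math.ceil(size_gb * 1024 / block_mb)


per_file = [map_tasks(size) for size in FILE_SIZES_GB]
cumulative = list(accumulate(per_file))
print(per_file)    # [52, 31, 21, 40, 71]
print(cumulative)  # [52, 83, 104, 144, 215]
```

The first four cumulative values match the reported 52, 83, 104, and 144. The last one comes out as 215 rather than the reported 216, presumably because the file sizes are rounded to one decimal place.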

5. Conclusion

To solve the data locality problem, we proposed an efficient data replication scheme in a Hadoop framework. The proposed scheme aims at improving data locality in the map phase and reducing the total processing time. By predicting the access counts of data files, we optimize the replication factor for each data file. The proposed scheme decides whether to generate a new replica or to use the loaded data as a cache. The performance evaluation shows three major advantages of the proposed scheme. First, it optimizes and maintains the replication factor effectively. Second, it minimizes the data transfer load between racks. Finally, it reduces the processing time of MapReduce jobs.

Acknowledgments

This research was supported by Seokyeong University.

References

[1] J. Dean and S. Ghemawat, "MapReduce: simplified data processing on large clusters", Communications of the ACM, vol. 51, no. 1, (2008), pp.
[2] T. White, "Hadoop: The definitive guide", O'Reilly Media, Inc., (2012).
[3] D. Borthakur, "HDFS architecture guide", HADOOP APACHE PROJECT apache. org/common/docs/current/hdfs design. pdf, (2008).
[4] A. Thomasian and J. Menon, "RAID5 performance with distributed sparing", Parallel and Distributed Systems, IEEE Transactions on, vol. 8, no. 6, (1997), pp.
[5] J. Dean and S. Ghemawat, "MapReduce: simplified data processing on large clusters", Communications of the ACM, vol. 51, no. 1, (2008), pp.
[6] "Hadoop" [Online]. Available:
[7] K. Shvachko, "The hadoop distributed file system", Mass Storage Systems and Technologies (MSST), 2010 IEEE 26th Symposium on. IEEE, (2010).
[8] S. Mahadev, "A survey of distributed file systems", Annual Review of Computer Science, vol. 4, no. 1, (1990), pp.
[9] Q. Wei, "CDRM: A cost-effective dynamic replication management scheme for cloud storage cluster", Cluster Computing (CLUSTER), 2010 IEEE International Conference on. IEEE, (2010).
[10] J. Xiong, "Improving data availability for a cluster file system through replication", Parallel and Distributed Processing, IPDPS IEEE International Symposium on. IEEE, (2008).
[11] Abad, L. Cristina, Y. Lu, and R. H. Campbell, "DARE: Adaptive data replication for efficient cluster scheduling", Cluster Computing (CLUSTER), 2011 IEEE International Conference on. IEEE, (2011).
[12] Khanli, L. Mohammad, A. Isazadeh, and T. N. Shishavan, "PHFS: A dynamic replication method, to decrease access latency in the multi-tier data grid", Future Generation Computer Systems, vol. 27, no. 3, (2011), pp.
[13] S. Seo, "HPMR: Prefetching and pre-shuffling in shared MapReduce computation environment", Cluster Computing and Workshops, CLUSTER'09. IEEE International Conference on. IEEE, (2009).
[14] X. Zhang, "An effective data locality aware task scheduling method for MapReduce framework in heterogeneous environments", Cloud and Service Computing (CSC), 2011 International Conference on. IEEE, (2011).
[15] M. Zaharia, "Delay scheduling: a simple technique for achieving locality and fairness in cluster scheduling", Proceedings of the 5th European conference on Computer systems. ACM, (2010).
[16] T. White, "Hadoop: The definitive guide", O'Reilly Media, Inc., (2012).

Authors

Jungha Lee received her B.E. in Information and Communication Engineering from Seokyeong University, Korea, and her M.S. in Computer Education from Korea University, Korea. Since 2013, she has been a researcher in the Supercomputing Service Center, Korea Institute of Science and Technology Information (KISTI). Her research interests lie in distributed file systems, high throughput computing, and cloud computing.

Jaehwa Chung is an assistant professor in the Dept. of Computer Science at Korea National Open University. He received his M.S. and Ph.D. degrees from the Dept. of Computer Science Education at Korea University, Korea. His research interests include spatial query and index, spatio-temporal databases, mobile data management, location-based services, Spark, WSNs, and mobile data mining.

Daewon Lee received his B.S. in the Division of Electricity and Electronic Engineering from Soonchunhyang University, Asan, ChungNam, Korea. He received his M.E. and Ph.D. degrees in Computer Science Education from Korea University, Seoul, Korea, the former in 2003. He is currently a full-time lecturer in the Division of General Education at Seokyeong University in Korea. His research interests are in IoT, mobile computing, distributed computing, cloud computing, and fault-tolerant systems.


Lecture 5: GFS & HDFS! Claudia Hauff (Web Information Systems)! ti2736b-ewi@tudelft.nl Big Data Processing, 2014/15 Lecture 5: GFS & HDFS!! Claudia Hauff (Web Information Systems)! ti2736b-ewi@tudelft.nl 1 Course content Introduction Data streams 1 & 2 The MapReduce paradigm Looking behind

More information

Enhancing Dataset Processing in Hadoop YARN Performance for Big Data Applications

Enhancing Dataset Processing in Hadoop YARN Performance for Big Data Applications Enhancing Dataset Processing in Hadoop YARN Performance for Big Data Applications Ahmed Abdulhakim Al-Absi, Dae-Ki Kang and Myong-Jong Kim Abstract In Hadoop MapReduce distributed file system, as the input

More information

Detection of Distributed Denial of Service Attack with Hadoop on Live Network

Detection of Distributed Denial of Service Attack with Hadoop on Live Network Detection of Distributed Denial of Service Attack with Hadoop on Live Network Suchita Korad 1, Shubhada Kadam 2, Prajakta Deore 3, Madhuri Jadhav 4, Prof.Rahul Patil 5 Students, Dept. of Computer, PCCOE,

More information

Hadoop Distributed File System. Jordan Prosch, Matt Kipps

Hadoop Distributed File System. Jordan Prosch, Matt Kipps Hadoop Distributed File System Jordan Prosch, Matt Kipps Outline - Background - Architecture - Comments & Suggestions Background What is HDFS? Part of Apache Hadoop - distributed storage What is Hadoop?

More information

A Framework for Performance Analysis and Tuning in Hadoop Based Clusters

A Framework for Performance Analysis and Tuning in Hadoop Based Clusters A Framework for Performance Analysis and Tuning in Hadoop Based Clusters Garvit Bansal Anshul Gupta Utkarsh Pyne LNMIIT, Jaipur, India Email: [garvit.bansal anshul.gupta utkarsh.pyne] @lnmiit.ac.in Manish

More information

A Game Theory Based MapReduce Scheduling Algorithm

A Game Theory Based MapReduce Scheduling Algorithm A Game Theory Based MapReduce Scheduling Algorithm Ge Song 1, Lei Yu 2, Zide Meng 3, Xuelian Lin 4 Abstract. A Hadoop MapReduce cluster is an environment where multi-users, multijobs and multi-tasks share

More information

Distributed File System. MCSN N. Tonellotto Complements of Distributed Enabling Platforms

Distributed File System. MCSN N. Tonellotto Complements of Distributed Enabling Platforms Distributed File System 1 How do we get data to the workers? NAS Compute Nodes SAN 2 Distributed File System Don t move data to workers move workers to the data! Store data on the local disks of nodes

More information

A Study on Workload Imbalance Issues in Data Intensive Distributed Computing

A Study on Workload Imbalance Issues in Data Intensive Distributed Computing A Study on Workload Imbalance Issues in Data Intensive Distributed Computing Sven Groot 1, Kazuo Goda 1, and Masaru Kitsuregawa 1 University of Tokyo, 4-6-1 Komaba, Meguro-ku, Tokyo 153-8505, Japan Abstract.

More information

International Journal of Advance Research in Computer Science and Management Studies

International Journal of Advance Research in Computer Science and Management Studies Volume 2, Issue 8, August 2014 ISSN: 2321 7782 (Online) International Journal of Advance Research in Computer Science and Management Studies Research Article / Survey Paper / Case Study Available online

More information

A STUDY ON HADOOP ARCHITECTURE FOR BIG DATA ANALYTICS

A STUDY ON HADOOP ARCHITECTURE FOR BIG DATA ANALYTICS A STUDY ON HADOOP ARCHITECTURE FOR BIG DATA ANALYTICS Dr. Ananthi Sheshasayee 1, J V N Lakshmi 2 1 Head Department of Computer Science & Research, Quaid-E-Millath Govt College for Women, Chennai, (India)

More information

Finding Insights & Hadoop Cluster Performance Analysis over Census Dataset Using Big-Data Analytics

Finding Insights & Hadoop Cluster Performance Analysis over Census Dataset Using Big-Data Analytics Finding Insights & Hadoop Cluster Performance Analysis over Census Dataset Using Big-Data Analytics Dharmendra Agawane 1, Rohit Pawar 2, Pavankumar Purohit 3, Gangadhar Agre 4 Guide: Prof. P B Jawade 2

More information

Big Data: Study in Structured and Unstructured Data

Big Data: Study in Structured and Unstructured Data Big Data: Study in Structured and Unstructured Data Motashim Rasool 1, Wasim Khan 2 mail2motashim@gmail.com, khanwasim051@gmail.com Abstract With the overlay of digital world, Information is available

More information

Hadoop IST 734 SS CHUNG

Hadoop IST 734 SS CHUNG Hadoop IST 734 SS CHUNG Introduction What is Big Data?? Bulk Amount Unstructured Lots of Applications which need to handle huge amount of data (in terms of 500+ TB per day) If a regular machine need to

More information

Query and Analysis of Data on Electric Consumption Based on Hadoop

Query and Analysis of Data on Electric Consumption Based on Hadoop , pp.153-160 http://dx.doi.org/10.14257/ijdta.2016.9.2.17 Query and Analysis of Data on Electric Consumption Based on Hadoop Jianjun 1 Zhou and Yi Wu 2 1 Information Science and Technology in Heilongjiang

More information

MANAGEMENT OF DATA REPLICATION FOR PC CLUSTER BASED CLOUD STORAGE SYSTEM

MANAGEMENT OF DATA REPLICATION FOR PC CLUSTER BASED CLOUD STORAGE SYSTEM MANAGEMENT OF DATA REPLICATION FOR PC CLUSTER BASED CLOUD STORAGE SYSTEM Julia Myint 1 and Thinn Thu Naing 2 1 University of Computer Studies, Yangon, Myanmar juliamyint@gmail.com 2 University of Computer

More information

ISSN:2320-0790. Keywords: HDFS, Replication, Map-Reduce I Introduction:

ISSN:2320-0790. Keywords: HDFS, Replication, Map-Reduce I Introduction: ISSN:2320-0790 Dynamic Data Replication for HPC Analytics Applications in Hadoop Ragupathi T 1, Sujaudeen N 2 1 PG Scholar, Department of CSE, SSN College of Engineering, Chennai, India 2 Assistant Professor,

More information

AptStore: Dynamic Storage Management for Hadoop

AptStore: Dynamic Storage Management for Hadoop AptStore: Dynamic Storage Management for Hadoop Krish K. R., Aleksandr Khasymski, Ali R. Butt, Sameer Tiwari, Milind Bhandarkar Virginia Tech, Greenplum {kris,khasymskia,butta}@cs.vt.edu, {Sameer.Tiwar,

More information

The Recovery System for Hadoop Cluster

The Recovery System for Hadoop Cluster The Recovery System for Hadoop Cluster Prof. Priya Deshpande Dept. of Information Technology MIT College of engineering Pune, India priyardeshpande@gmail.com Darshan Bora Dept. of Information Technology

More information

Multi-level Metadata Management Scheme for Cloud Storage System

Multi-level Metadata Management Scheme for Cloud Storage System , pp.231-240 http://dx.doi.org/10.14257/ijmue.2014.9.1.22 Multi-level Metadata Management Scheme for Cloud Storage System Jin San Kong 1, Min Ja Kim 2, Wan Yeon Lee 3, Chuck Yoo 2 and Young Woong Ko 1

More information

Comparative analysis of mapreduce job by keeping data constant and varying cluster size technique

Comparative analysis of mapreduce job by keeping data constant and varying cluster size technique Comparative analysis of mapreduce job by keeping data constant and varying cluster size technique Mahesh Maurya a, Sunita Mahajan b * a Research Scholar, JJT University, MPSTME, Mumbai, India,maheshkmaurya@yahoo.co.in

More information

Tutorial: Big Data Algorithms and Applications Under Hadoop KUNPENG ZHANG SIDDHARTHA BHATTACHARYYA

Tutorial: Big Data Algorithms and Applications Under Hadoop KUNPENG ZHANG SIDDHARTHA BHATTACHARYYA Tutorial: Big Data Algorithms and Applications Under Hadoop KUNPENG ZHANG SIDDHARTHA BHATTACHARYYA http://kzhang6.people.uic.edu/tutorial/amcis2014.html August 7, 2014 Schedule I. Introduction to big data

More information

Distributed Framework for Data Mining As a Service on Private Cloud

Distributed Framework for Data Mining As a Service on Private Cloud RESEARCH ARTICLE OPEN ACCESS Distributed Framework for Data Mining As a Service on Private Cloud Shraddha Masih *, Sanjay Tanwani** *Research Scholar & Associate Professor, School of Computer Science &

More information

White Paper. Big Data and Hadoop. Abhishek S, Java COE. Cloud Computing Mobile DW-BI-Analytics Microsoft Oracle ERP Java SAP ERP

White Paper. Big Data and Hadoop. Abhishek S, Java COE. Cloud Computing Mobile DW-BI-Analytics Microsoft Oracle ERP Java SAP ERP White Paper Big Data and Hadoop Abhishek S, Java COE www.marlabs.com Cloud Computing Mobile DW-BI-Analytics Microsoft Oracle ERP Java SAP ERP Table of contents Abstract.. 1 Introduction. 2 What is Big

More information

The Dynamic Replication Mechanism of HDFS Hot File based on Cloud Storage

The Dynamic Replication Mechanism of HDFS Hot File based on Cloud Storage , pp.439-448 http://dx.doi.org/10.14257/ijsia.2015.9.8.39 The Dynamic Replication Mechanism of HDFS Hot File based on Cloud Storage Mingyong Li*, Yan Ma and Meilian Chen College of Computer and Information

More information

NLSS: A Near-Line Storage System Design Based on the Combination of HDFS and ZFS

NLSS: A Near-Line Storage System Design Based on the Combination of HDFS and ZFS NLSS: A Near-Line Storage System Design Based on the Combination of HDFS and Wei Hu a, Guangming Liu ab, Yanqing Liu a, Junlong Liu a, Xiaofeng Wang a a College of Computer, National University of Defense

More information

Do You Feel the Lag of Your Hadoop?

Do You Feel the Lag of Your Hadoop? Do You Feel the Lag of Your Hadoop? Yuxuan Jiang, Zhe Huang, and Danny H.K. Tsang Department of Electronic and Computer Engineering The Hong Kong University of Science and Technology, Hong Kong Email:

More information

Evaluating Task Scheduling in Hadoop-based Cloud Systems

Evaluating Task Scheduling in Hadoop-based Cloud Systems 2013 IEEE International Conference on Big Data Evaluating Task Scheduling in Hadoop-based Cloud Systems Shengyuan Liu, Jungang Xu College of Computer and Control Engineering University of Chinese Academy

More information

CSE 590: Special Topics Course ( Supercomputing ) Lecture 10 ( MapReduce& Hadoop)

CSE 590: Special Topics Course ( Supercomputing ) Lecture 10 ( MapReduce& Hadoop) CSE 590: Special Topics Course ( Supercomputing ) Lecture 10 ( MapReduce& Hadoop) Rezaul A. Chowdhury Department of Computer Science SUNY Stony Brook Spring 2016 MapReduce MapReduce is a programming model

More information

Presentation of Multi Level Data Replication Distributed Decision Making Strategy for High Priority Tasks in Real Time Data Grids

Presentation of Multi Level Data Replication Distributed Decision Making Strategy for High Priority Tasks in Real Time Data Grids Presentation of Multi Level Data Replication Distributed Decision Making Strategy for High Priority Tasks in Real Time Data Grids Naghmeh Esmaieli Esmaily.naghmeh@gmail.com Mahdi Jafari Ser_jafari@yahoo.com

More information

CS2510 Computer Operating Systems

CS2510 Computer Operating Systems CS2510 Computer Operating Systems HADOOP Distributed File System Dr. Taieb Znati Computer Science Department University of Pittsburgh Outline HDF Design Issues HDFS Application Profile Block Abstraction

More information
