Efficient Data Replication Scheme based on Hadoop Distributed File System
Efficient Data Replication Scheme based on Hadoop Distributed File System

Jungha Lee 1, Jaehwa Chung 2 and Daewon Lee 3*
1 Division of Supercomputing, Korea Institute of Science and Technology Information, Korea
2 Dept. of Computer Science, Korea National Open University, Korea
3 Division of General Education, Seokyeong University, Korea
1 [email protected], 2 [email protected], 3 [email protected]

Abstract

The Hadoop Distributed File System (HDFS), designed to store huge data sets reliably, has been widely used for processing massive-scale data in parallel. In HDFS, the data locality problem is one of the critical problems that degrade file system performance. To address it, we propose an efficient data replication scheme based on access count prediction in a Hadoop framework. The proposed scheme predicts the next access count of each data file from its previous access counts using Lagrange's interpolation. With the predicted access count, it then determines the replication factor and selectively decides whether to generate a new replica or to use the loaded data as a cache. As a result, the proposed scheme improves data locality. In the performance evaluation, the proposed scheme is compared with the default replication setting of Hadoop and reduces the task completion time in the map phase by 8.9% on average. Regarding data locality, the proposed scheme increases node locality by 6.6% and decreases rack and rack-off locality by 38.9% and 56.5%, respectively.

Keywords: Hadoop, Data locality, Access Prediction, Data Replication, Data Placement

1. Introduction

MapReduce was developed by Google to meet the need for an efficient programming model for large-scale data in distributed computing environments [1]. Hadoop is one of the open-source implementations of MapReduce.
The Hadoop Distributed File System (HDFS), developed by the Apache Foundation, combines the power of high-speed computing clusters with high-performance big data storage [1-5]. HDFS provides high availability and fault tolerance through mechanisms such as replication. To provide data locality, Hadoop tries to automatically collocate data with the computing node: it schedules Map tasks so that the data reside on the same node or within the same rack. This data locality is a principal factor in Hadoop's performance [1, 6]. Under Hadoop's scheduling policy, the data locality problem occurs when the assigned node must load a data block from another node. Data locality in Hadoop refers to the distance between the data and the assigned node. There are three types of data locality in Hadoop:

(1) Node locality: the data for processing are stored in the local storage of the node,

* Corresponding Author
ISSN: IJSEIA Copyright © 2015 SERSC
(2) Rack locality: the data for processing are not stored in the local storage, but on another node within the same rack,

(3) Rack-off locality: the data for processing are stored neither in the local storage nor on a node within the same rack, but on a node in a different rack.

In this paper, we propose an efficient data replication scheme for the Hadoop framework based on access count prediction. The proposed scheme focuses on improving data locality in the map phase and on reducing the data transfer overhead that increases total execution time. It efficiently determines when to increase the replication factor and avoids unnecessary data replication. The contributions of this paper are as follows. The proposed scheme optimizes and maintains the replication factor effectively through the proposed predictor. It minimizes the data transfer load between racks through the proposed replica placement algorithm. It reduces the processing time of MapReduce jobs by improving data locality.

The rest of this paper is organized as follows. In Section 2, we discuss previous work on data locality in the MapReduce framework and introduce the data locality problem. The proposed data replication scheme based on access count prediction is presented in Section 3. Section 4 shows the performance evaluation of the proposed scheme. Finally, Section 5 gives our conclusions.

2. Related Works

2.1. Previous Works

There are several previous studies of data replication in HDFS. To improve fault tolerance, [6-9] use data replication in HDFS; they mainly focus on overcoming unexpected failures. More recently, some studies [10-12] have focused on data replication that improves data locality for efficient execution in Hadoop, and scheduling methods [13-14] have been proposed to improve data locality. Table 1 summarizes the features of these data replication schemes and scheduling methods.

Table 1. Previous Data Replication Schemes and Scheduling Methods

Data replication schemes:
[10] dynamic data replication / based on access patterns / must remove replicated data
[11] data placement / balancing for requirements / depends on the application
[12] data prefetching / prediction by log / depends on log data

Scheduling methods:
[13] data locality aware scheduling / waiting time estimation and data transfer time
[14] delay scheduling / based on delay

2.2. Data Locality Problem

This section describes the data locality problem and the types of data locality in Hadoop. Data locality is related to the distance between the data and the processing node: the shorter the distance, the better the data locality. Figure 1 shows the three types of data locality in Hadoop: node locality, rack locality, and rack-off locality.

Figure 1. Example of Data Locality

The data locality problem can be defined as the situation in which a task is scheduled with rack or rack-off locality, which can cause poor performance. The overhead of rack-off locality is greater than that of rack locality. To prevent the data locality problem, we propose an efficient data replication scheme that predicts the access counts of data files, together with a replica placement algorithm that reduces the cases of rack and rack-off locality.

3. Efficient Data Replication Scheme

The diagram of the MapReduce framework with the proposed modules is shown in Figure 2; the proposed modules are marked with a red dotted rectangle.

Figure 2. Diagram of Efficient Data Replication

3.1. Access Count Prediction

The basic idea of determining replication is to use a different replication factor per data file. Too much replication does not always guarantee better data locality; however, the probability of node locality becomes higher when there are enough replicas to serve the accesses. To determine an efficient replication factor, a prediction method is needed to forecast the next access count of the data. To accomplish this, the change of access counts over time should be expressed as a mathematical formula.
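To make the three locality types concrete, the classification a scheduler performs can be sketched in Python. This is an illustrative helper of our own, not part of Hadoop's API; all names are hypothetical:

```python
def locality_type(task_node, task_rack, replica_locations):
    """Classify the data locality of a task, given the node and rack the
    task was assigned to and a list of (node, rack) pairs describing
    where the block's replicas are stored."""
    replica_nodes = {node for node, _ in replica_locations}
    replica_racks = {rack for _, rack in replica_locations}
    if task_node in replica_nodes:
        return "node"      # block is on the local disk of the assigned node
    if task_rack in replica_racks:
        return "rack"      # block is on another node in the same rack
    return "rack-off"      # block is only available in a different rack
```

For example, a task assigned to node "n3" in rack "r1" with replicas on [("n1", "r1"), ("n5", "r2")] has rack locality: the block is in the same rack but not on the local disk.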
However, because data files are accessed randomly, a constant function is not suitable. Therefore, we apply Lagrange's interpolation with a polynomial expression to extract the predicted access count of the data. The mathematical formula is given below:

f(x) = Σ_{i=1}^{N} f_i · ∏_{j=1, j≠i}^{N} (x − x_j) / (x_i − x_j)    (1)

In equation (1), N is the number of points, x_i is the i-th point, and f_i is the function value at x_i. To calculate the predicted access count, x is substituted by the time t at which an access occurred, and y by the access count at t. Table 2 shows the proposed access count prediction algorithm. In this algorithm, t_i is the time at which the i-th access is made, avg is the average time interval between accesses, and AccessList[i] is the access count at t_i.

Table 2. Access Count Prediction Algorithm

AccessPrediction(AccessList[ ])
/* Step 1. Initialization of the variables */
int sum = 0, temp = 0;
float Threshold = 0.0, tempx;

/* Step 2. Calculation of the average time interval */
for (i = 0; i < n - 1; i++) {      // n is the number of time stamps
    temp = t_{i+1} - t_i;
    sum = sum + temp;
}
avg = sum / (n - 1);

/* Step 3. Prediction of the time of the next access */
t_next = t_{n-1} + avg;

/* Step 4. Calculation of the predicted number of future accesses */
for (i = 0; i < n; i++) {
    tempx = 1.0;
    for (j = 0; j < n; j++) {
        if (i != j) {
            temp = (t_next - t_j) / (t_i - t_j);
            tempx = tempx * temp;
        }
    }
    Threshold = Threshold + tempx * AccessList[i];
}
return Threshold;
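The prediction in Table 2 can be written compactly in Python. This is a sketch under our own naming (timestamps and per-timestamp access counts passed as parallel lists), not the authors' implementation:

```python
def predict_next_access_count(times, counts):
    """Estimate the access count at the next expected access time by
    evaluating the Lagrange interpolating polynomial through the past
    (time, count) samples at t_next = last time + average interval."""
    n = len(times)
    # Average interval between consecutive accesses (Step 2 of Table 2)
    avg = (times[-1] - times[0]) / (n - 1)
    # Predicted time of the next access (Step 3)
    t_next = times[-1] + avg
    # Lagrange interpolation evaluated at t_next (Step 4 / equation (1))
    threshold = 0.0
    for i in range(n):
        basis = 1.0
        for j in range(n):
            if i != j:
                basis *= (t_next - times[j]) / (times[i] - times[j])
        threshold += basis * counts[i]
    return threshold
```

For samples that grow linearly, e.g. times [1, 2, 3] with counts [2, 4, 6], the interpolating polynomial is exactly 2t, so the prediction at t_next = 4 is 8. Note that high-degree polynomial extrapolation can oscillate on noisy data, so in practice one might limit the interpolation window to the most recent samples.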
3.2. Efficient Data Replication and Replica Placement

This subsection describes the efficient data replication algorithm based on access count prediction. The proposed algorithm compares the current replication factor of a file with the replication factor demanded by the predicted access count. Furthermore, we present a replica placement algorithm that effectively reduces the number of tasks with rack or rack-off locality. Table 3 shows the efficient data replication algorithm based on access count prediction. F_i denotes the i-th file, Demand_i the demand count of the i-th file, and replica_i the replication factor of the i-th file.

Table 3. Efficient Data Replication Algorithm

AdaptiveDataReplication( )
/* Step 1. Requesting a task */
if (# of TaskTracker's idle slots >= 1) {
    request a task from the JobTracker;
}

/* Step 2. Checking the data locality of tasks */
if (# of tasks with node locality >= 1) {
    assign one of those tasks;
} else if (# of tasks with rack locality >= 1) {
    assign one of those tasks;
} else {
    assign a task with rack-off locality;
}

/* Step 3. Increasing the number of accesses */
if (the assignment in Step 2 is the first assignment for the job) {
    for (i = 0; i < F; i++) {      // F is the number of job files
        Access_i = Access_i + 1;   // Access_i is the access count of the i-th file
    }
}

/* Step 4. Obtaining the value of Threshold */
Threshold = AccessPrediction(AccessList);  // AccessList includes Access_0 to Access_n

/* Step 5. Creating the caches or replicas */
for (i = 0; i < F; i++) {
    if (replica_i >= Threshold) {
        use the loaded data as a cache when the task has no node locality;
    } else {
        create a replica of the corresponding file;
    }
}

Table 4 shows the replica placement algorithm for data locality. In this algorithm, Rack_i denotes the i-th rack, Rack_selected the currently selected rack, Node_inturn the currently selected node, and Replica_n the n-th replica.
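The cache-or-replicate decision in Step 5 of Table 3 can be sketched in Python as follows. This is a simplified illustration under names of our own choosing, not the paper's exact code:

```python
def replication_action(current_replicas, threshold, has_node_locality):
    """Decide what to do for one file: if the predicted demand (threshold)
    exceeds the current replication factor, add a replica; otherwise keep
    the remotely loaded block only as a temporary cache when the task
    lacks node locality (Step 5 of Table 3)."""
    if current_replicas < threshold:
        return "create_replica"    # demand outgrows supply: add a replica
    if not has_node_locality:
        return "use_cache"         # enough replicas: cache the loaded block
    return "none"                  # local replica exists and supply suffices
```

The point of the distinction is cost: a cache entry serves only the current job, while a new replica permanently raises the chance of node locality for future jobs, at the price of storage and transfer.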
The proposed replica placement algorithm focuses on improving data locality, especially by reducing rack-off locality.
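The rack selection that Table 4 formalizes below amounts to a round-robin scan that prefers racks not yet holding the replica. A minimal sketch, with data structures of our own choosing:

```python
def place_replica(racks, replica_id, rack_contents, next_node_index):
    """Pick a (rack, node) for a new replica: scan racks in order and
    choose the first rack without this replica (Step 1 of Table 4); if
    every rack already holds it, fall back to the first rack scanned.
    Within the chosen rack, pick the next node in round-robin order
    (Step 2).  racks maps rack name -> list of nodes; rack_contents maps
    rack name -> set of replica ids; next_node_index maps rack name ->
    round-robin cursor."""
    order = list(racks)
    chosen = next((r for r in order if replica_id not in rack_contents[r]),
                  order[0])  # fall back to the first rack searched
    nodes = racks[chosen]
    node = nodes[next_node_index[chosen] % len(nodes)]
    next_node_index[chosen] += 1          # advance the round-robin cursor
    rack_contents[chosen].add(replica_id)  # Step 3: record the placement
    return chosen, node
```

Spreading replicas over distinct racks first is what shrinks the rack-off case: a task scheduled in any rack is then likely to find a replica at most one rack-local hop away.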
Table 4. Replica Placement Algorithm

ReplicaPlacement( )
/* Step 1. Selection of a rack to store the replica */
for (each Rack_i in the circular linked list of racks) {   // R is the number of racks
    if (Replica_n does not exist in Rack_i) {
        Rack_selected = Rack_i;
        goto Step 2;
    }
}
if (all the racks have Replica_n) {
    Rack_selected = the first rack to be searched;
}

/* Step 2. Selection of a node to store the replica */
select Node_inturn from the circular linked list of nodes on Rack_selected;

/* Step 3. Storing of the replica */
store Replica_n on Node_inturn on Rack_selected;
register the information of Replica_n with the NameNode;

4. Performance Evaluation

4.1. Evaluation Environment

In the evaluation, the Hadoop cluster consists of one master node and eight slave nodes. Each node has an Intel Core i5 CPU and 8GB of RAM. Within a single rack, nodes are connected by Gigabit Ethernet switches; between racks, Fast Ethernet routers are used. We run the wordcount application with various input data sizes: 1.3GB, 1.9GB, 2.5GB, 3.2GB, and 4.4GB. Based on the logs of a real job trace, we evaluate the proposed data replication scheme against the default replication setting of Hadoop.

Table 5. Configurations of Simulation

Cluster configuration:
  Number of master nodes: 1
  Number of slave nodes: 100
  Number of racks: 3
HDFS configuration:
  Number of replicas: 3
  Block size: 64MB
  File1 size: 3.2GB
  File2 size: 1.9GB
  File3 size: 1.3GB
  File4 size: 2.5GB
  File5 size: 4.4GB
Hadoop configuration:
  Scheduler: Fair scheduler
  Number of concurrent jobs: 6
  Number of shared files
4.2. Performance Results

Figure 3 compares the map phase completion time of the proposed scheme with that of the Hadoop default. For 6 jobs, 216 map tasks are spawned. The average completion time of the map phase with the Hadoop default is 90.5 seconds, whereas that of the proposed scheme is 81 seconds, an 8.9% performance improvement.

Figure 3. Comparison of the Completion Time of Map Phase between Proposed Scheme and Hadoop Default

Figure 4 shows the number of map tasks with node, rack, and rack-off locality. In comparison with the Hadoop default, the proposed scheme increases node locality by about 4.5% and decreases rack and rack-off locality by about 11.6% and 20.9%, respectively.

Figure 4. Comparison of Data Locality between Proposed Scheme and Hadoop Default

Figures 5 and 6 show the number of map tasks with each type of data locality for varying numbers of slave nodes. The largest gain in node locality over the Hadoop default occurs when the number of slave nodes is 130: node locality increases by about 6.6%, and rack and rack-off locality decrease by about 38.9% and 56.5%, respectively.
Figure 5. Comparison of Node Locality with Varying Numbers of Nodes

Figure 6. Comparison of Rack/Rack-off Locality with Varying Numbers of Nodes

Figure 7 compares the completion time of the map phase for varying numbers of shared files. The numbers of map tasks are 52 for 1 shared file, 83 for 2 shared files, 104 for 3 shared files, 144 for 4 shared files, and 216 for 5 shared files.

Figure 7. Comparison of the Completion Time of Map Phase with Varying Numbers of Files
5. Conclusion

To solve the data locality problem, we proposed an efficient data replication scheme for the Hadoop framework. The proposed scheme aims at improving data locality in the map phase and reducing the total processing time. By predicting the access counts of data files, we optimize the replication factor of each data file. The proposed scheme then decides whether to generate a new replica or to use the loaded data as a cache. Through the performance evaluation, we showed three major advantages of the proposed scheme. First, it optimizes and maintains the replication factor effectively. Second, it minimizes the data transfer load between racks. Finally, it reduces the processing time of MapReduce jobs.

Acknowledgments

This research was supported by Seokyeong University.

References

[1] J. Dean and S. Ghemawat, "MapReduce: simplified data processing on large clusters", Communications of the ACM, vol. 51, no. 1, (2008).
[2] T. White, Hadoop: The Definitive Guide, O'Reilly Media, Inc., (2012).
[3] D. Borthakur, "HDFS architecture guide", Hadoop Apache Project, hadoop.apache.org/common/docs/current/hdfs_design.pdf, (2008).
[4] A. Thomasian and J. Menon, "RAID5 performance with distributed sparing", IEEE Transactions on Parallel and Distributed Systems, vol. 8, no. 6, (1997).
[5] J. Dean and S. Ghemawat, "MapReduce: simplified data processing on large clusters", Communications of the ACM, vol. 51, no. 1, (2008).
[6] "Hadoop" [Online]. Available:
[7] K. Shvachko, "The Hadoop distributed file system", 2010 IEEE 26th Symposium on Mass Storage Systems and Technologies (MSST), IEEE, (2010).
[8] S. Mahadev, "A survey of distributed file systems", Annual Review of Computer Science, vol. 4, no. 1, (1990).
[9] Q. Wei, "CDRM: A cost-effective dynamic replication management scheme for cloud storage cluster", 2010 IEEE International Conference on Cluster Computing (CLUSTER), IEEE, (2010).
[10] J. Xiong, "Improving data availability for a cluster file system through replication", 2008 IEEE International Symposium on Parallel and Distributed Processing (IPDPS), IEEE, (2008).
[11] C. L. Abad, Y. Lu and R. H. Campbell, "DARE: Adaptive data replication for efficient cluster scheduling", 2011 IEEE International Conference on Cluster Computing (CLUSTER), IEEE, (2011).
[12] L. M. Khanli, A. Isazadeh and T. N. Shishavan, "PHFS: A dynamic replication method to decrease access latency in the multi-tier data grid", Future Generation Computer Systems, vol. 27, no. 3, (2011).
[13] S. Seo, "HPMR: Prefetching and pre-shuffling in shared MapReduce computation environment", 2009 IEEE International Conference on Cluster Computing and Workshops (CLUSTER'09), IEEE, (2009).
[14] X. Zhang, "An effective data locality aware task scheduling method for MapReduce framework in heterogeneous environments", 2011 International Conference on Cloud and Service Computing (CSC), IEEE, (2011).
[15] M. Zaharia, "Delay scheduling: a simple technique for achieving locality and fairness in cluster scheduling", Proceedings of the 5th European Conference on Computer Systems, ACM, (2010).
[16] T. White, Hadoop: The Definitive Guide, O'Reilly Media, Inc., (2012).

Authors

Jungha Lee received her B.E. in Information and Communication Engineering from Seokyeong University, Korea, and her M.S. in Computer Education from Korea University, Korea. Since 2013 she has been a researcher in the Supercomputing Service Center, Korea Institute of Science and Technology Information (KISTI). Her research interests lie in distributed file systems, high throughput computing, and cloud computing.
Jaehwa Chung is an assistant professor in the Dept. of Computer Science at Korea National Open University. He received his M.S. and Ph.D. degrees from the Dept. of Computer Science Education at Korea University, Korea. His research interests include spatial query and indexing, spatio-temporal databases, mobile data management, location-based services, Spark, WSNs, and mobile data mining.

Daewon Lee received his B.S. in the Division of Electricity and Electronic Engineering from Soonchunhyang University, Asan, ChungNam, Korea. He received his M.E. (2003) and Ph.D. degrees in Computer Science Education from Korea University, Seoul, Korea. He is currently a full-time lecturer in the Division of General Education at Seokyeong University in Korea. His research interests are in IoT, mobile computing, distributed computing, cloud computing, and fault-tolerant systems.
COMPUTER MODELLING & NEW TECHNOLOGIES 214 18(12B) 658-664 Abstract Telecom Data processing and analysis based on Hadoop Guofan Lu, Qingnian Zhang *, Zhao Chen Wuhan University of Technology, Wuhan 4363,China
Comparative analysis of mapreduce job by keeping data constant and varying cluster size technique
Comparative analysis of mapreduce job by keeping data constant and varying cluster size technique Mahesh Maurya a, Sunita Mahajan b * a Research Scholar, JJT University, MPSTME, Mumbai, India,[email protected]
Shareability and Locality Aware Scheduling Algorithm in Hadoop for Mobile Cloud Computing
Shareability and Locality Aware Scheduling Algorithm in Hadoop for Mobile Cloud Computing Hsin-Wen Wei 1,2, Che-Wei Hsu 2, Tin-Yu Wu 3, Wei-Tsong Lee 1 1 Department of Electrical Engineering, Tamkang University
Role of Cloud Computing in Big Data Analytics Using MapReduce Component of Hadoop
Role of Cloud Computing in Big Data Analytics Using MapReduce Component of Hadoop Kanchan A. Khedikar Department of Computer Science & Engineering Walchand Institute of Technoloy, Solapur, Maharashtra,
CS2510 Computer Operating Systems
CS2510 Computer Operating Systems HADOOP Distributed File System Dr. Taieb Znati Computer Science Department University of Pittsburgh Outline HDF Design Issues HDFS Application Profile Block Abstraction
CS2510 Computer Operating Systems
CS2510 Computer Operating Systems HADOOP Distributed File System Dr. Taieb Znati Computer Science Department University of Pittsburgh Outline HDF Design Issues HDFS Application Profile Block Abstraction
Hadoop Distributed File System. T-111.5550 Seminar On Multimedia 2009-11-11 Eero Kurkela
Hadoop Distributed File System T-111.5550 Seminar On Multimedia 2009-11-11 Eero Kurkela Agenda Introduction Flesh and bones of HDFS Architecture Accessing data Data replication strategy Fault tolerance
Analysis and Modeling of MapReduce s Performance on Hadoop YARN
Analysis and Modeling of MapReduce s Performance on Hadoop YARN Qiuyi Tang Dept. of Mathematics and Computer Science Denison University [email protected] Dr. Thomas C. Bressoud Dept. of Mathematics and
Overview. Big Data in Apache Hadoop. - HDFS - MapReduce in Hadoop - YARN. https://hadoop.apache.org. Big Data Management and Analytics
Overview Big Data in Apache Hadoop - HDFS - MapReduce in Hadoop - YARN https://hadoop.apache.org 138 Apache Hadoop - Historical Background - 2003: Google publishes its cluster architecture & DFS (GFS)
White Paper. Big Data and Hadoop. Abhishek S, Java COE. Cloud Computing Mobile DW-BI-Analytics Microsoft Oracle ERP Java SAP ERP
White Paper Big Data and Hadoop Abhishek S, Java COE www.marlabs.com Cloud Computing Mobile DW-BI-Analytics Microsoft Oracle ERP Java SAP ERP Table of contents Abstract.. 1 Introduction. 2 What is Big
Query and Analysis of Data on Electric Consumption Based on Hadoop
, pp.153-160 http://dx.doi.org/10.14257/ijdta.2016.9.2.17 Query and Analysis of Data on Electric Consumption Based on Hadoop Jianjun 1 Zhou and Yi Wu 2 1 Information Science and Technology in Heilongjiang
Cyber Forensic for Hadoop based Cloud System
Cyber Forensic for Hadoop based Cloud System ChaeHo Cho 1, SungHo Chin 2 and * Kwang Sik Chung 3 1 Korea National Open University graduate school Dept. of Computer Science 2 LG Electronics CTO Division
Tutorial: Big Data Algorithms and Applications Under Hadoop KUNPENG ZHANG SIDDHARTHA BHATTACHARYYA
Tutorial: Big Data Algorithms and Applications Under Hadoop KUNPENG ZHANG SIDDHARTHA BHATTACHARYYA http://kzhang6.people.uic.edu/tutorial/amcis2014.html August 7, 2014 Schedule I. Introduction to big data
Applied research on data mining platform for weather forecast based on cloud storage
Applied research on data mining platform for weather forecast based on cloud storage Haiyan Song¹, Leixiao Li 2* and Yuhong Fan 3* 1 Department of Software Engineering t, Inner Mongolia Electronic Information
Cloud Computing based Livestock Monitoring and Disease Forecasting System
, pp.313-320 http://dx.doi.org/10.14257/ijsh.2013.7.6.30 Cloud Computing based Livestock Monitoring and Disease Forecasting System Seokkyun Jeong 1, Hoseok Jeong 2, Haengkon Kim 3 and Hyun Yoe 4 1,2,4
HDFS Space Consolidation
HDFS Space Consolidation Aastha Mehta*,1,2, Deepti Banka*,1,2, Kartheek Muthyala*,1,2, Priya Sehgal 1, Ajay Bakre 1 *Student Authors 1 Advanced Technology Group, NetApp Inc., Bangalore, India 2 Birla Institute
CSE 590: Special Topics Course ( Supercomputing ) Lecture 10 ( MapReduce& Hadoop)
CSE 590: Special Topics Course ( Supercomputing ) Lecture 10 ( MapReduce& Hadoop) Rezaul A. Chowdhury Department of Computer Science SUNY Stony Brook Spring 2016 MapReduce MapReduce is a programming model
What is Big Data? Concepts, Ideas and Principles. Hitesh Dharamdasani
What is Big Data? Concepts, Ideas and Principles Hitesh Dharamdasani # whoami Security Researcher, Malware Reversing Engineer, Developer GIT > George Mason > UC Berkeley > FireEye > On Stage Building Data-driven
Cloud Computing based on the Hadoop Platform
Cloud Computing based on the Hadoop Platform Harshita Pandey 1 UG, Department of Information Technology RKGITW, Ghaziabad ABSTRACT In the recent years,cloud computing has come forth as the new IT paradigm.
Improving Current Hadoop MapReduce Workflow and Performance
Improving Current Hadoop MapReduce Workflow and Performance Hamoud Alshammari Department of Computer Science, CT, USA Jeongkyu Lee Department of Computer Science, CT, USA Hassan Bajwa Department of Electrical
DATA MINING WITH HADOOP AND HIVE Introduction to Architecture
DATA MINING WITH HADOOP AND HIVE Introduction to Architecture Dr. Wlodek Zadrozny (Most slides come from Prof. Akella s class in 2014) 2015-2025. Reproduction or usage prohibited without permission of
Performance Optimization of a Distributed Transcoding System based on Hadoop for Multimedia Streaming Services
RESEARCH ARTICLE Adv. Sci. Lett. 4, 400 407, 2011 Copyright 2011 American Scientific Publishers Advanced Science Letters All rights reserved Vol. 4, 400 407, 2011 Printed in the United States of America
The Big Data Recovery System for Hadoop Cluster
The Big Data Recovery for Hadoop Cluster V. S. Karwande 1, Dr. S. S. Lomte 2, R. A. Auti 3 ME Student, Computer Science and Engineering, EESCOE&T, Aurangabad, India 1 Professor, Computer Science and Engineering,
GraySort on Apache Spark by Databricks
GraySort on Apache Spark by Databricks Reynold Xin, Parviz Deyhim, Ali Ghodsi, Xiangrui Meng, Matei Zaharia Databricks Inc. Apache Spark Sorting in Spark Overview Sorting Within a Partition Range Partitioner
A very short Intro to Hadoop
4 Overview A very short Intro to Hadoop photo by: exfordy, flickr 5 How to Crunch a Petabyte? Lots of disks, spinning all the time Redundancy, since disks die Lots of CPU cores, working all the time Retry,
Big Data Processing with MapReduce for E-Book
Big Data Processing with MapReduce for E-Book Tae Ho Hong 2, Chang Ho Yun 1,2, Jong Won Park 1,2, Hak Geon Lee 2, Hae Sun Jung 1 and Yong Woo Lee 1,2 1 The Ubiquitous (Smart) City Consortium 2 The University
MEASURING PERFORMANCE OF DYNAMIC LOAD BALANCING ALGORITHMS IN DISTRIBUTED COMPUTING APPLICATIONS
MEASURING PERFORMANCE OF DYNAMIC LOAD BALANCING ALGORITHMS IN DISTRIBUTED COMPUTING APPLICATIONS Priyesh Kanungo 1 Professor and Senior Systems Engineer (Computer Centre), School of Computer Science and
Adaptive Load Balancing Method Enabling Auto-Specifying Threshold of Node Load Status for Apache Flume
, pp. 201-210 http://dx.doi.org/10.14257/ijseia.2015.9.2.17 Adaptive Load Balancing Method Enabling Auto-Specifying Threshold of Node Load Status for Apache Flume UnGyu Han and Jinho Ahn Dept. of Comp.
Big Data Storage Options for Hadoop Sam Fineberg, HP Storage
Sam Fineberg, HP Storage SNIA Legal Notice The material contained in this tutorial is copyrighted by the SNIA unless otherwise noted. Member companies and individual members may use this material in presentations
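The abstract states that the scheme predicts a file's next access count from its previous access counts using Lagrange's interpolation, and then uses that prediction to set the replication factor. A minimal sketch of that prediction step is shown below; the function name, the choice of time steps (0, 1, 2, …), and the example values are illustrative assumptions, not taken from the paper.

```python
def lagrange_predict(counts):
    """Predict the next access count by evaluating the Lagrange
    interpolating polynomial through the points (0, c0), (1, c1), ...
    at x = len(counts), i.e. one step past the observed history."""
    n = len(counts)
    x = n  # the next time step
    prediction = 0.0
    for i, ci in enumerate(counts):
        # Build the i-th Lagrange basis term scaled by the i-th count.
        term = float(ci)
        for j in range(n):
            if j != i:
                term *= (x - j) / (i - j)
        prediction += term
    return prediction

# Example: a file accessed 2, 4, then 6 times follows a linear trend,
# so the interpolating polynomial predicts 8 accesses next.
print(lagrange_predict([2, 4, 6]))  # -> 8.0
```

In the scheme described, a predicted count above some threshold would trigger creation of a new replica, while a lower prediction would let the already-loaded data serve as a cache instead; the threshold logic itself is not reproduced here.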
