COST MINIMIZATION OF RUNNING MAPREDUCE ACROSS GEOGRAPHICALLY DISTRIBUTED DATA CENTERS
|
|
|
- Loraine Hines
- 10 years ago
- Views:
Transcription
1 COST MINIMIZATION OF RUNNING MAPREDUCE ACROSS GEOGRAPHICALLY DISTRIBUTED DATA CENTERS Ms. T. Cowsalya PG Scholar, SVS College of Engineering, Coimbatore, Tamilnadu, India Dr. S. Senthamarai Kannan Assistant Professor (SG) SVS College of Engineering, Coimbatore, Tamilnadu, India Abstract Information is increasingly important in our daily lives. We need information when and where it is required. We access the internet everyday to perform searches, to participate in social network, send and receive , share pictures and videos and other applications. Therefore cost minimization for processing data has become an important issue in big data era. Here considered three factors like Data assignment, data placement and data movement which manipulate the operational expenses of data centers. In many scenarios, data are, however, geographically distributed across multiple data centers, and easily moving all data to a single data center before processing it can be expensive. We evaluate possible ways of executing such jobs, and propose Two Dimensional Markov Chain Models that can be used to resolve schedules for job sequences which are optimized either with respect 3. to execution time or financial cost. Keywords Minimization, Mapreduce, Data Centers. I. INTRODUCTION BIG data analysis is one of the key challenges of current era. The restrictions to what able to be done are often times due to how much amount of data can be processed in a given period of time. Big data sets innately occur due to applications generating more information to get better operation, performance; general applications like social networks supports every individual users in producing massive amounts of data. MapReduce is a programming model and an connected implementation for processing and producing massive data sets with a parallel, distributed algorithm on a cluster. A MapReduce program is collected of a Map () method that performs filtering and sorting and a Reduce () method that performs a summary operation. The MapReduce System arrange the processing by running the various tasks in parallel, administrating all data transfers between the variety of parts of the system. It provides redundancy and fault tolerance. Hadoop is an important tool used in Big Data with its two important components as MapReduce and Hadoop Distributed File System. MapReduce is used to process the data whereas Hadoop Distributed File System is used to store the data nodes in the cluster. MapReduce is used to process the data whereas Hadoop Distributed File System is used to store the data nodes in the cluster. A. 1.1 Geo-Distributed data center 1. Data centers that are spread at different geographic regions, e.g., Google s 13 data centers over 8 countries in 4 continents[1].the main objective of this paper is toto place these data chunks in the server.to distribute tasks onto servers. 2. to move data between data centres. In step 1 Volley system [2] is used that chooses the datacenter to place the data chunks into the server. Step 2 chooses the optimized server to process the data that is stored in the data center. In step 3 two dimensional Markov Chain model is used i.e. a data chunk is processed in the first data center and the output is passed as an input to the next data center. Likewise all the data chunks are processed in different geographically distributed data center. Many hard works have been made to minimize the computation or data movement cost of data centers. Data center resizing (DCR) has been projected to minimize the processing cost by adjusting the number of activated servers through task placement [3].Although there are some advantage, it is far from achieving the cost efficient big data processing because of the following weaknesses like data locality may result in waste of resources. 1. The link in network may vary on transmission rate and costs. To overcome above problems, we study the cost minimization problem of big data computation through joint optimization of data assignment, data placement, and data movement in geographically distributed data centers. Servers are equipped with limited storage and computation resources. Our objective is to optimize the big data placement, task assignment, routing Corresponding Author: Ms. T. Cowsalya, SVS College of Engineering, Coimbatore, Tamilnadu, India. 638
2 and DCR such that the overall computation and communication cost is minimized. II. RELATED WORK 2.1 Minimizing Server cost Numerous of large-scale data centers are deployed providing services to large users. As suggested in [11], a data center may contain big servers and guzzle high power. Millions of dollars cost on electricity have cause a serious trouble on the operating cost to data center providers. Therefore, minimizing the electricity cost has established major attention from both academia and industry [4], [] [7]. Data Center Resizing and data placement are generally jointly measured to match the processing requirement. [8] Suggest the best workload control by taking account of latency, energy expenditure and electricity cost. 2.2 Managing Big Data To undertake the challenges of successfully managing big data, many decisions have been proposed to recover the storage and processing cost. The advantage in managing big data is reliable and efficient data placement. [9] use of flexibility in the data placement policy to boost energy efficiency in data centers and propose a scheduling algorithm. In addition, allocation of computer resources to task has also strained much concentration. Cohen et al. [10] developed new design attitude, techniques and knowledge providing a new magnetic, agile and deep data analytics for one of the world s major marketing networks at Fox Audience Network, by using Greenplum parallel database system. 3.2 Data Replication An application can specify the number of replicas of a file at the time it is created, and this number can be changed any time after that. The name node makes all decisions concerning block replication. HDFS uses an intelligent replica placement model for reliability and performance. Optimizing replica placement makes HDFS unique from most other distributed file systems, and is facilitated by a rack-aware replica placement policy that uses network bandwidth efficiently. IV. SYSTEM MODEL 4.1 Network Model Geo Distributed data center topology is considered, in which all the available servers of the same data center (DC) are related to their local switch, while the different data centers will communicate through switches. A set I data centers, and each data center i I consists of a set of servers Ji that are connected to a switch mi M with a local transmission cost of CL. The entire system can be represented as a directed graph G = (N,E). The vertex set N = M U J includes the set of all switches and the set of all servers, and E is the directional edge set. All servers are connected to their local switch. while the switches are connected via inter-data center links. The weight of individual link W (u,v) can be defined as W (u,v) = C R, if u,v M C L, otherwise III. HADOOP DISTRIBUTED FILE SYSTEM The Hadoop Distributed File System (HDFS) a subproject of the Apache Hadoop project is a distributed, highly faulttolerant file system designed to run on low-cost commodity hardware. HDFS provides high-throughput access to application data and is suitable for applications with large data sets. This article explores the primary features of HDFS and provides a high-level view of the HDFS architecture. 3.1 Name Node and Data Node Within HDFS, a given name node manages file system namespace operations like opening, closing, and renaming files and directories. A name node also maps data blocks to data nodes, which handle read and write requests from HDFS clients. Data nodes also create, delete, and replicate data blocks according to instructions from the governing name node. Name Node and Data Node sends message each other to prove their identity D D Fig 1: Data Center Topology 4.2 Task Placement Suggest an automated task placement method Volley for geographically distributed cloud services with the reflection of WAN bandwidth cost, capacity limits of data center and interdependencies between data. Cloud computing services use DC Corresponding Author: Ms. T. Cowsalya, SVS College of Engineering, Coimbatore, Tamilnadu, India. 639
3 Volley by submitting datacenter requests as logs. Volley system analyzes such logs by using an iterative algorithm. 4.3 Task Model Big data tasks targeting on data stored in a distributed file system that is built on geographically distributed data centers. The given data is divided into a set of N chunks. Each chunk n N has the size of Φn(Φn<=1) which are then normalized to the server storage ability. Z-way replica is used that is, for each chunk, Z copies will be stored in the distributed file system for fault-tolerance. the states in the diagonal line except (0,0). In region 3: all remaining states in the central region. VI. PERFORMANCE EVALUATION We analyzed the performance of joint-optimization algorithm and also we compare it with another optimization scheme algorithm ( Non-joint ), which it finds minimum number of servers to be activated and the data routing scheme using the network model. V. CONSTRAINTS ON BIG DATA PROCESSING There are three constraints on big data processing they are constraints on task placement, data loading and QOS constraints..1 Constraints of task Placement A binary variable Yjk is defined to denote whether data chunk n is placed on server j as follows yjk = 1; if chunk k is placed on server j; 0; Otherwise.2 Constraints on remote data loading When a chunk is required by the server there may be a internal or external data transmission. The routing information is done by the switch. All the nodes in the graph can be separated into three categories like source node, intermediate node and a destination node. Source nodes are the servers in which the data chunks are stored. Intermediate nodes receive the data from the source node and forwards to the destination node based on t routing strategies. When the required data chunk is not stored in the destination node, it must receive the data chunks in the rate the request rate for chunk on server and the CPU usage of the chunk on the server..3 Constraints on QOS The processing rate and loading rate for data chunks on the server be µjk and γjn. The processing can be based on Two Dimensional Markov Chain model, where each state (u,v) represents u as pending task and v as available task. Let θjk be the computation resources. The processing rate of the task proportional to processing resource usage and it is represented as µjn = αj.θjn. The given graph is divided into three regions; Region 1: all the states in the first line except the state (0,0). State (p,0),(p>1) transmits the data to its two neighboring nodes. In region 2: all The given data is preprocessed first and the preprocessed data is given into the geographically distributed data center. we consider J = 6 data centers, each of which is with the same number of servers. The data center link communication cost are set as CL = 1 and CR = 4, respectively. The cost Pj on each activated server j is set to 1. The data size, storage requirement, and task arrival rate are all randomly generated. A. 6.1 Performance based on number of servers When the total number of servers increases as shown in fig. 2 the communication cost of both joint and non-joint optimization algorithm will decreases significantly. This is because more tasks and data chunks are placed in the same data center when more servers are placed in the data centers. Hence the communication cost greatly reduced. Server Number of Servers Fig.2. server 6 Corresponding Author: Ms. T. Cowsalya, SVS College of Engineering, Coimbatore, Tamilnadu, India. 640
4 10 QoS. Therefore, the server costs of both algorithms decrease as the delay constraint Comm Communicati on cost cost Number of Servers Data size Fig.3. Communication Fig.. Communication More tasks and their corresponding data chunks can be placed in the same data center, or even in the same server. Further increasing the number of servers will not affect the distributions of tasks or data chunks any more. 6.2 Performance Based on Data Size Fig. 4 shows the cost as a function of the total data lump size from 8.4 to 19. Larger chunk size leads to activating new servers with increased server cost. At the same time, fig. shows more resulting traffic over the links creates higher communication cost. 0 Server cost 4 40 VII FAULT TOLERANT Here fault tolerant plays an important role. Data loss and node failure can be easily identified using heartbeat messages. A heartbeat message is a message sent from an originator to a destination that enables the destination to identify if and when the originator fails or is no longer available. Heartbeat messages are typically sent non-stop on a periodic or recurring basis from the originator's start-up until the originator's shutdown. When the destination identifies a lack of heartbeat messages during an anticipated arrival period, the destination may determine that the originator has failed, shutdown, or is generally no longer available. Heartbeat messages may be used for high-availability and fault tolerance purposes 3 VIII CONCLUSION Data size Fig.4. Server The data chunks are placed in multiple servers and each data chunk is processed separately. when the delay requirement is very small, more servers will be activated to guarantee the In this paper, we jointly study the data placement, data Assignment and data center resizing and data movement to minimize the operational cost in geographically distributed data centers. We first describe about the data processing process using a two-dimensional Markov chain and derive the expected completion time in closed-form. In future the nodes in the cluster can be arranged in hierarchal order by using tree data structure. It reduces the processing time and increases performance. Corresponding Author: Ms. T. Cowsalya, SVS College of Engineering, Coimbatore, Tamilnadu, India. 641
5 References [1] Data Center Locations, centers/inside/locations/index.html. [2] S. Agarwal, J. Dunagan, N. Jain, S. Saroiu, A. Wolman, and H. Bhogan, Volley: Automated Data Placement for Geo-Distributed Cloud Services, in The 7th USENIX Symposium on Networked Systems Design and Implementation (NSDI), 2010, pp [3] L. Rao, X. Liu, L. Xie, and W. Liu, Minimizing Electricity : Optimization of Distributed Internet Data Centers in a Multi- Electricity-Market Environment, in Proceedings of the 29th International Conference on Computer Communications (INFOCOM). IEEE, 2010, pp [4] R. Urgaonkar, B. Urgaonkar, M. J. Neely, and A. Sivasubramaniam, Optimal Power Management Using Stored Energy in Data Centers, in Proceedings of International Conference on Measurement and Modeling of Computer Systems (SIGMETRICS). ACM, 2011, pp [] X. Fan, W.-D. Weber, and L. A. Barroso, Power Provisioning for A Warehouse-sized Computer, in Proceedings of the 34th Annual International Symposium on Computer Architecture (ISCA). ACM, 2007, pp [6] S. Govindan, A. Sivasubramaniam, and B. Urgaonkar, Benefits and Limitations of Tapping Into Stored Energy for Datacenters, in Proceedings of the 38th Annual International Symposium on Computer Architecture (ISCA). ACM, 2011, pp [7] P. X. Gao, A. R. Curtis, B. Wong, and S. Keshav, It s Not Easy Being Green, in Proceedings of the ACM Special Interest Group on Data Communication (SIGCOMM). ACM, 2012, pp [8] S. A. Yazd, S. Venkatesan, and N. Mittal, Boosting energy efficiency with mirrored data block replication policy and energy scheduler, SIGOPS Oper. Syst. Rev., vol. 47, no. 2, pp , [9] J. Cohen, B. Dolan, M. Dunlap, J. M. Hellerstein, and C. Welton, Mad skills: new analysis practices for big data, Proc. VLDB Endow., vol. 2, no. 2, pp , [10] R. Kaushik and K. Nahrstedt, T*: A data-centric cooling energy costs reduction approach for Big Data analytics cloud, in 2012 International Conference for High Performance Computing, Networking, Storage and Analysis (SC), 2012, pp Corresponding Author: Ms. T. Cowsalya, SVS College of Engineering, Coimbatore, Tamilnadu, India. 642
CURTAIL THE EXPENDITURE OF BIG DATA PROCESSING USING MIXED INTEGER NON-LINEAR PROGRAMMING
Journal homepage: http://www.journalijar.com INTERNATIONAL JOURNAL OF ADVANCED RESEARCH RESEARCH ARTICLE CURTAIL THE EXPENDITURE OF BIG DATA PROCESSING USING MIXED INTEGER NON-LINEAR PROGRAMMING R.Kohila
How To Minimize Cost Of Big Data Processing
1 Cost Minimization for Big Data Processing in Geo-Distributed Data Centers Lin Gu, Student Member, IEEE, Deze Zeng, Member, IEEE, Peng Li, Member, IEEE and Song Guo, Senior Member, IEEE Abstract The explosive
Cost and Energy optimization for Big Data Processing in Geo- Spread Data Centers
Cost and Energy optimization for Big Data Processing in Geo- Spread Data Centers Abstract M. Nithya #1, E. Indra *1 Mailam Engineering College, Mailam #1, *1 [email protected] #1 The high volume
Enhancing MapReduce Functionality for Optimizing Workloads on Data Centers
Available Online at www.ijcsmc.com International Journal of Computer Science and Mobile Computing A Monthly Journal of Computer Science and Information Technology IJCSMC, Vol. 2, Issue. 10, October 2013,
ROUTING ALGORITHM BASED COST MINIMIZATION FOR BIG DATA PROCESSING
ROUTING ALGORITHM BASED COST MINIMIZATION FOR BIG DATA PROCESSING D.Vinotha,PG Scholar,Department of CSE,RVS Technical Campus,[email protected] Dr.Y.Baby Kalpana, Head of the Department, Department
Big Data Processing of Data Services in Geo Distributed Data Centers Using Cost Minimization Implementation
Big Data Processing of Data Services in Geo Distributed Data Centers Using Cost Minimization Implementation A. Dhineshkumar, M.Sakthivel Final Year MCA Student, VelTech HighTech Engineering College, Chennai,
A REVIEW PAPER ON THE HADOOP DISTRIBUTED FILE SYSTEM
A REVIEW PAPER ON THE HADOOP DISTRIBUTED FILE SYSTEM Sneha D.Borkar 1, Prof.Chaitali S.Surtakar 2 Student of B.E., Information Technology, J.D.I.E.T, [email protected] Assistant Professor, Information
A STUDY ON HADOOP ARCHITECTURE FOR BIG DATA ANALYTICS
A STUDY ON HADOOP ARCHITECTURE FOR BIG DATA ANALYTICS Dr. Ananthi Sheshasayee 1, J V N Lakshmi 2 1 Head Department of Computer Science & Research, Quaid-E-Millath Govt College for Women, Chennai, (India)
Hadoop Architecture. Part 1
Hadoop Architecture Part 1 Node, Rack and Cluster: A node is simply a computer, typically non-enterprise, commodity hardware for nodes that contain data. Consider we have Node 1.Then we can add more nodes,
Hadoop Technology for Flow Analysis of the Internet Traffic
Hadoop Technology for Flow Analysis of the Internet Traffic Rakshitha Kiran P PG Scholar, Dept. of C.S, Shree Devi Institute of Technology, Mangalore, Karnataka, India ABSTRACT: Flow analysis of the internet
Fault Tolerance in Hadoop for Work Migration
1 Fault Tolerance in Hadoop for Work Migration Shivaraman Janakiraman Indiana University Bloomington ABSTRACT Hadoop is a framework that runs applications on large clusters which are built on numerous
ISSN:2320-0790. Keywords: HDFS, Replication, Map-Reduce I Introduction:
ISSN:2320-0790 Dynamic Data Replication for HPC Analytics Applications in Hadoop Ragupathi T 1, Sujaudeen N 2 1 PG Scholar, Department of CSE, SSN College of Engineering, Chennai, India 2 Assistant Professor,
Distributed File Systems
Distributed File Systems Mauro Fruet University of Trento - Italy 2011/12/19 Mauro Fruet (UniTN) Distributed File Systems 2011/12/19 1 / 39 Outline 1 Distributed File Systems 2 The Google File System (GFS)
INTERNATIONAL JOURNAL OF PURE AND APPLIED RESEARCH IN ENGINEERING AND TECHNOLOGY
INTERNATIONAL JOURNAL OF PURE AND APPLIED RESEARCH IN ENGINEERING AND TECHNOLOGY A PATH FOR HORIZING YOUR INNOVATIVE WORK A COMPREHENSIVE VIEW OF HADOOP ER. AMRINDER KAUR Assistant Professor, Department
CS2510 Computer Operating Systems
CS2510 Computer Operating Systems HADOOP Distributed File System Dr. Taieb Znati Computer Science Department University of Pittsburgh Outline HDF Design Issues HDFS Application Profile Block Abstraction
CS2510 Computer Operating Systems
CS2510 Computer Operating Systems HADOOP Distributed File System Dr. Taieb Znati Computer Science Department University of Pittsburgh Outline HDF Design Issues HDFS Application Profile Block Abstraction
ENERGY EFFICIENT AND REDUCTION OF POWER COST IN GEOGRAPHICALLY DISTRIBUTED DATA CARDS
ENERGY EFFICIENT AND REDUCTION OF POWER COST IN GEOGRAPHICALLY DISTRIBUTED DATA CARDS M Vishnu Kumar 1, E Vanitha 2 1 PG Student, 2 Assistant Professor, Department of Computer Science and Engineering,
IMPROVED FAIR SCHEDULING ALGORITHM FOR TASKTRACKER IN HADOOP MAP-REDUCE
IMPROVED FAIR SCHEDULING ALGORITHM FOR TASKTRACKER IN HADOOP MAP-REDUCE Mr. Santhosh S 1, Mr. Hemanth Kumar G 2 1 PG Scholor, 2 Asst. Professor, Dept. Of Computer Science & Engg, NMAMIT, (India) ABSTRACT
Energy Constrained Resource Scheduling for Cloud Environment
Energy Constrained Resource Scheduling for Cloud Environment 1 R.Selvi, 2 S.Russia, 3 V.K.Anitha 1 2 nd Year M.E.(Software Engineering), 2 Assistant Professor Department of IT KSR Institute for Engineering
QUALITY OF SERVICE METRICS FOR DATA TRANSMISSION IN MESH TOPOLOGIES
QUALITY OF SERVICE METRICS FOR DATA TRANSMISSION IN MESH TOPOLOGIES SWATHI NANDURI * ZAHOOR-UL-HUQ * Master of Technology, Associate Professor, G. Pulla Reddy Engineering College, G. Pulla Reddy Engineering
Apache Hadoop. Alexandru Costan
1 Apache Hadoop Alexandru Costan Big Data Landscape No one-size-fits-all solution: SQL, NoSQL, MapReduce, No standard, except Hadoop 2 Outline What is Hadoop? Who uses it? Architecture HDFS MapReduce Open
International Journal of Advance Research in Computer Science and Management Studies
Volume 2, Issue 8, August 2014 ISSN: 2321 7782 (Online) International Journal of Advance Research in Computer Science and Management Studies Research Article / Survey Paper / Case Study Available online
An Efficient Approach for Cost Optimization of the Movement of Big Data
2015 by the authors; licensee RonPub, Lübeck, Germany. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution license (http://creativecommons.org/licenses/by/3.0/).
Distributed Framework for Data Mining As a Service on Private Cloud
RESEARCH ARTICLE OPEN ACCESS Distributed Framework for Data Mining As a Service on Private Cloud Shraddha Masih *, Sanjay Tanwani** *Research Scholar & Associate Professor, School of Computer Science &
ADAPTIVE LOAD BALANCING ALGORITHM USING MODIFIED RESOURCE ALLOCATION STRATEGIES ON INFRASTRUCTURE AS A SERVICE CLOUD SYSTEMS
ADAPTIVE LOAD BALANCING ALGORITHM USING MODIFIED RESOURCE ALLOCATION STRATEGIES ON INFRASTRUCTURE AS A SERVICE CLOUD SYSTEMS Lavanya M., Sahana V., Swathi Rekha K. and Vaithiyanathan V. School of Computing,
Indian Journal of Science The International Journal for Science ISSN 2319 7730 EISSN 2319 7749 2016 Discovery Publication. All Rights Reserved
Indian Journal of Science The International Journal for Science ISSN 2319 7730 EISSN 2319 7749 2016 Discovery Publication. All Rights Reserved Perspective Big Data Framework for Healthcare using Hadoop
INTERNATIONAL JOURNAL OF PURE AND APPLIED RESEARCH IN ENGINEERING AND TECHNOLOGY
INTERNATIONAL JOURNAL OF PURE AND APPLIED RESEARCH IN ENGINEERING AND TECHNOLOGY A PATH FOR HORIZING YOUR INNOVATIVE WORK A REVIEW ON HIGH PERFORMANCE DATA STORAGE ARCHITECTURE OF BIGDATA USING HDFS MS.
Log Mining Based on Hadoop s Map and Reduce Technique
Log Mining Based on Hadoop s Map and Reduce Technique ABSTRACT: Anuja Pandit Department of Computer Science, [email protected] Amruta Deshpande Department of Computer Science, [email protected]
Volume 3, Issue 6, June 2015 International Journal of Advance Research in Computer Science and Management Studies
Volume 3, Issue 6, June 2015 International Journal of Advance Research in Computer Science and Management Studies Research Article / Survey Paper / Case Study Available online at: www.ijarcsms.com Image
International Journal of Innovative Research in Computer and Communication Engineering
FP Tree Algorithm and Approaches in Big Data T.Rathika 1, J.Senthil Murugan 2 Assistant Professor, Department of CSE, SRM University, Ramapuram Campus, Chennai, Tamil Nadu,India 1 Assistant Professor,
Journal of science STUDY ON REPLICA MANAGEMENT AND HIGH AVAILABILITY IN HADOOP DISTRIBUTED FILE SYSTEM (HDFS)
Journal of science e ISSN 2277-3290 Print ISSN 2277-3282 Information Technology www.journalofscience.net STUDY ON REPLICA MANAGEMENT AND HIGH AVAILABILITY IN HADOOP DISTRIBUTED FILE SYSTEM (HDFS) S. Chandra
Enhancing Map-Reduce Job Execution On Geo-Distributed Data Across Datacenters
Enhancing Map-Reduce Job Execution On Geo-Distributed Data Across Datacenters P. JEBILLA 1, P. JAYASHREE 2 1,2 Department of Computer Science and Engineering, Anna University- MIT Campus, Chennai, India
Lecture 5: GFS & HDFS! Claudia Hauff (Web Information Systems)! [email protected]
Big Data Processing, 2014/15 Lecture 5: GFS & HDFS!! Claudia Hauff (Web Information Systems)! [email protected] 1 Course content Introduction Data streams 1 & 2 The MapReduce paradigm Looking behind
Developing Scalable Smart Grid Infrastructure to Enable Secure Transmission System Control
Developing Scalable Smart Grid Infrastructure to Enable Secure Transmission System Control EP/K006487/1 UK PI: Prof Gareth Taylor (BU) China PI: Prof Yong-Hua Song (THU) Consortium UK Members: Brunel University
CSE-E5430 Scalable Cloud Computing Lecture 2
CSE-E5430 Scalable Cloud Computing Lecture 2 Keijo Heljanko Department of Computer Science School of Science Aalto University [email protected] 14.9-2015 1/36 Google MapReduce A scalable batch processing
Welcome to the unit of Hadoop Fundamentals on Hadoop architecture. I will begin with a terminology review and then cover the major components
Welcome to the unit of Hadoop Fundamentals on Hadoop architecture. I will begin with a terminology review and then cover the major components of Hadoop. We will see what types of nodes can exist in a Hadoop
A Study on Workload Imbalance Issues in Data Intensive Distributed Computing
A Study on Workload Imbalance Issues in Data Intensive Distributed Computing Sven Groot 1, Kazuo Goda 1, and Masaru Kitsuregawa 1 University of Tokyo, 4-6-1 Komaba, Meguro-ku, Tokyo 153-8505, Japan Abstract.
CLOUDDMSS: CLOUD-BASED DISTRIBUTED MULTIMEDIA STREAMING SERVICE SYSTEM FOR HETEROGENEOUS DEVICES
CLOUDDMSS: CLOUD-BASED DISTRIBUTED MULTIMEDIA STREAMING SERVICE SYSTEM FOR HETEROGENEOUS DEVICES 1 MYOUNGJIN KIM, 2 CUI YUN, 3 SEUNGHO HAN, 4 HANKU LEE 1,2,3,4 Department of Internet & Multimedia Engineering,
Survey on Load Rebalancing for Distributed File System in Cloud
Survey on Load Rebalancing for Distributed File System in Cloud Prof. Pranalini S. Ketkar Ankita Bhimrao Patkure IT Department, DCOER, PG Scholar, Computer Department DCOER, Pune University Pune university
Networking in the Hadoop Cluster
Hadoop and other distributed systems are increasingly the solution of choice for next generation data volumes. A high capacity, any to any, easily manageable networking layer is critical for peak Hadoop
A Catechistic Method for Traffic Pattern Discovery in MANET
A Catechistic Method for Traffic Pattern Discovery in MANET R. Saranya 1, R. Santhosh 2 1 PG Scholar, Computer Science and Engineering, Karpagam University, Coimbatore. 2 Assistant Professor, Computer
Index Terms : Load rebalance, distributed file systems, clouds, movement cost, load imbalance, chunk.
Load Rebalancing for Distributed File Systems in Clouds. Smita Salunkhe, S. S. Sannakki Department of Computer Science and Engineering KLS Gogte Institute of Technology, Belgaum, Karnataka, India Affiliated
Packet Flow Analysis and Congestion Control of Big Data by Hadoop
Available Online at www.ijcsmc.com International Journal of Computer Science and Mobile Computing A Monthly Journal of Computer Science and Information Technology IJCSMC, Vol. 4, Issue. 6, June 2015, pg.456
Managing Cloud Server with Big Data for Small, Medium Enterprises: Issues and Challenges
Managing Cloud Server with Big Data for Small, Medium Enterprises: Issues and Challenges Prerita Gupta Research Scholar, DAV College, Chandigarh Dr. Harmunish Taneja Department of Computer Science and
Data Refinery with Big Data Aspects
International Journal of Information and Computation Technology. ISSN 0974-2239 Volume 3, Number 7 (2013), pp. 655-662 International Research Publications House http://www. irphouse.com /ijict.htm Data
A SURVEY ON MAPREDUCE IN CLOUD COMPUTING
A SURVEY ON MAPREDUCE IN CLOUD COMPUTING Dr.M.Newlin Rajkumar 1, S.Balachandar 2, Dr.V.Venkatesakumar 3, T.Mahadevan 4 1 Asst. Prof, Dept. of CSE,Anna University Regional Centre, Coimbatore, [email protected]
Efficient Data Replication Scheme based on Hadoop Distributed File System
, pp. 177-186 http://dx.doi.org/10.14257/ijseia.2015.9.12.16 Efficient Data Replication Scheme based on Hadoop Distributed File System Jungha Lee 1, Jaehwa Chung 2 and Daewon Lee 3* 1 Division of Supercomputing,
Hadoop. MPDL-Frühstück 9. Dezember 2013 MPDL INTERN
Hadoop MPDL-Frühstück 9. Dezember 2013 MPDL INTERN Understanding Hadoop Understanding Hadoop What's Hadoop about? Apache Hadoop project (started 2008) downloadable open-source software library (current
Energy Efficient MapReduce
Energy Efficient MapReduce Motivation: Energy consumption is an important aspect of datacenters efficiency, the total power consumption in the united states has doubled from 2000 to 2005, representing
Load Balancing Mechanisms in Data Center Networks
Load Balancing Mechanisms in Data Center Networks Santosh Mahapatra Xin Yuan Department of Computer Science, Florida State University, Tallahassee, FL 33 {mahapatr,xyuan}@cs.fsu.edu Abstract We consider
Snapshots in Hadoop Distributed File System
Snapshots in Hadoop Distributed File System Sameer Agarwal UC Berkeley Dhruba Borthakur Facebook Inc. Ion Stoica UC Berkeley Abstract The ability to take snapshots is an essential functionality of any
Hadoop Distributed File System. Dhruba Borthakur Apache Hadoop Project Management Committee [email protected] June 3 rd, 2008
Hadoop Distributed File System Dhruba Borthakur Apache Hadoop Project Management Committee [email protected] June 3 rd, 2008 Who Am I? Hadoop Developer Core contributor since Hadoop s infancy Focussed
High Performance Cluster Support for NLB on Window
High Performance Cluster Support for NLB on Window [1]Arvind Rathi, [2] Kirti, [3] Neelam [1]M.Tech Student, Department of CSE, GITM, Gurgaon Haryana (India) [email protected] [2]Asst. Professor,
The Comprehensive Performance Rating for Hadoop Clusters on Cloud Computing Platform
The Comprehensive Performance Rating for Hadoop Clusters on Cloud Computing Platform Fong-Hao Liu, Ya-Ruei Liou, Hsiang-Fu Lo, Ko-Chin Chang, and Wei-Tsong Lee Abstract Virtualization platform solutions
ISSN: 2320-1363 CONTEXTUAL ADVERTISEMENT MINING BASED ON BIG DATA ANALYTICS
CONTEXTUAL ADVERTISEMENT MINING BASED ON BIG DATA ANALYTICS A.Divya *1, A.M.Saravanan *2, I. Anette Regina *3 MPhil, Research Scholar, Muthurangam Govt. Arts College, Vellore, Tamilnadu, India Assistant
Resource Scalability for Efficient Parallel Processing in Cloud
Resource Scalability for Efficient Parallel Processing in Cloud ABSTRACT Govinda.K #1, Abirami.M #2, Divya Mercy Silva.J #3 #1 SCSE, VIT University #2 SITE, VIT University #3 SITE, VIT University In the
Multi-level Metadata Management Scheme for Cloud Storage System
, pp.231-240 http://dx.doi.org/10.14257/ijmue.2014.9.1.22 Multi-level Metadata Management Scheme for Cloud Storage System Jin San Kong 1, Min Ja Kim 2, Wan Yeon Lee 3, Chuck Yoo 2 and Young Woong Ko 1
THE HADOOP DISTRIBUTED FILE SYSTEM
THE HADOOP DISTRIBUTED FILE SYSTEM Konstantin Shvachko, Hairong Kuang, Sanjay Radia, Robert Chansler Presented by Alexander Pokluda October 7, 2013 Outline Motivation and Overview of Hadoop Architecture,
marlabs driving digital agility WHITEPAPER Big Data and Hadoop
marlabs driving digital agility WHITEPAPER Big Data and Hadoop Abstract This paper explains the significance of Hadoop, an emerging yet rapidly growing technology. The prime goal of this paper is to unveil
Big Data Storage Architecture Design in Cloud Computing
Big Data Storage Architecture Design in Cloud Computing Xuebin Chen 1, Shi Wang 1( ), Yanyan Dong 1, and Xu Wang 2 1 College of Science, North China University of Science and Technology, Tangshan, Hebei,
MINIMIZING STORAGE COST IN CLOUD COMPUTING ENVIRONMENT
MINIMIZING STORAGE COST IN CLOUD COMPUTING ENVIRONMENT 1 SARIKA K B, 2 S SUBASREE 1 Department of Computer Science, Nehru College of Engineering and Research Centre, Thrissur, Kerala 2 Professor and Head,
Hadoop & its Usage at Facebook
Hadoop & its Usage at Facebook Dhruba Borthakur Project Lead, Hadoop Distributed File System [email protected] Presented at the The Israeli Association of Grid Technologies July 15, 2009 Outline Architecture
Suresh Lakavath csir urdip Pune, India [email protected].
A Big Data Hadoop Architecture for Online Analysis. Suresh Lakavath csir urdip Pune, India [email protected]. Ramlal Naik L Acme Tele Power LTD Haryana, India [email protected]. Abstract Big Data
International Journal of Scientific & Engineering Research, Volume 4, Issue 11, November-2013 349 ISSN 2229-5518
International Journal of Scientific & Engineering Research, Volume 4, Issue 11, November-2013 349 Load Balancing Heterogeneous Request in DHT-based P2P Systems Mrs. Yogita A. Dalvi Dr. R. Shankar Mr. Atesh
International Journal of Applied Science and Technology Vol. 2 No. 3; March 2012. Green WSUS
International Journal of Applied Science and Technology Vol. 2 No. 3; March 2012 Abstract 112 Green WSUS Seifedine Kadry, Chibli Joumaa American University of the Middle East Kuwait The new era of information
GraySort and MinuteSort at Yahoo on Hadoop 0.23
GraySort and at Yahoo on Hadoop.23 Thomas Graves Yahoo! May, 213 The Apache Hadoop[1] software library is an open source framework that allows for the distributed processing of large data sets across clusters
International Journal of Computer Science Trends and Technology (IJCST) Volume 3 Issue 3, May-June 2015
RESEARCH ARTICLE OPEN ACCESS Ensuring Reliability and High Availability in Cloud by Employing a Fault Tolerance Enabled Load Balancing Algorithm G.Gayathri [1], N.Prabakaran [2] Department of Computer
Efficient Analysis of Big Data Using Map Reduce Framework
Efficient Analysis of Big Data Using Map Reduce Framework Dr. Siddaraju 1, Sowmya C L 2, Rashmi K 3, Rahul M 4 1 Professor & Head of Department of Computer Science & Engineering, 2,3,4 Assistant Professor,
[Sathish Kumar, 4(3): March, 2015] ISSN: 2277-9655 Scientific Journal Impact Factor: 3.449 (ISRA), Impact Factor: 2.114
IJESRT INTERNATIONAL JOURNAL OF ENGINEERING SCIENCES & RESEARCH TECHNOLOGY HANDLING HEAVY-TAILED TRAFFIC IN QUEUEING NETWORKS USING MAX WEIGHT ALGORITHM M.Sathish Kumar *, G.Sathish Kumar * Department
Big Data. White Paper. Big Data Executive Overview WP-BD-10312014-01. Jafar Shunnar & Dan Raver. Page 1 Last Updated 11-10-2014
White Paper Big Data Executive Overview WP-BD-10312014-01 By Jafar Shunnar & Dan Raver Page 1 Last Updated 11-10-2014 Table of Contents Section 01 Big Data Facts Page 3-4 Section 02 What is Big Data? Page
Hadoop Distributed File System. Dhruba Borthakur June, 2007
Hadoop Distributed File System Dhruba Borthakur June, 2007 Goals of HDFS Very Large Distributed File System 10K nodes, 100 million files, 10 PB Assumes Commodity Hardware Files are replicated to handle
Flexible Deterministic Packet Marking: An IP Traceback Scheme Against DDOS Attacks
Flexible Deterministic Packet Marking: An IP Traceback Scheme Against DDOS Attacks Prashil S. Waghmare PG student, Sinhgad College of Engineering, Vadgaon, Pune University, Maharashtra, India. [email protected]
Detection of Distributed Denial of Service Attack with Hadoop on Live Network
Detection of Distributed Denial of Service Attack with Hadoop on Live Network Suchita Korad 1, Shubhada Kadam 2, Prajakta Deore 3, Madhuri Jadhav 4, Prof.Rahul Patil 5 Students, Dept. of Computer, PCCOE,
Fault Tolerance Techniques in Big Data Tools: A Survey
on 21 st & 22 nd April 2014, Organized by Fault Tolerance Techniques in Big Data Tools: A Survey Manjula Dyavanur 1, Kavita Kori 2 Asst. Professor, Dept. of CSE, SKSVMACET, Laxmeshwar-582116, India 1,2
Hadoop & its Usage at Facebook
Hadoop & its Usage at Facebook Dhruba Borthakur Project Lead, Hadoop Distributed File System [email protected] Presented at the Storage Developer Conference, Santa Clara September 15, 2009 Outline Introduction
GEO DISTRIBUTED PARALLELIZATION PACTS IN MAP REDUCE FRAMEWORK IN MULTIPLE DATACENTER
GEO DISTRIBUTED PARALLELIZATION PACTS IN MAP REDUCE FRAMEWORK IN MULTIPLE DATACENTER C.Kirubanantham 1, C.Rajavenkateswaran 2 1 Student, computer science engineering, Nandha College of Technology, India
CDH AND BUSINESS CONTINUITY:
WHITE PAPER CDH AND BUSINESS CONTINUITY: An overview of the availability, data protection and disaster recovery features in Hadoop Abstract Using the sophisticated built-in capabilities of CDH for tunable
A Survey Paper: Cloud Computing and Virtual Machine Migration
577 A Survey Paper: Cloud Computing and Virtual Machine Migration 1 Yatendra Sahu, 2 Neha Agrawal 1 UIT, RGPV, Bhopal MP 462036, INDIA 2 MANIT, Bhopal MP 462051, INDIA Abstract - Cloud computing is one
Hadoop IST 734 SS CHUNG
Hadoop IST 734 SS CHUNG Introduction What is Big Data?? Bulk Amount Unstructured Lots of Applications which need to handle huge amount of data (in terms of 500+ TB per day) If a regular machine need to
A REAL TIME MEMORY SLOT UTILIZATION DESIGN FOR MAPREDUCE MEMORY CLUSTERS
A REAL TIME MEMORY SLOT UTILIZATION DESIGN FOR MAPREDUCE MEMORY CLUSTERS Suma R 1, Vinay T R 2, Byre Gowda B K 3 1 Post graduate Student, CSE, SVCE, Bangalore 2 Assistant Professor, CSE, SVCE, Bangalore
Apache Hadoop new way for the company to store and analyze big data
Apache Hadoop new way for the company to store and analyze big data Reyna Ulaque Software Engineer Agenda What is Big Data? What is Hadoop? Who uses Hadoop? Hadoop Architecture Hadoop Distributed File
Recognization of Satellite Images of Large Scale Data Based On Map- Reduce Framework
Recognization of Satellite Images of Large Scale Data Based On Map- Reduce Framework Vidya Dhondiba Jadhav, Harshada Jayant Nazirkar, Sneha Manik Idekar Dept. of Information Technology, JSPM s BSIOTR (W),
NoSQL and Hadoop Technologies On Oracle Cloud
NoSQL and Hadoop Technologies On Oracle Cloud Vatika Sharma 1, Meenu Dave 2 1 M.Tech. Scholar, Department of CSE, Jagan Nath University, Jaipur, India 2 Assistant Professor, Department of CSE, Jagan Nath
MASSIVE DATA PROCESSING (THE GOOGLE WAY ) 27/04/2015. Fundamentals of Distributed Systems. Inside Google circa 2015
7/04/05 Fundamentals of Distributed Systems CC5- PROCESAMIENTO MASIVO DE DATOS OTOÑO 05 Lecture 4: DFS & MapReduce I Aidan Hogan [email protected] Inside Google circa 997/98 MASSIVE DATA PROCESSING (THE
Comparative analysis of mapreduce job by keeping data constant and varying cluster size technique
Comparative analysis of mapreduce job by keeping data constant and varying cluster size technique Mahesh Maurya a, Sunita Mahajan b * a Research Scholar, JJT University, MPSTME, Mumbai, India,[email protected]
ENHANCED HYBRID FRAMEWORK OF RELIABILITY ANALYSIS FOR SAFETY CRITICAL NETWORK INFRASTRUCTURE
ENHANCED HYBRID FRAMEWORK OF RELIABILITY ANALYSIS FOR SAFETY CRITICAL NETWORK INFRASTRUCTURE Chandana Priyanka G. H., Aarthi R. S., Chakaravarthi S., Selvamani K. 2 and Kannan A. 3 Department of Computer
Role of Cloud Computing in Big Data Analytics Using MapReduce Component of Hadoop
Role of Cloud Computing in Big Data Analytics Using MapReduce Component of Hadoop Kanchan A. Khedikar Department of Computer Science & Engineering Walchand Institute of Technoloy, Solapur, Maharashtra,
A Hybrid Electrical and Optical Networking Topology of Data Center for Big Data Network
ASEE 2014 Zone I Conference, April 3-5, 2014, University of Bridgeport, Bridgpeort, CT, USA A Hybrid Electrical and Optical Networking Topology of Data Center for Big Data Network Mohammad Naimur Rahman
http://www.paper.edu.cn
5 10 15 20 25 30 35 A platform for massive railway information data storage # SHAN Xu 1, WANG Genying 1, LIU Lin 2** (1. Key Laboratory of Communication and Information Systems, Beijing Municipal Commission
Client Based Power Iteration Clustering Algorithm to Reduce Dimensionality in Big Data
Client Based Power Iteration Clustering Algorithm to Reduce Dimensionalit in Big Data Jaalatchum. D 1, Thambidurai. P 1, Department of CSE, PKIET, Karaikal, India Abstract - Clustering is a group of objects
Enhancing the Scalability of Virtual Machines in Cloud
Enhancing the Scalability of Virtual Machines in Cloud Chippy.A #1, Ashok Kumar.P #2, Deepak.S #3, Ananthi.S #4 # Department of Computer Science and Engineering, SNS College of Technology Coimbatore, Tamil
