Performance Analysis of Multi-Node Hadoop Clusters using Amazon EC2 Instances
Ruchi Mittal 1, Ruhi Bagga 2
1, 2 Rayat Bahra Group of Institutes, Punjab Technical University, Patiala, Punjab, India

Abstract: Hadoop, an open source implementation of the MapReduce model, is an effective tool for handling, processing and analyzing the unstructured data generated these days by different cloud applications. Hadoop assumes that the nodes of a cluster are homogeneous in their processing capability. In real-world applications, however, the nodes of a cluster are heterogeneous, and in such cases Hadoop does not yield effective performance. In this paper, we evaluate and analyze the performance of the WordCount MapReduce application using Hadoop on Amazon EC2 with different Ubuntu instances. The performance has been evaluated on both single-node and multi-node clusters; the multi-node clusters include both homogeneous and heterogeneous configurations. Performance is measured as the execution time of the application on different file sizes.

Keywords: Cloud Computing, Hadoop, MapReduce, Homogeneous & Heterogeneous Hadoop cluster

1. Introduction

Nowadays, humongous amounts of data are generated continuously by applications ranging from business computing, the Internet and social media (e.g. Facebook, Twitter) to scientific research. As this data grows every day, so does the need to handle, manage and analyze it. MapReduce has proven to be an efficient technique for handling such large volumes of unstructured data. MapReduce, first proposed by Google in 2004, is an efficient programmable framework for handling large data in a parallel, distributed manner on a cluster of many systems.
Hadoop, part of the Apache project sponsored by the Apache Software Foundation, is an open source implementation of the MapReduce programming paradigm that allows distributed processing of large datasets on clusters of commodity computers. Hadoop is built to handle terabytes and petabytes of unstructured data and is used for large datasets across a wide range of applications; it answers many of the questions raised by the challenges of Big Data. In this work, we have used Amazon EC2 Ubuntu instances, a service provided by Amazon Web Services (AWS). We have evaluated the performance of Hadoop on both homogeneous and heterogeneous multi-node clusters and compared the two using the experimental results.

2. Hadoop Architecture

Hadoop is a scalable, fault-tolerant, flexible and distributed system for data storage and processing. It has two main components: Hadoop HDFS and Hadoop MapReduce. HDFS stores the data, while MapReduce processes it, running parallel computations on the data and retrieving the results. MapReduce is considered the heart of the Hadoop system; it performs parallel processing over large datasets, generally terabytes to petabytes in size. Hadoop is based on batch processing and handles large unstructured data, in contrast to traditional relational database systems, which work on structured data only. Because of this architecture, heterogeneity and data locality are the main factors affecting the performance of a Hadoop system. In a classic homogeneous Hadoop cluster, all nodes have the same processing ability and hard disk capacity. In real-world deployments, however, nodes may differ in processing ability and disk capacity. With the default Hadoop strategy, the faster nodes finish processing their local data sooner.
After finishing the tasks on their local data, the faster nodes can then work on non-local data residing on the slower nodes. This requires more data movement between the nodes, which degrades Hadoop's performance. In this work, we have analyzed the performance of Hadoop in a single-node cluster, that is, in pseudo-distributed mode on different physical machines, by executing a word counting application on different input data sizes. For performance evaluation in multi-node clusters, we configured homogeneous and heterogeneous clusters on Amazon EC2. In Figure 1, the client machine has Hadoop installed with all the cluster settings; it is neither a NameNode nor a DataNode. Its role is to load data into the cluster, submit MapReduce jobs to the nodes, monitor job processing and retrieve the results when a job finishes. Hadoop runs best on Linux machines, working directly with the underlying hardware.
Paper ID: SUB
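The effect can be put in rough numbers. The sketch below uses hypothetical per-node speeds and block counts (not measured values from this paper) to contrast the default purely local strategy, where the job waits for the slowest node, with an idealized rebalanced run; any real rebalancing also pays the data-movement cost just described:

```python
# Hypothetical relative processing speeds for three DataNodes and an
# equal number of local HDFS blocks on each node.
speeds = {"fast": 2.0, "medium": 1.0, "slow": 0.5}
blocks_per_node = 8
block_cost = 1.0  # time units to process one block at speed 1.0

# Default strategy: every node processes only its local blocks, so the
# job finishes when the slowest node does.
local_finish = {n: blocks_per_node * block_cost / s for n, s in speeds.items()}
job_time = max(local_finish.values())            # 16.0, set by "slow"

# Idealized, perfectly rebalanced run (ignoring transfer cost): total
# work divided by total processing capacity.
total_work = blocks_per_node * block_cost * len(speeds)
ideal_time = total_work / sum(speeds.values())   # about 6.86

print(job_time, round(ideal_time, 2))
```

The gap between 16.0 and roughly 6.9 time units is what heterogeneity-aware placement tries to reclaim; moving blocks off the slow node narrows it, but every moved block adds network transfer time.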
2.1 HDFS

Figure 1: Hadoop Architecture

HDFS is the Hadoop Distributed File System, implemented at Yahoo! based on the Google File System. As its name implies, it is a distributed, reliable, fault-tolerant file system. HDFS follows a master/slave architecture comprising a NameNode, DataNodes and a Secondary NameNode. The NameNode is the master node, which controls all the DataNodes and handles all file system operations. The DataNodes are the slave nodes, which perform the actual work such as block operations. The Secondary NameNode acts as the housekeeping node for the NameNode. The client partitions the data into blocks, which are then stored on the DataNodes of the cluster. With the default replication factor of three, HDFS places the first copy of a block on the local node, the second on another DataNode in the local rack, and the last on a node in a different rack. This block replication prevents data loss in case of DataNode failure. The default block size in HDFS is 64 MB, which can be increased if required.

The roles of the three main HDFS components are as follows. The NameNode is the master node and central controller of HDFS; it communicates with the HDFS client, holds all the metadata of the file system in a file named fsimage, oversees the health of all the DataNodes and coordinates access to data in the cluster. It maps the data of the whole cluster to the different DataNodes and keeps track of all transactions carried out in the cluster. When the NameNode is down, the whole cluster is down. The DataNodes are the slave nodes of HDFS, performing the actual operations for the requests submitted by the client to the NameNode. DataNodes communicate with the NameNode by sending heartbeat messages every three seconds via a TCP handshake. Every tenth heartbeat is a Block Report, in which a DataNode tells the NameNode about all the data blocks it holds; from these reports the NameNode builds its metadata. The Secondary NameNode periodically checks the file system for changes and merges them into the fsimage file, which contains the metadata about the file system. On failure of the NameNode, the block information can be recovered from it.

2.2 MapReduce

MapReduce is the programmable framework for processing data in parallel in a cluster. Applications written using the MapReduce model act on the large datasets stored in HDFS. Job submission, job initialization, task assignment, task execution, progress and status updates, and all other activities related to job completion are handled by MapReduce. All these activities are managed by the JobTracker and executed by the TaskTrackers, the two main components of MapReduce, which again follow a master/slave architecture.

Processing is carried out in two phases, the Map phase and the Reduce phase. In the Map phase, the input is partitioned into small chunks, which are processed in parallel. The output of this phase is a set of <key, value> pairs, which are passed to the reducer; the reducer shuffles, sorts and combines all the map outputs to produce a single result.

The roles of the JobTracker and TaskTracker are as follows. The JobTracker is the master to the TaskTrackers. It schedules and coordinates all the jobs submitted by the client and handles task distribution to the TaskTrackers. The JobTracker and TaskTrackers use heartbeat messages to communicate with each other. When a job is submitted by the client, the JobTracker communicates with the NameNode to locate the data required for processing and then submits the job to the appropriate TaskTrackers.
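As a minimal illustration of the two phases, the following single-process Python sketch mimics word counting: the map step emits <word, 1> pairs, and the reduce step sorts them by key (the shuffle) and sums per key. This is plain Python standing in for Hadoop's Java MapReduce API; in a real cluster the map and reduce tasks run in parallel on the TaskTrackers.

```python
from itertools import groupby
from operator import itemgetter

def map_phase(chunk):
    """Map: emit a <word, 1> pair for every word in one input chunk."""
    for word in chunk.split():
        yield (word.lower(), 1)

def reduce_phase(pairs):
    """Shuffle/sort the <key, value> pairs by key, then sum per key."""
    ordered = sorted(pairs, key=itemgetter(0))
    return {key: sum(count for _, count in group)
            for key, group in groupby(ordered, key=itemgetter(0))}

# Two "chunks" stand in for the input splits handled by map tasks.
chunks = ["Hadoop stores data", "Hadoop processes data in parallel"]
pairs = [pair for chunk in chunks for pair in map_phase(chunk)]
counts = reduce_phase(pairs)
print(counts["hadoop"], counts["data"])  # 2 2
```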
TaskTrackers periodically send heartbeat messages to the JobTracker to confirm that they are alive and working on their allocated tasks. If the JobTracker does not receive a message from a TaskTracker for a particular period of time, it considers that node dead and reallocates its tasks to another TaskTracker that is still alive. A TaskTracker receives a job from the JobTracker, breaks it into map and reduce tasks, executes the tasks, reports status updates to the JobTracker via heartbeat messages and sends the output at the end.

2.3 Amazon EC2

AWS, a cloud computing platform, is a collection of remote computing services offered by Amazon.com. Amazon Elastic Compute Cloud (EC2) is a well-known AWS web service that provides resizable compute capacity in the cloud, much faster and cheaper than building physical servers. With Amazon EC2, one can easily create, launch, reboot and terminate virtual servers and configure them according to one's requirements. Amazon, Google, Microsoft, IBM, Apple, Citrix and VMware are some of today's major cloud service providers.
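The failure handling described in Section 2.2 can be sketched as follows; the 600-second timeout and the reassign-to-first-live-tracker policy are illustrative assumptions, not values taken from this paper:

```python
def find_dead_trackers(last_heartbeat, now, timeout):
    """TaskTrackers whose most recent heartbeat is older than the timeout."""
    return [t for t, ts in last_heartbeat.items() if now - ts > timeout]

def reassign_tasks(assignments, dead, live):
    """Move every task owned by a dead tracker to the first live tracker."""
    for task, tracker in assignments.items():
        if tracker in dead:
            assignments[task] = live[0]
    return assignments

# tt2 has been silent for longer than the (assumed) timeout.
last_heartbeat = {"tt1": 1000.0, "tt2": 350.0}
dead = find_dead_trackers(last_heartbeat, now=1003.0, timeout=600)
tasks = reassign_tasks({"map_0": "tt2", "reduce_0": "tt1"}, dead,
                       live=["tt1"])
print(dead, tasks)  # ['tt2'] {'map_0': 'tt1', 'reduce_0': 'tt1'}
```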
International Journal of Science and Research (IJSR)

3. Experimental Setup

The experiments were set up on the Amazon Elastic Compute Cloud (EC2) platform of the AWS cloud. For our series of experiments, we used several types of EC2 instances, covering different t2 and m3 instances. The hardware configuration of the instances used in this work is shown in Table 1.

Table 1: Hardware configuration of selected EC2 instances
Instance Type   vCPU   Memory (GiB)   Clock Speed (GHz)
t2.micro        1      1              up to 3.3
t2.medium       2      4              up to 3.3
t2.large        2      8              up to 3.0
m3.medium
m3.large

The Amazon EC2 t2 instances are based on Intel Xeon processors, and the m3 instances on Intel Xeon E v2 processors. The operating system on all instances is 64-bit Ubuntu Server LTS. All instances are provisioned with 24 GB of SSD storage. Hadoop and 64-bit Java OpenJDK 1.7.0_25 are installed on all EC2 instances. We used EC2 t2.micro nodes for the homogeneous clusters; for the heterogeneous clusters, different instance types from Table 1 were selected. The MapReduce application used for performance evaluation is wordcount, which counts the number of occurrences of each word in a given input file.

4. Results and Observations

In this section, we compare the results of the wordcount MapReduce application executed on file sizes ranging from 300 MB to 6 GB. The application is executed on clusters of 6, 8 and 10 nodes, with one node as the NameNode, one as the Secondary NameNode and all others as DataNodes. Both homogeneous and heterogeneous cluster configurations are considered.

Table 2: Results of wordcount on Homogeneous Clusters
Input Data Size (GB)   Execution Time (in seconds)
                       6-Node Cluster (4 Slaves)   8-Node Cluster (6 Slaves)   10-Node Cluster (8 Slaves)
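To give a feel for the amount of work these input sizes imply, the sketch below applies the HDFS defaults from Section 2.1 (64 MB blocks, three-way replication); one map task typically processes one block:

```python
BLOCK_SIZE_MB = 64   # HDFS default block size
REPLICATION = 3      # HDFS default replication factor

def hdfs_blocks(file_size_mb, block_size_mb=BLOCK_SIZE_MB):
    """Number of HDFS blocks for a file (ceiling division)."""
    return -(-file_size_mb // block_size_mb)

# Smallest and largest input files used in the experiments.
for size_mb in (300, 6144):
    blocks = hdfs_blocks(size_mb)
    print(f"{size_mb} MB -> {blocks} blocks "
          f"({blocks * REPLICATION} replicated block copies)")
```

So a 6 GB input yields roughly 96 map tasks, while a 300 MB input yields only 5, which helps explain why small inputs cannot keep a larger cluster of DataNodes busy.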
4.1 Results of Homogeneous Clusters

The wordcount application is executed on three homogeneous clusters of 6, 8 and 10 nodes, on input files containing 300 MB to 6 GB of text data, to count the number of occurrences of each word in the files. All these multi-node clusters are populated with t2.micro EC2 instances and have one NameNode, one Secondary NameNode and the remaining nodes as DataNodes. Table 2 and Figure 2 show the results and the performance comparison of the three homogeneous clusters.

Figure 2: Performance comparison of Homogeneous Clusters

4.2 Results of Heterogeneous Clusters and Comparison with Homogeneous Clusters

We configured three heterogeneous clusters of six nodes each: one NameNode, one Secondary NameNode and four DataNodes. The three clusters differ in the types of DataNodes selected, so as to test different levels of heterogeneity. The configuration of the three clusters is shown in Table 3.

Table 3: Configuration of Heterogeneous Clusters
Heterogeneous Cluster   NameNode & Secondary NameNode   DataNodes (4 Slaves)
Cluster 1               t2.micro                        t2.micro, t2.micro, t2.micro, t2.large
Cluster 2               t2.micro                        t2.micro, t2.medium, m3.medium, m3.large
Cluster 3               t2.micro                        t2.micro, t2.medium, t2.large, m3.large

Table 4 shows the execution time taken by each heterogeneous cluster to execute the same wordcount application on the same input text files as the homogeneous clusters. The graph in Figure 3 depicts the performance comparison of these clusters with each other and with the homogeneous clusters.
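Testing different levels of heterogeneity raises the question of how work should be divided among unequal DataNodes. One simple heterogeneity-aware idea (illustrative only; not a placement algorithm evaluated in this paper, and the speed ratios below are hypothetical) is to distribute blocks in proportion to measured node speed:

```python
def distribute_blocks(total_blocks, speeds):
    """Split blocks across DataNodes in proportion to their relative
    speeds, using largest-remainder rounding so the counts sum exactly."""
    total_speed = sum(speeds.values())
    exact = {n: total_blocks * s / total_speed for n, s in speeds.items()}
    alloc = {n: int(x) for n, x in exact.items()}
    leftover = total_blocks - sum(alloc.values())
    # Hand any remaining blocks to the nodes with the largest fractional part.
    for n in sorted(exact, key=lambda n: exact[n] - alloc[n],
                    reverse=True)[:leftover]:
        alloc[n] += 1
    return alloc

# Hypothetical speed ratios for the four DataNodes of Cluster 3.
speeds = {"t2.micro": 1.0, "t2.medium": 2.0, "t2.large": 2.5, "m3.large": 2.5}
print(distribute_blocks(96, speeds))
# {'t2.micro': 12, 't2.medium': 24, 't2.large': 30, 'm3.large': 30}
```

Placing fewer blocks on the slower nodes reduces the non-local reads described in Section 2, at the cost of having to estimate the speed ratios.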
Table 4: Results of wordcount on Heterogeneous Clusters
Input File Size (GB)   Execution Time (in seconds)
                       Cluster 1   Cluster 2   Cluster 3

Figure 3: Performance comparison of Heterogeneous Clusters with Homogeneous Clusters

5. Conclusion

In this paper, we have analyzed the results of the wordcount MapReduce application on different homogeneous and heterogeneous cloud-based Hadoop clusters over different input file sizes. For the homogeneous clusters, we conclude that a threshold exists below which adding more nodes does not enhance cluster performance; beyond that threshold, increasing the number of DataNodes does improve performance. For the heterogeneous clusters, we conclude that with the right combination of heterogeneous nodes, a Hadoop cluster can outperform a homogeneous one, whereas a wrong combination causes performance degradation and additional overhead between the DataNodes.

6. Future Scope

In real-world applications, heterogeneous nodes are unavoidable. Our future work aims at developing methods that can suggest node combinations for heterogeneous clusters that enhance their performance compared to the present scenario.
Journal of science e ISSN 2277-3290 Print ISSN 2277-3282 Information Technology www.journalofscience.net STUDY ON REPLICA MANAGEMENT AND HIGH AVAILABILITY IN HADOOP DISTRIBUTED FILE SYSTEM (HDFS) S. Chandra
International Journal of Advance Research in Computer Science and Management Studies
Volume 2, Issue 8, August 2014 ISSN: 2321 7782 (Online) International Journal of Advance Research in Computer Science and Management Studies Research Article / Survey Paper / Case Study Available online
Improving Performance of Hadoop Clusters. Jiong Xie
Improving Performance of Hadoop Clusters by Jiong Xie A dissertation submitted to the Graduate Faculty of Auburn University in partial fulfillment of the requirements for the Degree of Doctor of Philosophy
Telecom Data processing and analysis based on Hadoop
COMPUTER MODELLING & NEW TECHNOLOGIES 214 18(12B) 658-664 Abstract Telecom Data processing and analysis based on Hadoop Guofan Lu, Qingnian Zhang *, Zhao Chen Wuhan University of Technology, Wuhan 4363,China
Hadoop & its Usage at Facebook
Hadoop & its Usage at Facebook Dhruba Borthakur Project Lead, Hadoop Distributed File System [email protected] Presented at the Storage Developer Conference, Santa Clara September 15, 2009 Outline Introduction
Introduction. Various user groups requiring Hadoop, each with its own diverse needs, include:
Introduction BIG DATA is a term that s been buzzing around a lot lately, and its use is a trend that s been increasing at a steady pace over the past few years. It s quite likely you ve also encountered
MapReduce, Hadoop and Amazon AWS
MapReduce, Hadoop and Amazon AWS Yasser Ganjisaffar http://www.ics.uci.edu/~yganjisa February 2011 What is Hadoop? A software framework that supports data-intensive distributed applications. It enables
Open source Google-style large scale data analysis with Hadoop
Open source Google-style large scale data analysis with Hadoop Ioannis Konstantinou Email: [email protected] Web: http://www.cslab.ntua.gr/~ikons Computing Systems Laboratory School of Electrical
MapReduce and Hadoop Distributed File System
MapReduce and Hadoop Distributed File System 1 B. RAMAMURTHY Contact: Dr. Bina Ramamurthy CSE Department University at Buffalo (SUNY) [email protected] http://www.cse.buffalo.edu/faculty/bina Partially
Big Data Analysis and Its Scheduling Policy Hadoop
IOSR Journal of Computer Engineering (IOSR-JCE) e-issn: 2278-0661,p-ISSN: 2278-8727, Volume 17, Issue 1, Ver. IV (Jan Feb. 2015), PP 36-40 www.iosrjournals.org Big Data Analysis and Its Scheduling Policy
Parallel Processing of cluster by Map Reduce
Parallel Processing of cluster by Map Reduce Abstract Madhavi Vaidya, Department of Computer Science Vivekanand College, Chembur, Mumbai [email protected] MapReduce is a parallel programming model
Apache Hadoop FileSystem and its Usage in Facebook
Apache Hadoop FileSystem and its Usage in Facebook Dhruba Borthakur Project Lead, Apache Hadoop Distributed File System [email protected] Presented at Indian Institute of Technology November, 2010 http://www.facebook.com/hadoopfs
Apache Hadoop new way for the company to store and analyze big data
Apache Hadoop new way for the company to store and analyze big data Reyna Ulaque Software Engineer Agenda What is Big Data? What is Hadoop? Who uses Hadoop? Hadoop Architecture Hadoop Distributed File
Apache Hadoop FileSystem Internals
Apache Hadoop FileSystem Internals Dhruba Borthakur Project Lead, Apache Hadoop Distributed File System [email protected] Presented at Storage Developer Conference, San Jose September 22, 2010 http://www.facebook.com/hadoopfs
Hadoop/MapReduce. Object-oriented framework presentation CSCI 5448 Casey McTaggart
Hadoop/MapReduce Object-oriented framework presentation CSCI 5448 Casey McTaggart What is Apache Hadoop? Large scale, open source software framework Yahoo! has been the largest contributor to date Dedicated
Design of Electric Energy Acquisition System on Hadoop
, pp.47-54 http://dx.doi.org/10.14257/ijgdc.2015.8.5.04 Design of Electric Energy Acquisition System on Hadoop Yi Wu 1 and Jianjun Zhou 2 1 School of Information Science and Technology, Heilongjiang University
Performance and Energy Efficiency of. Hadoop deployment models
Performance and Energy Efficiency of Hadoop deployment models Contents Review: What is MapReduce Review: What is Hadoop Hadoop Deployment Models Metrics Experiment Results Summary MapReduce Introduced
Mobile Cloud Computing for Data-Intensive Applications
Mobile Cloud Computing for Data-Intensive Applications Senior Thesis Final Report Vincent Teo, [email protected] Advisor: Professor Priya Narasimhan, [email protected] Abstract The computational and storage
Hadoop Scalability and Performance Testing in Heterogeneous Clusters
Int'l Conf. Par. and Dist. Proc. Tech. and Appl. PDPTA'15 441 Hadoop Scalability and Performance Testing in Heterogeneous Clusters Fernando G. Tinetti 1, Ignacio Real 2, Rodrigo Jaramillo 2, and Damián
How to properly misuse Hadoop. Marcel Huntemann NERSC tutorial session 2/12/13
How to properly misuse Hadoop Marcel Huntemann NERSC tutorial session 2/12/13 History Created by Doug Cutting (also creator of Apache Lucene). 2002 Origin in Apache Nutch (open source web search engine).
Storage and Retrieval of Data for Smart City using Hadoop
Storage and Retrieval of Data for Smart City using Hadoop Ravi Gehlot Department of Computer Science Poornima Institute of Engineering and Technology Jaipur, India Abstract Smart cities are equipped with
Network-Aware Scheduling of MapReduce Framework on Distributed Clusters over High Speed Networks
Network-Aware Scheduling of MapReduce Framework on Distributed Clusters over High Speed Networks Praveenkumar Kondikoppa, Chui-Hui Chiu, Cheng Cui, Lin Xue and Seung-Jong Park Department of Computer Science,
A SURVEY ON MAPREDUCE IN CLOUD COMPUTING
A SURVEY ON MAPREDUCE IN CLOUD COMPUTING Dr.M.Newlin Rajkumar 1, S.Balachandar 2, Dr.V.Venkatesakumar 3, T.Mahadevan 4 1 Asst. Prof, Dept. of CSE,Anna University Regional Centre, Coimbatore, [email protected]
Introduction to Big Data! with Apache Spark" UC#BERKELEY#
Introduction to Big Data! with Apache Spark" UC#BERKELEY# This Lecture" The Big Data Problem" Hardware for Big Data" Distributing Work" Handling Failures and Slow Machines" Map Reduce and Complex Jobs"
Design and Evolution of the Apache Hadoop File System(HDFS)
Design and Evolution of the Apache Hadoop File System(HDFS) Dhruba Borthakur Engineer@Facebook Committer@Apache HDFS SDC, Sept 19 2011 Outline Introduction Yet another file-system, why? Goals of Hadoop
White Paper. Big Data and Hadoop. Abhishek S, Java COE. Cloud Computing Mobile DW-BI-Analytics Microsoft Oracle ERP Java SAP ERP
White Paper Big Data and Hadoop Abhishek S, Java COE www.marlabs.com Cloud Computing Mobile DW-BI-Analytics Microsoft Oracle ERP Java SAP ERP Table of contents Abstract.. 1 Introduction. 2 What is Big
Analysis of Information Management and Scheduling Technology in Hadoop
Analysis of Information Management and Scheduling Technology in Hadoop Ma Weihua, Zhang Hong, Li Qianmu, Xia Bin School of Computer Science and Technology Nanjing University of Science and Engineering
Resource Scalability for Efficient Parallel Processing in Cloud
Resource Scalability for Efficient Parallel Processing in Cloud ABSTRACT Govinda.K #1, Abirami.M #2, Divya Mercy Silva.J #3 #1 SCSE, VIT University #2 SITE, VIT University #3 SITE, VIT University In the
A PERFORMANCE ANALYSIS of HADOOP CLUSTERS in OPENSTACK CLOUD and in REAL SYSTEM
A PERFORMANCE ANALYSIS of HADOOP CLUSTERS in OPENSTACK CLOUD and in REAL SYSTEM Ramesh Maharjan and Manoj Shakya Department of Computer Science and Engineering Dhulikhel, Kavre, Nepal [email protected],
http://www.wordle.net/
Hadoop & MapReduce http://www.wordle.net/ http://www.wordle.net/ Hadoop is an open-source software framework (or platform) for Reliable + Scalable + Distributed Storage/Computational unit Failures completely
t] open source Hadoop Beginner's Guide ij$ data avalanche Garry Turkington Learn how to crunch big data to extract meaning from
Hadoop Beginner's Guide Learn how to crunch big data to extract meaning from data avalanche Garry Turkington [ PUBLISHING t] open source I I community experience distilled ftu\ ij$ BIRMINGHAMMUMBAI ')
6. How MapReduce Works. Jari-Pekka Voutilainen
6. How MapReduce Works Jari-Pekka Voutilainen MapReduce Implementations Apache Hadoop has 2 implementations of MapReduce: Classic MapReduce (MapReduce 1) YARN (MapReduce 2) Classic MapReduce The Client
Processing of Hadoop using Highly Available NameNode
Processing of Hadoop using Highly Available NameNode 1 Akash Deshpande, 2 Shrikant Badwaik, 3 Sailee Nalawade, 4 Anjali Bote, 5 Prof. S. P. Kosbatwar Department of computer Engineering Smt. Kashibai Navale
Introduction to MapReduce and Hadoop
Introduction to MapReduce and Hadoop Jie Tao Karlsruhe Institute of Technology [email protected] Die Kooperation von Why Map/Reduce? Massive data Can not be stored on a single machine Takes too long to process
Open Cloud System. (Integration of Eucalyptus, Hadoop and AppScale into deployment of University Private Cloud)
Open Cloud System (Integration of Eucalyptus, Hadoop and into deployment of University Private Cloud) Thinn Thu Naing University of Computer Studies, Yangon 25 th October 2011 Open Cloud System University
Sector vs. Hadoop. A Brief Comparison Between the Two Systems
Sector vs. Hadoop A Brief Comparison Between the Two Systems Background Sector is a relatively new system that is broadly comparable to Hadoop, and people want to know what are the differences. Is Sector
