A PERFORMANCE ANALYSIS of HADOOP CLUSTERS in OPENSTACK CLOUD and in REAL SYSTEM

Ramesh Maharjan and Manoj Shakya
Department of Computer Science and Engineering
Dhulikhel, Kavre, Nepal

Abstract

Cloud computing, data and distributed systems are the three central topics of this paper. Cloud computing is being widely adopted by organizations in every field of work, be it business or education. Data storage and processing is a fundamental task of any organization. Hadoop is a distributed framework created to handle big data processing. The aim of this paper is to study and analyze aspects such as performance, flexibility and scalability of Hadoop clusters in the cloud and on commodity hardware.

Introduction

Cloud computing is an abstract term describing the use of resources that do not belong to the user to perform a required task, and the release of those resources once they are no longer in use. Buyya et al. [1] have defined it as follows: "Cloud is a parallel and distributed computing system consisting of a collection of inter-connected and virtualized computers that are dynamically provisioned and presented as one or more unified computing resources based on service-level agreements (SLA) established through negotiation between the service provider and consumers." Vaquero et al. [2] have stated: "Clouds are a large pool of easily usable and accessible virtualized resources (such as hardware, development platforms and/or services). These resources can be dynamically reconfigured to adjust to a variable load (scale), allowing also for an optimum resource utilization. This pool of resources is typically exploited by a pay-per-use model in which guarantees are offered by the Infrastructure Provider by means of customized Service Level Agreements."

The idea behind cloud computing [3] dates back to the time when mainframe computers in academia and corporations were shared and accessed through client computers. Cloud computing [5] builds on several research areas, such as HPC, virtualization, utility computing, grid computing, networking and security.

Depending on the service provided, cloud computing can be broadly divided into [4] Software as a Service (SaaS), Platform as a Service (PaaS) and Infrastructure as a Service (IaaS). SaaS provides both resources and applications as a service to the clients. PaaS gives clients access to the platforms they need to develop their own applications. IaaS provides storage and computing infrastructure to the clients over the internet.

Based on the deployment technology used, cloud computing can be subcategorized into [4] public, private, community and hybrid clouds. A public cloud is accessible to anyone with an internet connection. A private cloud is limited to an organization or group. A community cloud is a private cloud shared between groups or organizations. A hybrid cloud is a mixture of at least two of the clouds described above.

There are many platforms available for setting up a cloud. OpenStack was chosen for the following reasons. First, it is open source, which leaves the user free to pick and mix any hardware, design custom networks, use any virtualization technology and add other needed features. Second, it has the largest group of developers and contributors. Third, it is simple to configure and use.

Data is an inseparable part of any organization. According to [6], "Every day, we create 2.5 quintillion bytes of data." The obvious cause is the advancement of technology and its use. The accumulated data needs to be stored, processed and analyzed to extract useful information. There are many technical solutions for this, but Hadoop was chosen for the following reasons. Hadoop makes data mining, analytics and processing of big data cheap and fast. Hadoop is an open source project made to deal with terabytes of data in minutes. Hadoop is a way for companies with gigantic amounts of data, such as Facebook, Twitter, Yahoo, eBay and Amazon, to make decisions quickly and cost-effectively. Hadoop is easily scalable, as hard drives and nodes can be added without shutting Hadoop down. Hadoop stores and processes any kind of data. Hadoop is written natively in Java but can be accessed from other languages, such as the SQL-inspired language of Hive, C/C++, Python and many more.

Some knowledge of Hadoop is needed to follow this paper. The Apache Hadoop software library is a framework that allows the distributed processing of large data sets across clusters of computers using a simple programming model. It is designed to scale up from single servers to thousands of machines, each offering local computation and storage. It provides the building blocks of a distributed system for data storage, data analysis and coordination. The framework is written in Java, runs applications on large clusters of commodity hardware, and incorporates features similar to those of the Google File System (GFS) and of MapReduce. Hadoop employs a master/slave architecture for both distributed storage and distributed computation.

This paper is divided into five parts: introduction, cloud, clusters, analysis and conclusion. The cloud section briefly explains the OpenStack cloud. The clusters section explains how the Hadoop clusters were set up on real computers and in the cloud, and how applications were run on the distributed systems. The analysis section compares the two systems with appropriate graphs.
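The "simple programming model" mentioned in the introduction can be illustrated with a word count, the same kind of job that is later run on both clusters. The following is a minimal single-process sketch in plain Python, not Hadoop code and not the job shipped with the Hadoop package; the function names `map_phase`, `shuffle`, `reduce_phase` and `word_count` are hypothetical and only mimic the map, shuffle and reduce stages that the framework would distribute across nodes.

```python
from collections import defaultdict


def map_phase(line):
    """Map step: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle step: group all emitted values by key.

    In a real cluster the framework does this between map and reduce;
    here it is simulated in memory (an assumption of this sketch).
    """
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce step: sum the counts collected for one word."""
    return key, sum(values)


def word_count(lines):
    """Run map, shuffle and reduce in sequence over a list of lines."""
    pairs = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())


if __name__ == "__main__":
    sample = ["Hadoop stores data", "Hadoop processes data"]
    print(word_count(sample))
```

In an actual Hadoop job the map and reduce steps would run on different nodes and read from and write to the distributed file system; the sketch only shows the shape of the computation.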
The cloud

According to the official OpenStack starter guide [7], "Cloud computing is a computing model, where resources such as computing power, storage, network and software are abstracted and provided as services on the Internet in a remotely accessible fashion." An infrastructure set up using the cloud computing model is generally referred to as the "cloud". The guide further explains that OpenStack is a collection of open source software projects that enterprises and service providers can use to set up and run their cloud compute and storage infrastructure. The OpenStack installation used here consists of five projects: Nova (compute), Swift, Glance, Keystone and Horizon. The following figures show how they fit together.

Fig 1: Simple OpenStack architecture

Fig 2: Simple OpenStack workflow

There are many reasons behind choosing OpenStack. The first is its flexibility regarding networking and hardware. The second is that it is open source, with a large community of developers and contributors. Third, OpenStack has gained wide industry adoption: large companies such as Dell, AMD, Cisco and HP, alongside Rackspace, are using it. Furthermore, Linux heavyweights such as Red Hat and Ubuntu are implementing OpenStack. Last but not least, it is simple and flexible to configure and use.

The clusters

After the successful installation and configuration of the OpenStack cloud, virtual servers were created in the cloud. For this experiment, four instances were created under the Hadoop project, one acting as both master and slave and the remaining three as slaves. This gave a four-node cluster on which the Hadoop jobs could be run and analysed. The same configuration was applied to personal computers, so that the cluster in the cloud was made identical to the cluster on the real systems. The details of the clusters are given in the table below.

Table 1: Details of the clusters

Servers (cloud vs real)    Details (personal computers)        Details (virtual servers in cloud)
Master vs kucse-dcg        2 GB RAM, 2 VCPU, 160 GB storage    2 GB RAM, 2 VCPU, 80 GB storage
Slave1 vs user             2 GB RAM, 2 VCPU, 80 GB storage     2 GB RAM, 2 VCPU, 80 GB storage
Slave2 vs user1            2 GB RAM, 2 VCPU, 80 GB storage     2 GB RAM, 2 VCPU, 80 GB storage
Slave3 vs user3            2 GB RAM, 2 VCPU, 80 GB storage     2 GB RAM, 2 VCPU, 80 GB storage

A screenshot of the cloud dashboard is given below.

Fig 3: Dashboard of the Kathmandu University cloud with the Hadoop project
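To make the comparison in Table 1 concrete, the two clusters can be written down as data and their aggregate resources summed. This is only an illustrative sketch: the `Node` class and the `totals` helper are hypothetical names introduced here, and the figures are copied from Table 1.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    """One machine of a cluster, with the resources listed in Table 1."""
    name: str
    ram_gb: int
    vcpus: int
    storage_gb: int


# Personal computers (real system), host names as given in Table 1.
REAL_CLUSTER = [
    Node("kucse-dcg", 2, 2, 160),
    Node("user", 2, 2, 80),
    Node("user1", 2, 2, 80),
    Node("user3", 2, 2, 80),
]

# Virtual servers in the OpenStack cloud.
CLOUD_CLUSTER = [
    Node("Master", 2, 2, 80),
    Node("Slave1", 2, 2, 80),
    Node("Slave2", 2, 2, 80),
    Node("Slave3", 2, 2, 80),
]


def totals(nodes):
    """Sum RAM, VCPUs and storage over all nodes of a cluster."""
    return {
        "ram_gb": sum(n.ram_gb for n in nodes),
        "vcpus": sum(n.vcpus for n in nodes),
        "storage_gb": sum(n.storage_gb for n in nodes),
    }


if __name__ == "__main__":
    print("real: ", totals(REAL_CLUSTER))
    print("cloud:", totals(CLOUD_CLUSTER))
```

With the values from Table 1, both clusters add up to the same memory and VCPU count; the only difference is the larger disk of the real master node.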

After the successful configuration of the clusters, three jobs were run on both systems: two jobs converting image files to PDF files and one word count job. The first two jobs relied on serializing image and PDF files within the MapReduce framework [9][10][11], and the last job was available in the Hadoop package itself. The results on the two systems did not differ drastically and are summarized in the tables below.

Table 2: Summary of the first job

              Cluster in personal computers         Cluster in virtual servers in cloud
Inputs        23 folders, 94 image files, 169 MB    23 folders, 94 image files, 169 MB
Outputs       23 folders, 94 PDF files, 90.1 MB     1 folder, 94 PDF files, 90.1 MB
Time taken    3 minutes and 8 seconds               1 minute and 31 seconds

Table 3: Summary of the second job

              Cluster in personal computers         Cluster in virtual servers in cloud
Inputs        1 folder, 476 image files, 926 MB     1 folder, 476 image files, 926 MB
Outputs       1 PDF file, [...] MB                  1 PDF file, [...] MB
Time taken    7 minutes and 51 seconds              9 minutes and 22 seconds

Table 4: Summary of the third job

              Cluster in personal computers         Cluster in virtual servers in cloud
Inputs        1 text file, 1.1 GB                   1 text file, 1.1 GB
Outputs       1 text file with counts, [...] KB     1 text file with counts, [...] KB
Time taken    4 minutes and 0 seconds               5 minutes and 1 second

Three graphs generated from the above tables are given below.

Graph 1: Time taken for the first job

Graph 2: Time taken for the second job

Graph 3: Time taken for the third job
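The run times in Tables 2 to 4 are easier to compare once they are converted to seconds and expressed as a ratio of cloud time to real-system time. The sketch below does that arithmetic; the dictionary layout and the helper names `to_seconds`, `cloud_to_real_ratio` and `summarize` are assumptions made for illustration, while the times themselves are taken from the tables.

```python
# Run times copied from Tables 2-4 as (minutes, seconds) per cluster.
JOB_TIMES = {
    "job 1 (94 images to PDF)": {"real": (3, 8), "cloud": (1, 31)},
    "job 2 (476 images to PDF)": {"real": (7, 51), "cloud": (9, 22)},
    "job 3 (word count, 1.1 GB)": {"real": (4, 0), "cloud": (5, 1)},
}


def to_seconds(minutes, seconds):
    """Convert a (minutes, seconds) duration to seconds."""
    return minutes * 60 + seconds


def cloud_to_real_ratio(times):
    """Ratio of cloud run time to real-system run time for one job.

    A value above 1 means the cloud cluster was slower.
    """
    return to_seconds(*times["cloud"]) / to_seconds(*times["real"])


def summarize(job_times):
    """Return the rounded cloud/real ratio for every job."""
    return {job: round(cloud_to_real_ratio(t), 2) for job, t in job_times.items()}


if __name__ == "__main__":
    for job, ratio in summarize(JOB_TIMES).items():
        print(f"{job}: cloud/real = {ratio}")
```

Under these figures the cloud cluster needed roughly half the time of the real cluster for the first job and about 19% and 25% more time for the second and third jobs.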

Analysis

The Hadoop distributed system set up on personal computers was expected to be more efficient and faster than the cloud system. The reasons are that the Hadoop framework is developed with commodity machines in mind and that the processing runs on real hardware, without the resource sharing that takes place in cloud systems. The first job contradicted this expectation and the results of the other two jobs. The reason is that this job has to read and write files recursively and therefore has to cache all the bytes it reads. Because of hardware limitations in the personal computers used (especially memory), tasks had to be cancelled on the weaker machines (tasktrackers). The job was therefore somewhat slower on the real computers, owing to the cancelled tasks and their transfer to other tasktrackers (nodes).

To sum up, the Hadoop cluster on personal computers was as powerful and fault tolerant as the cloud Hadoop cluster, but not as scalable or flexible. The real system consists of physical hardware that cannot be made available easily, whereas the cloud cluster is virtualized and can be varied according to needs and availability. Creating a new virtual server (node) in the cloud could be done in seconds, but adding a new node to the real system depended on the availability of computers. A problem in a virtual server can be solved by terminating it and creating a new one, which is not possible in the real system. The real machines were connected with network cables, which introduced a small but insignificant delay in communication between nodes, whereas the cloud nodes were inside the same project, which made communication between them much faster. The most important aspect of the cloud is that server creation and termination is extraordinarily easy, so the cluster can be scaled up and down with great ease.
Another aspect of the cloud is that the cluster can be made public without difficulty: as the above screenshot shows, the master server's web interface is accessed through a browser on the client.

Conclusion

This paper analyzed running a Hadoop cluster in the cloud and on a real system, and sought to identify the better solution by running simple Hadoop jobs on the configured clusters. It concludes that running a Hadoop cluster in the cloud for data storage and analysis is more flexible and more easily scalable than running it on a real system. It also concludes that the cluster on real computers is faster than the cloud cluster. However, because of the advantageous features of cloud computing, such as quick termination of servers (nodes) when problems arise, re-creation of a node from the state in which it was terminated, automatic networking, and instant creation of nodes and clusters, the cloud Hadoop cluster would be more favorable.

References

[1] R. Buyya, C. S. Yeo, S. Venugopal, J. Broberg, and I. Brandic, Cloud computing and emerging IT platforms: Vision, hype, and reality for delivering computing as the 5th utility, Future Generation Computer Systems, 25:599-616.
[2] L. M. Vaquero, L. Rodero-Merino, J. Caceres, and M. Lindner, A break in the clouds: Towards a cloud definition, SIGCOMM Computer Communications Review, 39:50-55.
[3] Cloud Computing on Wikipedia, en.wikipedia.org/wiki/cloudcomputing
[4]
[5] Cloud Computing: Principles and Paradigms, edited by Rajkumar Buyya, James Broberg and Andrzej Goscinski
[6]
[7] OpenStack compute starter guide, May (Essex)
[8]
[9] Tom White, Hadoop: The Definitive Guide, O'Reilly Media, 2009
[10] Chuck Lam, Hadoop in Action, MEAP Unedited Draft, 2010
[11] Bruno Lowagie, iText in Action, 2010


More information

Distributed File System. MCSN N. Tonellotto Complements of Distributed Enabling Platforms

Distributed File System. MCSN N. Tonellotto Complements of Distributed Enabling Platforms Distributed File System 1 How do we get data to the workers? NAS Compute Nodes SAN 2 Distributed File System Don t move data to workers move workers to the data! Store data on the local disks of nodes

More information

Cloud Computing: Computing as a Service. Prof. Daivashala Deshmukh Maharashtra Institute of Technology, Aurangabad

Cloud Computing: Computing as a Service. Prof. Daivashala Deshmukh Maharashtra Institute of Technology, Aurangabad Cloud Computing: Computing as a Service Prof. Daivashala Deshmukh Maharashtra Institute of Technology, Aurangabad Abstract: Computing as a utility. is a dream that dates from the beginning from the computer

More information

A Survey on Cloud Computing

A Survey on Cloud Computing A Survey on Cloud Computing Poulami dalapati* Department of Computer Science Birla Institute of Technology, Mesra Ranchi, India dalapati89@gmail.com G. Sahoo Department of Information Technology Birla

More information

PLATFORM AND SOFTWARE AS A SERVICE THE MAPREDUCE PROGRAMMING MODEL AND IMPLEMENTATIONS

PLATFORM AND SOFTWARE AS A SERVICE THE MAPREDUCE PROGRAMMING MODEL AND IMPLEMENTATIONS PLATFORM AND SOFTWARE AS A SERVICE THE MAPREDUCE PROGRAMMING MODEL AND IMPLEMENTATIONS By HAI JIN, SHADI IBRAHIM, LI QI, HAIJUN CAO, SONG WU and XUANHUA SHI Prepared by: Dr. Faramarz Safi Islamic Azad

More information

How To Scale Out Of A Nosql Database

How To Scale Out Of A Nosql Database Firebird meets NoSQL (Apache HBase) Case Study Firebird Conference 2011 Luxembourg 25.11.2011 26.11.2011 Thomas Steinmaurer DI +43 7236 3343 896 thomas.steinmaurer@scch.at www.scch.at Michael Zwick DI

More information
