Speed-Up Extension to Hadoop System: A Survey of HDFS Data Placement

Sayali Ashok Shivarkar, Prof. Deepali Gatade
Computer Network, Sinhgad College of Engineering, Pune, India
sayalishivarkar20@gmail.com

A B S T R A C T

Apache's Hadoop is an open source implementation of Google's Map/Reduce, used for large-scale data analysis and storage. Hadoop decomposes a massive job into a number of smaller tasks, and it stores data in the Hadoop Distributed File System (HDFS), an open source implementation of the Google File System. HDFS stores each file as a sequence of blocks, which are replicated for fault tolerance; the block size and replication factor are configurable, and an application can specify the number of replicas of a file and change it later. The default block placement strategy, however, does not consider the characteristics of the data: blocks are placed randomly. An HDFS cluster has a master/slave architecture with a single Name Node as the master server, which manages the file system namespace and regulates access to files by clients, and a number of Data Nodes as slaves. A file is divided into one or more blocks, and these blocks are stored on a set of Data Nodes. Namespace operations such as opening, renaming and closing files and directories are handled by the Name Node, while the Data Nodes serve read and write requests. Strategic data partitioning, processing, replication and placement of data blocks can increase the performance of Hadoop, and much research is going on in this area. This paper reviews and surveys some of the major enhancements suggested to Hadoop, especially in data storage, processing and placement.

Index Terms: Apache's Hadoop, Map/Reduce, HDFS, Hashing Algorithm.

I. INTRODUCTION

Today we live in the data age. It is not easy to measure the total volume of data stored electronically, but an IDC study projects that the digital universe will reach 40 zettabytes (ZB) by 2020. Data is everywhere, and it is very large in size; large-scale distributed systems work on this big data. Apache's Hadoop is an open source implementation of Google's Map/Reduce for storing and analysing such large data sets. It enables distributed, data-intensive, parallel applications by decomposing a massive job into smaller tasks and a massive data set into smaller partitions, so that each task processes a different partition in parallel, tasks finish sooner, and the work is spread across the nodes. The main abstractions are: 1. Map tasks, which process the partitions of a data set using key/value pairs and generate a set of intermediate results, and 2. Reduce tasks, which merge all intermediate values associated with the same intermediate key.
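To make these two abstractions concrete, the following self-contained Python sketch simulates map, shuffle and reduce on a toy word-count job. It illustrates only the programming model, not Hadoop's actual code, and the input partitions are invented for the example.

from itertools import groupby
from operator import itemgetter

def map_task(partition):
    # Map: emit an intermediate (key, value) pair per word in the partition.
    for line in partition:
        for word in line.split():
            yield (word, 1)

def reduce_task(key, values):
    # Reduce: merge all intermediate values that share the same key.
    return (key, sum(values))

def run_job(partitions):
    # "Shuffle": collect the pairs from every map task and group them
    # by intermediate key, as the framework does between the two phases.
    pairs = sorted((p for part in partitions for p in map_task(part)),
                   key=itemgetter(0))
    return [reduce_task(key, (v for _, v in group))
            for key, group in groupby(pairs, key=itemgetter(0))]

if __name__ == "__main__":
    partitions = [["big data big"], ["data age"]]  # two input partitions
    print(run_job(partitions))  # [('age', 1), ('big', 2), ('data', 2)]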

File systems that manage the storage across a network of machines are called distributed file systems. Apache's Hadoop comes with a distributed file system called HDFS, the Hadoop Distributed File System, which is designed to hold very large amounts of data (terabytes or even petabytes) and to provide high-throughput access to that information. HDFS has a high degree of fault tolerance and is designed to be deployed on low-cost hardware; it provides efficient access to application data and is suitable for applications with big data sets. Files are stored as a series of blocks, which are replicated for fault tolerance. The cluster follows the master/slave architecture described above: the Name Node manages the file system namespace, regulates access to files and decides the mapping of blocks to Data Nodes, while the Data Nodes, usually one per node in the cluster, manage the storage attached to the nodes they run on.

The default block placement strategy of HDFS does not consider the characteristics of the data and places blocks randomly. Because of this random placement, the data needed to execute a map task may be spread over several nodes, which leads to network traffic. A further observation is that map tasks generate a large amount of intermediate data, and this abundant information is thrown away once processing finishes. It is also observed that some map tasks are executed again and again; Hadoop is not able to detect such duplicate tasks, which leads to wasted time. If similar items were clustered together, the performance of Hadoop could be improved, yet most past work provides no method for clustering in Hadoop, and the techniques that have been suggested [3, 4] provide some degree of co-location but require immense changes to the default framework and the physical data layout.
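The effect of the random placement just described can be illustrated with a small simulation. This is not HDFS code; the cluster size, node names, replication factor and block counts are invented for the illustration.

import random

DATA_NODES = [f"dn{i}" for i in range(100)]  # hypothetical 100-node cluster
REPLICATION = 3

def place_randomly(num_blocks):
    # Mimic a placement policy that ignores data characteristics:
    # each block's replicas land on an arbitrary set of nodes.
    return [set(random.sample(DATA_NODES, REPLICATION))
            for _ in range(num_blocks)]

# Two related files whose i-th blocks a map-side join would read together.
file_a = place_randomly(1000)
file_b = place_randomly(1000)

# A block pair can be processed locally only if some node holds both blocks.
local = sum(1 for a, b in zip(file_a, file_b) if a & b)
print(f"{local} of 1000 related block pairs share a node")  # typically ~90

With these numbers, fewer than one pair in ten ends up co-located, so a job that combines the two files must fetch most partner blocks over the network.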

II. EXISTING SYSTEMS

HADOOPDB

HadoopDB [1] is a hybrid system that combines the best features of parallel databases and Map/Reduce: it takes its performance and efficiency from parallel databases while providing the fault tolerance, scalability and flexibility of Map/Reduce. In HadoopDB, the database systems on multiple nodes are connected using Apache's Hadoop as the network communication layer and task coordinator. The system achieves fault tolerance and the ability to operate in heterogeneous environments by adopting the scheduling and job tracking features of Hadoop, and it achieves parallel database performance by executing most of the query processing inside the database engine. Advantages: 1. HadoopDB can perform like a parallel database while keeping high fault tolerance and the ability to run in heterogeneous environments. 2. It can cut down data processing time, especially on tasks that require complex query processing. Limitations: 1. It forces users to use a DBMS, and installing and configuring a parallel DBMS is quite difficult. 2. It changes the interface to SQL, so the programming model is not as simple as Map/Reduce. 3. It locally uses ACID-conforming DBMS engines, which affects the dynamic scheduling and fault tolerance of Hadoop.

HADOOP++

One way to overcome the limitations of HadoopDB is to build a system that keeps the Map/Reduce interface of Apache's Hadoop, approaches parallel databases in performance, and does not change the underlying Hadoop framework. Hadoop++ [2] is Apache's Hadoop augmented with Trojan indexes and Trojan joins, and it improves on the query runtimes of HadoopDB. The Trojan index improves the indexing capabilities of Hadoop; the basic assumption behind it is that the schema and the query workload are known in advance. Its main features are non-invasiveness (there is no need to create a distributed SQL engine on top of Hadoop), an optional index access path, the possibility of creating multiple indexes on the same split, and seamless splitting. Trojan indexes are created at data load time, and Trojan joins are created in the same way; the goal of the Trojan join is to avoid the reduce phase, since the data are already pre-partitioned. The algorithm accesses each input split and collects the records of the same co-group. Advantages: 1. The Trojan index increases performance and parallelism, since the map tasks process the index independently. 2. It keeps the outside view of the block intact. 3. The Trojan join makes it possible to process joins on the map side, allowing Map/Reduce jobs to run faster even for large volumes of data. 4. For selection queries and join operations, Hadoop++ runs faster than both HadoopDB and plain Hadoop, and no modification of the Map/Reduce interface is required. Limitations: 1. The arrival of new data, or the modification of existing data, necessitates reorganizing both the Trojan index and the Trojan join. 2. The Trojan index depends heavily on the block size, and an increased block size leads to greater index coverage. 3. As in a DBMS, the schema and query workload are assumed to be known in advance, making this a static solution. 4. Users are required to reorganize the input data.
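The Trojan index idea can be sketched as follows. This is a toy illustration of the description above (a per-block index built at load time over a block sorted on the indexed attribute); the record format and helper names are invented, and this is not the actual Hadoop++ code.

import bisect

def load_block(records, key):
    # At load time (as the block is written), sort the block on the
    # indexed attribute and keep the sorted keys as a small per-block
    # index; the index travels with the block, so the outside view of
    # the block stays intact.
    block = sorted(records, key=key)
    return block, [key(r) for r in block]

def indexed_lookup(block, index, wanted):
    # Optional index access path: binary search instead of a full scan.
    i = bisect.bisect_left(index, wanted)
    hits = []
    while i < len(index) and index[i] == wanted:
        hits.append(block[i])
        i += 1
    return hits

# Hypothetical records: (order_id, customer), indexed on customer.
block, index = load_block(
    [(1, "carol"), (2, "alice"), (3, "bob"), (4, "alice")],
    key=lambda r: r[1])
print(indexed_lookup(block, index, "alice"))  # [(2, 'alice'), (4, 'alice')]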

CO-HADOOP

Co-Hadoop [3] provides a solution for co-locating related files in HDFS: the file system is extended to provide co-location at the system level. A new file property is used to identify related files, and the placement policy of HDFS is modified accordingly; this modification retains the benefits of Apache's Hadoop, including load balancing and fault tolerance. The default block placement policy provides load balancing through an even data distribution across the Data Nodes. That policy works well for applications that access a single file, but applications that use data from multiple files can gain performance from custom-made placement, and Co-Hadoop is one such approach: it lets applications control data placement at the file-system level. Concretely, HDFS is extended with a new file-level property called the locator, which indicates where a file is to be stored. Files with the same locator value are placed on the same Data Node, while files without a locator value are placed using the default strategy. A locator table is added to the Name Node to track the association between locators and files. Advantages: 1. It is more flexible than HadoopDB and Hadoop++, improves performance without changing the Hadoop framework, and is best for applications that continually consume related data. Limitations: 1. It is slightly slower, owing to higher network utilization. 2. The indexing aspect is not improved, and detailed knowledge of the input data is required.
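A minimal model of the locator mechanism might look like the sketch below. The node names, locator value and placement rule are simplifying assumptions; the real system also has to respect space constraints and node failures.

import random

DATA_NODES = [f"dn{i}" for i in range(100)]
REPLICATION = 3

class LocatorTable:
    # Toy model of the Name Node's locator table: the first file seen
    # with a given locator fixes a node set, and every later file with
    # the same locator is placed on that same set.
    def __init__(self):
        self.placements = {}

    def place(self, filename, locator=None):
        if locator is None:
            # No locator: fall back to the default (random) strategy.
            return random.sample(DATA_NODES, REPLICATION)
        if locator not in self.placements:
            self.placements[locator] = random.sample(DATA_NODES, REPLICATION)
        return self.placements[locator]

table = LocatorTable()
a = table.place("visits.log", locator=42)
b = table.place("reference.dat", locator=42)  # co-located with visits.log
c = table.place("unrelated.dat")              # default random placement
print(a == b)  # True: the related files share one node set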

HAIL

HAIL (Hadoop Aggressive Indexing Library) is an enhancement of HDFS and Hadoop Map/Reduce that improves the performance of Map/Reduce jobs. The basic idea behind HAIL is to create indexes on the attributes of interest at load time, with minimal changes to Hadoop. HAIL achieves this by modifying the data loading pipeline so that an index is created on each replica as it is loaded into HDFS. Whereas stock Hadoop uses a pipeline that takes a block of data from the client and passes it unchanged to the replicas (three by default) sequentially, HAIL buffers the block at each replica, sorts it on the attribute being indexed, creates an index for the block, and flushes it to disk. Advantages: 1. HAIL improves both upload and query times. 2. The failover properties of Hadoop are unchanged, and HAIL works with existing Map/Reduce jobs, requiring only minimal changes to those jobs. Limitations: 1. Jobs can use only one index at a time, and the memory requirements are higher than in standard Hadoop.

ERMS (ELASTIC REPLICA MANAGEMENT SYSTEM)

Based on access patterns, data can be classified into three types: hot data, with a large number of concurrent accesses and a high intensity of access; cold data, which is unpopular and rarely accessed; and normal data, which is everything else. In a large and busy HDFS cluster, hot data receives concurrent and intense access, and replicating it on only three nodes is not adequate to avoid contention; for cold data, on the other hand, three replicas may produce unnecessary overhead. ERMS [4] introduces an active/standby storage model that uses a high-performance complex event processing engine to distinguish the data types in real time and applies an elastic replication policy to each type: it uses Condor to increase the replication factor of hot data on the standby nodes and to remove extra replicas of cold data, and erasure codes can be used to save storage space and network bandwidth when hot data turns cold. Advantages: 1. It improves data locality by keeping more replicas of hot data and fewer replicas of cold data. 2. It adapts dynamically to changes in access patterns and data popularity, with little network overhead. 3. It enhances the reliability and availability of data. Limitations: 1. Its efficiency depends on a number of threshold values, so those thresholds must be selected carefully. 2. Its memory requirement is high.
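A minimal sketch of threshold-based hot/cold classification in the spirit of ERMS is shown below. The thresholds, monitoring window and replica targets are invented for illustration; ERMS itself drives this decision with a complex event processing engine over real-time access streams.

from collections import Counter

# Hypothetical thresholds; as noted above, ERMS's efficiency hinges on
# choosing such values carefully.
HOT_THRESHOLD, COLD_THRESHOLD = 100, 5
REPLICAS = {"hot": 6, "cold": 2, "normal": 3}  # elastic replica targets

def classify(access_count):
    if access_count >= HOT_THRESHOLD:
        return "hot"
    if access_count <= COLD_THRESHOLD:
        return "cold"
    return "normal"

# Accesses observed in one monitoring window (invented data).
access_log = ["a.dat"] * 250 + ["b.dat"] * 40 + ["c.dat"] * 2
for name, count in Counter(access_log).items():
    kind = classify(count)
    print(name, kind, "->", REPLICAS[kind], "replicas")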

DARE

Placing data near the computation enhances performance in data-intensive applications. DARE [6] is an adaptive data replication mechanism that helps achieve a high degree of data locality by adapting to changing workload conditions. Each node independently runs an algorithm that, over short time intervals, creates replicas of heavily accessed files; as new replicas are created and old ones expire, data with correlated accesses spread across the nodes, which further improves locality. The algorithm creates replicas of popular files while minimizing the number of replicas of unpopular files. A greedy approach is used that incurs no extra network traffic, because it reuses data that is already being retrieved remotely: normally, when a map task is launched and the required data is on a remote node, the data is fetched and used without being stored locally, but DARE stores such remotely fetched data on the local node, thereby increasing the replication factor by one. A replication budget limits the storage consumed by the dynamically created replicas. An ElephantTrap [7], a mechanism for finding the largest flows on a network link, is used to identify the popular files to replicate, and a probabilistic approach is used for the ageing mechanism. Advantages: 1. It improves data locality with no extra network overhead, and both turnaround and slowdown times are improved. 2. It is scheduler-agnostic, so it can be used in parallel with other schemes. Limitations: 1. The storage requirements are high, more data structures and synchronization mechanisms are needed, and intense modification of Hadoop is required.

CHEETAH

Cheetah [5] is a data warehouse built on top of Hadoop with flexibility and scalability, combining data warehouse and Map/Reduce technologies and making it easy for SQL users to work with Hadoop. Since the data is stored in a columnar format, Cheetah incurs query-reconstruction overhead and cannot guarantee that all fields of a needed record are present on the same Data Node. It provides only an identical layout for all replicas, with no column grouping, and to process any data in a cell it has to decompress the whole cell, which results in unnecessary column reads and limited I/O throughput. Advantages: 1. It provides a simple query language that is easily understood by people with little SQL knowledge. 2. It utilizes Hadoop's optimization techniques for data compression, access methods and materialized views, and provides high performance, processing 1 GB of raw data per second. Limitations: 1. Query-reconstruction overhead, with no guarantee that all fields of a needed record are present on the same Data Node. 2. An identical layout for all replicas and no column grouping, leading to unnecessary column reads and limited I/O throughput. 3. Modification of the existing Hadoop framework is needed to incorporate the changes.

In the method proposed in this paper, the client does not need to provide a placement hint explicitly: the Hadoop Name Node can be used, which makes the placement process quite easy.

III. PROPOSED SYSTEM

Clustering is the classification of objects (here, documents) into groups such that the objects in the same group share some common characteristic. The common trait may be a feature vector that lies within a defined proximity of the feature vector of the cluster in which the object is placed. Clustering is used in many contexts, such as news article feeds, data mining, machine learning, pattern recognition and bioinformatics. Conventional clustering can be classified into hierarchical and partitional clustering: hierarchical clustering finds new clusters using the previously created ones, whereas partitional clustering finds all the clusters at once. The main problem with both is that they are computationally intensive; they require a lot of memory and reduce the clustering rate of the system. The growth of the internet has allowed massive dissemination of online data from sources such as Google and Yahoo, and conventional clustering methods are inadequate in these cases because of inaccuracies and long delays: they may need multiple iterations, and whenever an object arrives it may need to be compared with all of the existing clusters to find an appropriate one. Since the number of clusters is very large, this incurs much delay, so alternate methods are needed to improve the clustering of massive online data; the sketch below contrasts the cost of this per-object scan with a hash-based assignment.
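In this toy comparison, the one-dimensional data, the distance function and the bucket-width "signature" are stand-ins chosen only to make the cost difference visible; a real LSH signature is sketched in the next section.

import random
import time

def assign_naive(obj, centroids):
    # Compare the arriving object with every existing cluster: O(k) work.
    return min(range(len(centroids)), key=lambda i: abs(obj - centroids[i]))

def assign_hashed(obj, buckets):
    # One signature computation and one table lookup: O(1) work,
    # independent of how many clusters already exist.
    sig = int(obj // 10)  # toy stand-in for an LSH signature
    return buckets.setdefault(sig, len(buckets))

objs = [random.uniform(0, 1000) for _ in range(2000)]
centroids = [float(c) for c in range(1000)]  # 1000 existing clusters

t0 = time.perf_counter()
for o in objs:
    assign_naive(o, centroids)
t1 = time.perf_counter()
buckets = {}
for o in objs:
    assign_hashed(o, buckets)
t2 = time.perf_counter()
print(f"naive: {t1 - t0:.3f}s  hashed: {t2 - t1:.3f}s")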

Instead of the approaches mentioned above, a method is used for clustering related documents incrementally in HDFS using Locality Sensitive Hashing (LSH) [8]. The main idea behind LSH is that hash functions map data points to hash values, or signatures, such that similar data points collide with high probability; that is, similar data objects end up with similar signatures. The method has four steps. 1. Preprocessing the file: a file contains a collection of words, and preprocessing removes stop words (words such as "a", "of" and "the"), applies stemming (for example, "historical" is replaced with "history") and uses other such techniques. After preprocessing, the file contains the collection of words that relate to that particular file and can be used to represent it. 2. Building the file vector: from the preprocessed words, the ones that best represent the file are found using the TF-IDF (Term Frequency - Inverse Document Frequency) technique. TF-IDF finds the words that occur many times in a file compared with all the remaining files, which indicates that such a word is important in, and representative of, that file; these words can then be used to find similar files. 3. Creating the signature: to find similar files, a file would otherwise have to be compared with the contents of every other file, and with millions of files this is time-consuming. To make the process faster, a compact bit representation of each file vector, the signature, is created: an f-bit vector is initialized to zero, each word of the file vector is hashed, and at each bit position the word's weight is added or subtracted depending on whether the corresponding hash bit is 1 or 0; the signs of the final components yield the signature bits. The advantage of signatures is that similar files have the same, or nearly the same, signature, which makes comparison fast. 4. Using LSH to store related files together: a hash function is applied to the file signatures, and all files with the same hash value are stored on the same node in chunks. In addition, so that the system does not execute repeated tasks when a client submits a job, a cache is implemented in the manner of Dache [9]: a cache table stores the file name, the operation performed on that file, and the name of the result file. When a client wants to execute a map task, the cache manager is first asked to look up the file name and operation; if the same operation has already been performed on the same file, the result file name is handed directly to the reduce phase, which saves the entire execution time of the task.
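Step 3 is essentially a simhash-style construction. The sketch below shows one plausible reading of it; the hash function, the signature width f and the example TF-IDF weights are illustrative assumptions, not the exact scheme of [8].

import hashlib

def signature(weighted_words, f=64):
    # f-bit signature of a file vector: each word's hash votes, weighted
    # by the word's TF-IDF score, at every bit position; the sign of the
    # accumulated vote decides the final bit.
    v = [0.0] * f
    for word, weight in weighted_words.items():
        h = int(hashlib.md5(word.encode()).hexdigest(), 16)
        for i in range(f):
            v[i] += weight if (h >> i) & 1 else -weight
    return sum(1 << i for i in range(f) if v[i] > 0)

def hamming(a, b):
    # Similar files differ in few signature bits.
    return bin(a ^ b).count("1")

# Hypothetical TF-IDF-weighted file vectors after preprocessing.
doc1 = {"hadoop": 3.2, "replica": 2.1, "placement": 1.7}
doc2 = {"hadoop": 3.0, "replica": 2.2, "locality": 1.1}
doc3 = {"protein": 2.9, "genome": 2.5}

s1, s2, s3 = signature(doc1), signature(doc2), signature(doc3)
print(hamming(s1, s2), hamming(s1, s3))  # the related pair differs in fewer bits

Files whose signatures fall into the same LSH bucket can then be routed to the same Data Node, which is what step 4 does.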

IV. CONCLUSION

A new approach to incremental document clustering in HDFS is proposed, which clusters similar documents on the same set of Data Nodes with minimal changes to the existing framework. For faster clustering operations, bitwise representations of the feature vectors, called fingerprints or signatures, are used.

V. REFERENCES

[1] A. Abouzeid, K. Bajda-Pawlikowski, D. Abadi, A. Silberschatz and A. Rasin, "HadoopDB: An Architectural Hybrid of MapReduce and DBMS Technologies for Analytical Workloads," Proceedings of the VLDB Endowment, 2(1), August 2009.

[2] J. Dittrich, J.-A. Quiané-Ruiz, A. Jindal, Y. Kargin, V. Setty and J. Schad, "Hadoop++: Making a Yellow Elephant Run Like a Cheetah (Without It Even Noticing)," Proceedings of the VLDB Endowment, 3(2), September 2010.

[3] M. Y. Eltabakh, Y. Tian, F. Özcan, R. Gemulla, A. Krettek and J. McPherson, "CoHadoop: Flexible Data Placement and Its Exploitation in Hadoop," Proceedings of the VLDB Endowment, 4(9), June 2011.

[4] Z. Cheng et al., "An Elastic Replication Management System for HDFS," Workshop on Interfaces and Abstractions for Scientific Data Storage, IEEE Cluster, 2012.

[5] S. Chen, "Cheetah: A High Performance, Custom Data Warehouse on Top of MapReduce," Proceedings of the VLDB Endowment, 3(2), 2010.

[6] C. L. Abad, Y. Lu and R. H. Campbell, "DARE: Adaptive Data Replication for Efficient Cluster Scheduling," Proceedings of the IEEE International Conference on Cluster Computing (CLUSTER '11), 2011.

[7] Y. Lu, B. Prabhakar and F. Bonomi, "ElephantTrap: A Low Cost Device for Identifying Large Flows," Proceedings of the IEEE Symposium on High-Performance Interconnects (HOTI '07), 2007.

[8] K. A. Kala and K. Chitharanjan, "Locality Sensitive Hashing Based Incremental Clustering for Creating Affinity Groups in Hadoop HDFS - An Infrastructure Extension," International Conference on Circuits, Power and Computing Technologies (ICCPCT), pp. 1243-1249, March 2013.

[9] Y. Zhao and J. Wu, "Dache: A Data Aware Caching for Big-Data Applications Using the MapReduce Framework," Proceedings of IEEE INFOCOM, pp. 35-39, April 2013.
