Shared Disk Big Data Analytics with Apache Hadoop


Anirban Mukherjee, Joydip Datta, Raghavendra Jorapur, Ravi Singhvi, Saurav Haloi, Wasim Akram
Symantec Corporation, ICON, Baner Road, Pune, India

Abstract-- Big Data is a term applied to data sets whose size is beyond the ability of traditional software technologies to capture, store, manage and process within a tolerable elapsed time. The popular assumption around Big Data analytics is that it requires internet scale scalability: over hundreds of compute nodes with attached storage. In this paper, we debate the need for a massively scalable distributed computing platform for Big Data analytics in traditional businesses. For organizations which don't need horizontal, internet order scalability in their analytics platform, Big Data analytics can be built on top of a traditional POSIX cluster file system employing a shared storage model. In this study, we compared a widely used clustered file system, VERITAS Cluster File System (SF-CFS), with the Hadoop Distributed File System (HDFS) using popular Map-Reduce benchmarks like TeraSort, DFS-IO and GridMix on top of Apache Hadoop. In our experiments VxCFS not only matched the performance of HDFS, but also outperformed it in many cases. This way, enterprises can fulfill their Big Data analytics needs with a traditional and existing shared storage model, without migrating to a different storage model in their data centers. This also brings other benefits like stability and robustness, a rich set of features and compatibility with traditional analytics applications.

Keywords-- Big Data; Hadoop; Clustered File Systems; Analytics; Cloud

I. INTRODUCTION

The exponential growth of data over the last decade has introduced a new domain in the field of information technology called Big Data. Datasets that stretch the limits of traditional data processing and storage systems are often referred to as Big Data.
The need to process and analyze such massive datasets has introduced a new form of data analytics called Big Data Analytics. Big Data analytics involves analyzing large amounts of data of a variety of types to uncover hidden patterns, unknown correlations and other useful information. Many organizations are increasingly using Big Data analytics to get better insights into their businesses, increase their revenue and profitability and gain competitive advantages over rival organizations. The characteristics of Big Data can be broadly divided into four Vs: Volume, Velocity, Variety and Variability. Volume refers to the size of the data; Velocity refers to the pace at which data is generated; Variety and Variability describe the complexity and structure of the data and the different ways of interpreting it. A common notion about the applications which consume or analyze Big Data is that they require a massively scalable and parallel infrastructure. This notion is correct and makes sense for internet scale organizations like Facebook or Google. However, for traditional enterprise businesses this is typically not the case. As per the Apache Hadoop wiki [3], a significant number of Hadoop deployments in enterprises do not exceed 16 nodes. In such scenarios, the role of the traditional storage model, with shared storage and a clustered file system on top of it, in serving traditional as well as Big Data analytics needs cannot be totally ruled out. A Big Data analytics platform in today's world often refers to the Map-Reduce framework, developed by Google [4], and the tools and ecosystem built around it. The Map-Reduce framework provides a programming model using map and reduce functions over key-value pairs that can be executed in parallel on a large cluster of compute nodes. Apache Hadoop [1] is an open source implementation of Google's Map-Reduce model, and has become extremely popular over the years for building Big Data analytics platforms.
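As a concrete illustration of this programming model, the map and reduce functions for the canonical word-count job can be sketched in a few lines of Python (a simplified, single-process sketch of the model, not Hadoop's actual Java API; all names here are illustrative):

```python
from collections import defaultdict

# Map: emit a (word, 1) pair for every word in a line of input.
def map_fn(line):
    for word in line.split():
        yield (word, 1)

# Reduce: sum all counts emitted for the same key.
def reduce_fn(key, values):
    return (key, sum(values))

def run_job(lines):
    # Shuffle: group intermediate values by key, as the framework would
    # before handing each key's values to a reduce task.
    groups = defaultdict(list)
    for line in lines:
        for key, value in map_fn(line):
            groups[key].append(value)
    return dict(reduce_fn(k, vs) for k, vs in sorted(groups.items()))

print(run_job(["big data", "big analytics"]))
# → {'analytics': 1, 'big': 2, 'data': 1}
```

In a real cluster, the framework runs many map and reduce tasks in parallel and performs the grouping step over the network; the user supplies only the two functions.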
The other key aspect of Big Data analytics is to push the computation near the data. Generally, in a Map-Reduce environment, the compute and storage nodes are the same, i.e. the computational tasks run on the same set of nodes that hold the data required for the computations. By default, Apache Hadoop uses the Hadoop Distributed File System (HDFS) [2] as the underlying storage backend, but it is designed to work with other file systems as well. HDFS is not a POSIX-compliant file system, and once data is written it is not modifiable (a write-once, read-many access model). HDFS protects data by replicating data blocks across multiple nodes, with a default replication factor of 3.

In this paper, we try to gather a credible reasoning behind the need for a new non-POSIX storage stack for Big Data analytics and advocate, based on evaluation and analysis, that such a platform can be built on traditional POSIX based cluster file systems. Traditional cluster file systems are often viewed with the perception that they require expensive high end servers with a state of the art SAN. But contrary to such impressions, these file systems can be configured using commodity or mid-range servers at lower cost. More importantly, these file systems can support traditional applications that rely on POSIX APIs. The extensive availability of tools, software applications and human expertise are other add-ons of these file systems. Similar efforts have been undertaken by IBM Research [5], where they introduced the concept of a metablock in GPFS to enable a larger block granularity for Map/Reduce applications to coexist with the smaller block granularity required for traditional applications, and compared the performance of GPFS with HDFS for Map/Reduce workloads.

The rest of the paper is organized as follows. Section 2 describes the concept of shared disk Big Data analytics. Section 3 describes the architecture of the Hadoop connector for VERITAS Cluster File System. Section 4 describes our experimental setup, followed by our experiments and results in Section 5. Section 6 discusses additional use cases and benefits of our solution. Future work as a continuation of our current proposal is described in Section 7, followed by the conclusion and citations.

II.
SHARED DISK BIG DATA ANALYTICS

In our study we compare the performance of the Hadoop Distributed File System (HDFS), the de-facto file system in Apache Hadoop, with a commercial cluster file system, VERITAS Storage Foundation Cluster File System (SF-CFS) by Symantec, using a variety of workloads and map reduce applications. We show that a clustered file system can actually match the performance of HDFS for map/reduce workloads and can even outperform it in some cases. We have used VMware virtual machines as compute nodes in our cluster and a mid-level storage array (Hitachi HUS130) for our study. While we understand that comparing a clustered file system running on top of a SAN with a distributed file system running on local disks is not an apples to apples comparison, the study is mostly directed towards getting a proper and correct reasoning (if any) behind the notion of introducing a new storage model for Big Data analytics in the datacenters of enterprises and organizations which are not operating at an internet scale. To have an estimate, we have also run the same workloads with HDFS in a SAN environment. Both SF-CFS and HDFS have been configured with their default settings/tunables in our experiments. We have developed a file system connector module for SF-CFS to make it work inside the Apache Hadoop platform as the backend file system, replacing HDFS altogether, and have also taken advantage of SF-CFS's potential by implementing the native interfaces from this module. Our shared disk Big Data analytics solution doesn't need any change in the Map Reduce applications. Just by setting a few parameters in the configuration of Apache Hadoop, the whole Big Data analytics platform can be brought up and running very quickly.

III. ARCHITECTURE

The clustered file system connector module we developed for the Apache Hadoop platform has a very simple architecture. It removes the HDFS functionality from the Hadoop stack and replaces it with VERITAS Cluster File System.
It introduces SF-CFS to Hadoop by implementing the APIs which are used for communication between the Map/Reduce framework and the file system. This is possible because the Map-Reduce framework always talks in terms of a well-defined FileSystem [6] API for each data access. The FileSystem API is an abstract class which the file serving technology underneath Hadoop must implement. Both HDFS and our clustered file system connector module implement this FileSystem class, as shown in Figure 1.

Figure 1. Architecture of SF-CFS Hadoop Connector

VERITAS Cluster File System being a parallel shared data file system, the file system namespace and the data are available to all the nodes in the cluster at any given point of time. Unlike HDFS, where a NameNode maintains the metadata information of the whole file system namespace, with SF-CFS all the nodes in the cluster can serve the metadata. Hence a query from the Map Reduce framework pertaining to data locality can always be resolved by the compute node itself. The benefit of such a resolution is the elimination of the extra hops traversed with HDFS in scenarios when data is not locally available. Also, the data need not be replicated across data nodes in the case of a clustered file system.
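Plugging an alternative FileSystem implementation into Hadoop 1.x is then a matter of configuration. As a sketch (the URI scheme, property values and connector class name shown here are illustrative assumptions, not the names shipped with the Symantec solution), a core-site.xml could register a connector like this:

```xml
<!-- core-site.xml (sketch): scheme "cfs" and class name are
     illustrative assumptions, not the actual shipped values. -->
<configuration>
  <!-- Make the connector's URI scheme the default file system. -->
  <property>
    <name>fs.default.name</name>
    <value>cfs:///</value>
  </property>
  <!-- Map the scheme to the FileSystem implementation class. -->
  <property>
    <name>fs.cfs.impl</name>
    <value>com.example.hadoop.cfs.CFSFileSystem</value>
  </property>
</configuration>
```

Because Map/Reduce jobs only see the abstract FileSystem class, they run unmodified once such a mapping is in place.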

Since all the nodes have access to the data, we can say that the replication factor in SF-CFS is equivalent to HDFS with a replication factor equal to the number of nodes in the cluster. This architecture does away with the risk of losing data when a data node dies before the minimum replication was achieved for that chunk of data. The use of RAID technologies and vendor SLAs for the storage arrays used in a SAN environment contributes to the overall reliability of the data.

IV. EXPERIMENTAL SETUP

In this study on shared disk Big Data analytics, we have compared HDFS (Apache Hadoop 1.0.2), which is the default file system of Apache Hadoop, and Symantec Corporation's VERITAS Cluster File System (SFCFSHA 6.0), which is widely deployed by enterprises and organizations in banking, financial, telecom, aviation and various other sectors. The hardware configuration for our evaluation comprises an 8 node cluster of VMware virtual machines on ESX 4.1. Each VM is hosted on an individual ESX host and has 8 vCPUs of 2.67GHz and 16GB physical memory. The cluster nodes are interconnected with a 1Gbps network link dedicated to Map Reduce traffic through a D-Link switch. Shared storage for the clustered file system is carved from SAS disks of a mid-range Hitachi HUS130 array, and direct attached storage for HDFS is made available from the local SAS disks of the ESX hosts. Each compute node virtual machine runs Linux (RHEL 6.2). The performance of HDFS has been measured both with SAN as well as DAS. The setup for HDFS-SAN consists of the same storage LUNs used for SF-CFS, but configured in such a way that no two nodes see the same storage, so as to emulate a local disk scenario. The HDFS-Local setup uses the DAS of each of the compute nodes. In both cases, we used ext4 as the primary file system.
The following table summarizes the various scenarios we compared:

Scenario         Description
SF-CFS           Our solution in SAN
HDFS-SAN (1)     HDFS in SAN with replication factor 1
HDFS-SAN (3)     HDFS in SAN with replication factor 3
HDFS-Local (1)   HDFS in local disks (DAS) with replication factor 1
HDFS-Local (3)   HDFS in local disks (DAS) with replication factor 3

V. EXPERIMENTS

We have used TeraSort, TestDFSIO, MRBench and GridMix3 [7] for comparing the performance of SF-CFS and HDFS. These are widely used map/reduce benchmarks and are available pre-packaged inside the Apache Hadoop distribution. In our performance evaluation, for TestDFSIO and TeraSort, we have done the comparison for replication factors of 1 as well as 3. We have used a block size of 64MB for both HDFS (dfs.blocksize) and SF-CFS (fs.local.block.size) in our experiments.

TeraSort: TeraSort is a Map Reduce application that performs a parallel merge sort on the keys in the data set generated by TeraGen. It is a benchmark that combines testing the HDFS and Map Reduce layers of a Hadoop cluster. A full TeraSort benchmark run consists of the following three steps:
1. Generate the input data with TeraGen
2. Run TeraSort on the input data
3. Validate the sorted output data using TeraValidate

Figure 2: TeraSort

Hadoop TeraSort is a Map Reduce job with a custom partitioner that uses a sorted list of n-1 sampled keys to define the key range for each reduce. Figure 2 above illustrates the behavior of the TeraSort benchmark for dataset sizes of 10GB and 100GB. As observed, SF-CFS performs better than HDFS in all the different scenarios.

TABLE 1. TIME TAKEN FOR TERASORT (LOWER IS BETTER)
Dataset (GB)   SF-CFS   HDFS-SAN(1)   HDFS-SAN(3)   HDFS-Local(1)   HDFS-Local(3)

In the map/reduce framework, the number of map tasks for a job is proportional to the input dataset size for a constant file system block size. Hence, an increase in dataset size leads to higher concurrency and load at the file system as well as the storage layer in the case of a shared file system. Due to this, the performance gap between SF-CFS and HDFS is observed to decrease with increasing dataset size.

TestDFSIO: TestDFSIO is a distributed I/O benchmark which tests the I/O performance of the file system in a Hadoop cluster. It does this by using a Map Reduce job as a convenient way to read or write files in parallel. Each file is read or written in a separate map task [8]. In our study, the performance tests are done on 64MB files, varying the number of files for the different test scenarios as illustrated in Figures 3 and 4 below. It has been observed that HDFS significantly outperforms SF-CFS in read for both replication factors of 1 and 3. This is due to the fact that HDFS pre-fetches an entire chunk of data equal to the block size and doesn't suffer from any cache coherence issues, thanks to its write-once semantics. HDFS-Local(3) gives the added advantage of read parallelism equal to the number of compute nodes, assuming the blocks are evenly distributed/replicated across all nodes, which a shared file system lacks. In TestDFSIO write, it is observed that HDFS with DAS and a replication factor of 1 outperforms SF-CFS. This performance improvement, however, comes at the cost of data loss in the event of node failures. In all other cases, SF-CFS performs similar to or better than HDFS for the TestDFSIO write workload.

Figure 3: TestDFSIO Read (higher is better)
Figure 4: TestDFSIO Write (higher is better)

MRBench: MRBench benchmarks a Hadoop cluster by running small jobs repeated a number of times. It checks the responsiveness of the Map Reduce framework running in a cluster for small jobs. It puts its focus on the Map Reduce layer, as its impact on the file system layer of Hadoop is minimal. In our evaluation we ran MRBench jobs repeated 50 times for SF-CFS, HDFS in SAN and HDFS in local disks, with a replication factor of 3. The average response time reported by MRBench in milliseconds was found to be best for SF-CFS:

TABLE 2. RESPONSE TIME OF MRBENCH (LOWER IS BETTER)
Use Case         AvgTime (msec)
SF-CFS
HDFS-SAN(3)
HDFS-Local(3)

GridMix3: GridMix3 is used to simulate Map Reduce load on a Hadoop cluster by emulating real load mined from production Hadoop clusters. The goal of GridMix3 is to generate a realistic workload on a cluster to validate cluster utilization and to measure Map Reduce as well as file system performance by replaying job traces from Hadoop clusters that automatically capture the essential ingredients of job executions. In our experiments, we used the job trace available from Apache Subversion, for dataset sizes of 64GB, 128GB and 256GB. We observed that SF-CFS performed better than HDFS in SAN as well as in local disks with a replication factor of 3.
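The total-order partitioning idea behind TeraSort, described earlier, can be sketched in a few lines of Python (an illustrative model with hypothetical names, not Hadoop's actual partitioner code): n-1 sampled boundary keys split the key space into n contiguous ranges, one per reducer, so that concatenating the sorted reducer outputs yields a globally sorted result.

```python
import bisect

def make_partitioner(split_keys):
    """Return a function mapping a key to a reducer index.

    split_keys: n-1 sampled boundary keys for n reducers; reducer i
    receives all keys falling in the i-th range of the key space.
    """
    boundaries = sorted(split_keys)
    def partition(key):
        # bisect_left finds the first boundary >= key, i.e. the range index.
        return bisect.bisect_left(boundaries, key)
    return partition

# 2 sampled keys -> 3 reducers covering (-inf,"g"], ("g","p"], ("p",+inf)
part = make_partitioner(["g", "p"])
assert part("apple") == 0
assert part("house") == 1
assert part("zebra") == 2
```

Each reducer then sorts its own range; because every key sent to reducer i precedes every key sent to reducer i+1, the output partitions are globally ordered without any final merge.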

Figure 5: Time (s) taken by GridMix3 (lower is better)

In the course of our study, we also compared the performance of SF-CFS with HDFS using the SWIM [9] benchmark by running Facebook job traces, and observed SF-CFS to perform better than or at par with HDFS. SWIM contains suites of workloads of thousands of jobs, with complex data, arrival and computation patterns, which enables rigorous performance measurement of Map/Reduce systems.

VI. ADDITIONAL CONSIDERATIONS

In addition to the comparable performance exhibited by SF-CFS for various Map/Reduce workloads and applications, SF-CFS provides the benefits of being a robust, stable and highly reliable file system. It gives the ability to run analytics on top of existing data using existing analytics tools and applications, which eliminates the need for copy-in and copy-out of data from a Hadoop cluster, saving a significant amount of time. SF-CFS also supports data ingestion over NFS. Along with all this, it brings in other standard features like snapshots, compression, file level replication and de-duplication. For example, gzip compression of the input splits is not possible with HDFS, as it is impossible to start reading at an arbitrary point in a gzip stream, and a map task can't read its split independently of the others [8]. However, if compression is enabled in SF-CFS, the file system performs the decompression and returns the data to applications transparently. Data backup and disaster recovery are other built-in benefits of using SF-CFS for Big Data analytics. The SF-CFS solution for Hadoop, also known as Symantec Enterprise Solution for Hadoop, is available as a free download for SF-CFS customers of Symantec Corporation [10].

VII. FUTURE WORK

During our study of the performance exhibited by a commercial cluster file system for Map/Reduce workloads and its comparison with a distributed file system, we observed that a significant amount of time is spent in the copy phase of the Map/Reduce model after the map tasks finish.
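The shared file system alternative to this copy phase can be illustrated with a toy sketch (single-process Python, with a temporary directory standing in for a cluster-wide CFS mount; all function names here are hypothetical): map tasks write their intermediate files into the shared file system, and reduce tasks read them in place, so no HTTP transfer is needed.

```python
import os
import tempfile

# Stands in for a clustered file system mount visible to all nodes.
shared_fs = tempfile.mkdtemp()

def map_task(task_id, records):
    # A map task writes its intermediate (key, value) pairs to the
    # shared file system instead of node-local disk.
    path = os.path.join(shared_fs, f"map-{task_id}.out")
    with open(path, "w") as f:
        for key, value in records:
            f.write(f"{key}\t{value}\n")
    return path

def reduce_task(paths):
    # No copy phase: every node sees every intermediate file directly.
    totals = {}
    for path in paths:
        with open(path) as f:
            for line in f:
                key, value = line.rstrip("\n").split("\t")
                totals[key] = totals.get(key, 0) + int(value)
    return totals

p1 = map_task(0, [("a", 1), ("b", 2)])
p2 = map_task(1, [("a", 3)])
print(reduce_task([p1, p2]))  # → {'a': 4, 'b': 2}
```

A real implementation would instead modify Hadoop's shuffle machinery, but the data flow is the same: the reducer's input is read where it was written.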
In the Hadoop platform, the input and output data of Map/Reduce jobs are stored in HDFS, while the intermediate data generated by map tasks is stored in the local file system of the mapper nodes and is copied (shuffled) via HTTP to the reducer nodes. The time taken to copy these intermediate map outputs increases proportionally with the size of the data. However, since in the case of a clustered file system all the nodes see all the data, this copy phase can be avoided by keeping the intermediate data in the clustered file system as well and having the reducer nodes read it directly from there. This would completely eliminate the copy phase after map is over and is bound to give a significant boost to the overall performance of Map/Reduce jobs. It will require changes in the logic and code of the Map/Reduce framework implemented inside Apache Hadoop, and we leave it as future work.

VIII. CONCLUSIONS

From all the performance benchmark numbers and their analysis, it can be confidently reasoned that for Big Data analytics needs, the traditional shared storage model cannot be totally ruled out. While, due to architectural and design constraints, a cluster file system may not scale at the same rate as a shared nothing model does, for use cases where internet order scalability is not required, a clustered file system can do a decent job even in the Big Data analytics domain. A clustered file system like SF-CFS can provide numerous other benefits with its plethora of features. This decade has seen the success of virtualization, which introduced the recent trends of server consolidation and green computing initiatives in enterprises. Big Data analytics with a clustered file system on the existing infrastructure aligns with this model and direction. A careful study of the needs and use cases is required before building a Big Data analytics platform, rather than going with the notion that the shared nothing model is the only answer to Big Data needs.
ACKNOWLEDGEMENTS

We would like to thank the anonymous reviewers at Symantec for their feedback on this work. Niranjan Pendharkar, Mayuresh Kulkarni and Yatish Jain contributed to the early design of the clustered file system connector module for the Apache Hadoop platform.

REFERENCES
[1] Apache Hadoop. hadoop.apache.org
[2] K. Shvachko, H. Kuang, S. Radia and R. Chansler, "The Hadoop Distributed File System," Mass Storage Systems and Technologies (MSST), 2010 IEEE 26th Symposium, 3-7 May 2010.
[3] Powered by Hadoop.
[4] J. Dean and S. Ghemawat, "MapReduce: Simplified data processing on large clusters," in Sixth Symposium on Operating System Design and Implementation, December 2004.
[5] R. Ananthanarayanan, K. Gupta, P. Pandey, H. Pucha, P. Sarkar, M. Shah and R. Tewari, IBM Research, "Cloud Analytics: Do We Really Need to Reinvent the Storage Stack?" USENIX HotCloud '09.
[6] Apache Hadoop FileSystem API. hadoop.apache.org/common/docs/current/api/org/apache/hadoop/fs/filesystem.html
[7] GridMix3. developer.yahoo.com/blogs/hadoop/posts/2010/04/gridmix3_emulating_production/
[8] T. White, Hadoop: The Definitive Guide, Third Edition, O'Reilly.
[9] SWIMProjectUCB. github.com/SWIMProjectUCB/SWIM/wiki
[10] Symantec Enterprise Solution for Hadoop. symantec.com/enterprise-solution-for-hadoop


Jeffrey D. Ullman slides. MapReduce for data intensive computing Jeffrey D. Ullman slides MapReduce for data intensive computing Single-node architecture CPU Machine Learning, Statistics Memory Classical Data Mining Disk Commodity Clusters Web data sets can be very

More information

Simplifying Storage Operations By David Strom (published 3.15 by VMware) Introduction

Simplifying Storage Operations By David Strom (published 3.15 by VMware) Introduction Simplifying Storage Operations By David Strom (published 3.15 by VMware) Introduction There are tectonic changes to storage technology that the IT industry hasn t seen for many years. Storage has been

More information

Hadoop Architecture. Part 1

Hadoop Architecture. Part 1 Hadoop Architecture Part 1 Node, Rack and Cluster: A node is simply a computer, typically non-enterprise, commodity hardware for nodes that contain data. Consider we have Node 1.Then we can add more nodes,

More information

CS2510 Computer Operating Systems

CS2510 Computer Operating Systems CS2510 Computer Operating Systems HADOOP Distributed File System Dr. Taieb Znati Computer Science Department University of Pittsburgh Outline HDF Design Issues HDFS Application Profile Block Abstraction

More information

CS2510 Computer Operating Systems

CS2510 Computer Operating Systems CS2510 Computer Operating Systems HADOOP Distributed File System Dr. Taieb Znati Computer Science Department University of Pittsburgh Outline HDF Design Issues HDFS Application Profile Block Abstraction

More information

Cloud Computing at Google. Architecture

Cloud Computing at Google. Architecture Cloud Computing at Google Google File System Web Systems and Algorithms Google Chris Brooks Department of Computer Science University of San Francisco Google has developed a layered system to handle webscale

More information

Department of Computer Science University of Cyprus EPL646 Advanced Topics in Databases. Lecture 14

Department of Computer Science University of Cyprus EPL646 Advanced Topics in Databases. Lecture 14 Department of Computer Science University of Cyprus EPL646 Advanced Topics in Databases Lecture 14 Big Data Management IV: Big-data Infrastructures (Background, IO, From NFS to HFDS) Chapter 14-15: Abideboul

More information

Reducer Load Balancing and Lazy Initialization in Map Reduce Environment S.Mohanapriya, P.Natesan

Reducer Load Balancing and Lazy Initialization in Map Reduce Environment S.Mohanapriya, P.Natesan Reducer Load Balancing and Lazy Initialization in Map Reduce Environment S.Mohanapriya, P.Natesan Abstract Big Data is revolutionizing 21st-century with increasingly huge amounts of data to store and be

More information

With Red Hat Enterprise Virtualization, you can: Take advantage of existing people skills and investments

With Red Hat Enterprise Virtualization, you can: Take advantage of existing people skills and investments RED HAT ENTERPRISE VIRTUALIZATION DATASHEET RED HAT ENTERPRISE VIRTUALIZATION AT A GLANCE Provides a complete end-toend enterprise virtualization solution for servers and desktop Provides an on-ramp to

More information

Whitepaper. NexentaConnect for VMware Virtual SAN. Full Featured File services for Virtual SAN

Whitepaper. NexentaConnect for VMware Virtual SAN. Full Featured File services for Virtual SAN Whitepaper NexentaConnect for VMware Virtual SAN Full Featured File services for Virtual SAN Table of Contents Introduction... 1 Next Generation Storage and Compute... 1 VMware Virtual SAN... 2 Highlights

More information

Hadoop MapReduce over Lustre* High Performance Data Division Omkar Kulkarni April 16, 2013

Hadoop MapReduce over Lustre* High Performance Data Division Omkar Kulkarni April 16, 2013 Hadoop MapReduce over Lustre* High Performance Data Division Omkar Kulkarni April 16, 2013 * Other names and brands may be claimed as the property of others. Agenda Hadoop Intro Why run Hadoop on Lustre?

More information

A Novel Cloud Based Elastic Framework for Big Data Preprocessing

A Novel Cloud Based Elastic Framework for Big Data Preprocessing School of Systems Engineering A Novel Cloud Based Elastic Framework for Big Data Preprocessing Omer Dawelbeit and Rachel McCrindle October 21, 2014 University of Reading 2008 www.reading.ac.uk Overview

More information

Performance Characteristics of VMFS and RDM VMware ESX Server 3.0.1

Performance Characteristics of VMFS and RDM VMware ESX Server 3.0.1 Performance Study Performance Characteristics of and RDM VMware ESX Server 3.0.1 VMware ESX Server offers three choices for managing disk access in a virtual machine VMware Virtual Machine File System

More information

Elasticsearch on Cisco Unified Computing System: Optimizing your UCS infrastructure for Elasticsearch s analytics software stack

Elasticsearch on Cisco Unified Computing System: Optimizing your UCS infrastructure for Elasticsearch s analytics software stack Elasticsearch on Cisco Unified Computing System: Optimizing your UCS infrastructure for Elasticsearch s analytics software stack HIGHLIGHTS Real-Time Results Elasticsearch on Cisco UCS enables a deeper

More information

Distributed File Systems

Distributed File Systems Distributed File Systems Paul Krzyzanowski Rutgers University October 28, 2012 1 Introduction The classic network file systems we examined, NFS, CIFS, AFS, Coda, were designed as client-server applications.

More information

Chapter 11 Map-Reduce, Hadoop, HDFS, Hbase, MongoDB, Apache HIVE, and Related

Chapter 11 Map-Reduce, Hadoop, HDFS, Hbase, MongoDB, Apache HIVE, and Related Chapter 11 Map-Reduce, Hadoop, HDFS, Hbase, MongoDB, Apache HIVE, and Related Summary Xiangzhe Li Nowadays, there are more and more data everyday about everything. For instance, here are some of the astonishing

More information

PARALLELS CLOUD STORAGE

PARALLELS CLOUD STORAGE PARALLELS CLOUD STORAGE Performance Benchmark Results 1 Table of Contents Executive Summary... Error! Bookmark not defined. Architecture Overview... 3 Key Features... 5 No Special Hardware Requirements...

More information

Apache Hadoop. Alexandru Costan

Apache Hadoop. Alexandru Costan 1 Apache Hadoop Alexandru Costan Big Data Landscape No one-size-fits-all solution: SQL, NoSQL, MapReduce, No standard, except Hadoop 2 Outline What is Hadoop? Who uses it? Architecture HDFS MapReduce Open

More information

High Availability with Windows Server 2012 Release Candidate

High Availability with Windows Server 2012 Release Candidate High Availability with Windows Server 2012 Release Candidate Windows Server 2012 Release Candidate (RC) delivers innovative new capabilities that enable you to build dynamic storage and availability solutions

More information

Hadoop: A Framework for Data- Intensive Distributed Computing. CS561-Spring 2012 WPI, Mohamed Y. Eltabakh

Hadoop: A Framework for Data- Intensive Distributed Computing. CS561-Spring 2012 WPI, Mohamed Y. Eltabakh 1 Hadoop: A Framework for Data- Intensive Distributed Computing CS561-Spring 2012 WPI, Mohamed Y. Eltabakh 2 What is Hadoop? Hadoop is a software framework for distributed processing of large datasets

More information

Implement Hadoop jobs to extract business value from large and varied data sets

Implement Hadoop jobs to extract business value from large and varied data sets Hadoop Development for Big Data Solutions: Hands-On You Will Learn How To: Implement Hadoop jobs to extract business value from large and varied data sets Write, customize and deploy MapReduce jobs to

More information

Hadoop & its Usage at Facebook

Hadoop & its Usage at Facebook Hadoop & its Usage at Facebook Dhruba Borthakur Project Lead, Hadoop Distributed File System [email protected] Presented at the Storage Developer Conference, Santa Clara September 15, 2009 Outline Introduction

More information

Reduction of Data at Namenode in HDFS using harballing Technique

Reduction of Data at Namenode in HDFS using harballing Technique Reduction of Data at Namenode in HDFS using harballing Technique Vaibhav Gopal Korat, Kumar Swamy Pamu [email protected] [email protected] Abstract HDFS stands for the Hadoop Distributed File System.

More information

A Brief Introduction to Apache Tez

A Brief Introduction to Apache Tez A Brief Introduction to Apache Tez Introduction It is a fact that data is basically the new currency of the modern business world. Companies that effectively maximize the value of their data (extract value

More information

Welcome to the unit of Hadoop Fundamentals on Hadoop architecture. I will begin with a terminology review and then cover the major components

Welcome to the unit of Hadoop Fundamentals on Hadoop architecture. I will begin with a terminology review and then cover the major components Welcome to the unit of Hadoop Fundamentals on Hadoop architecture. I will begin with a terminology review and then cover the major components of Hadoop. We will see what types of nodes can exist in a Hadoop

More information

Microsoft Private Cloud Fast Track

Microsoft Private Cloud Fast Track Microsoft Private Cloud Fast Track Microsoft Private Cloud Fast Track is a reference architecture designed to help build private clouds by combining Microsoft software with Nutanix technology to decrease

More information

GeoGrid Project and Experiences with Hadoop

GeoGrid Project and Experiences with Hadoop GeoGrid Project and Experiences with Hadoop Gong Zhang and Ling Liu Distributed Data Intensive Systems Lab (DiSL) Center for Experimental Computer Systems Research (CERCS) Georgia Institute of Technology

More information

Introduction to Hadoop

Introduction to Hadoop Introduction to Hadoop 1 What is Hadoop? the big data revolution extracting value from data cloud computing 2 Understanding MapReduce the word count problem more examples MCS 572 Lecture 24 Introduction

More information

OnX Big Data Reference Architecture

OnX Big Data Reference Architecture OnX Big Data Reference Architecture Knowledge is Power when it comes to Business Strategy The business landscape of decision-making is converging during a period in which: > Data is considered by most

More information

Take An Internal Look at Hadoop. Hairong Kuang Grid Team, Yahoo! Inc [email protected]

Take An Internal Look at Hadoop. Hairong Kuang Grid Team, Yahoo! Inc hairong@yahoo-inc.com Take An Internal Look at Hadoop Hairong Kuang Grid Team, Yahoo! Inc [email protected] What s Hadoop Framework for running applications on large clusters of commodity hardware Scale: petabytes of data

More information

A Survey of Shared File Systems

A Survey of Shared File Systems Technical Paper A Survey of Shared File Systems Determining the Best Choice for your Distributed Applications A Survey of Shared File Systems A Survey of Shared File Systems Table of Contents Introduction...

More information

HadoopTM Analytics DDN

HadoopTM Analytics DDN DDN Solution Brief Accelerate> HadoopTM Analytics with the SFA Big Data Platform Organizations that need to extract value from all data can leverage the award winning SFA platform to really accelerate

More information

Scala Storage Scale-Out Clustered Storage White Paper

Scala Storage Scale-Out Clustered Storage White Paper White Paper Scala Storage Scale-Out Clustered Storage White Paper Chapter 1 Introduction... 3 Capacity - Explosive Growth of Unstructured Data... 3 Performance - Cluster Computing... 3 Chapter 2 Current

More information

Big Data: A Storage Systems Perspective Muthukumar Murugan Ph.D. HP Storage Division

Big Data: A Storage Systems Perspective Muthukumar Murugan Ph.D. HP Storage Division Big Data: A Storage Systems Perspective Muthukumar Murugan Ph.D. HP Storage Division In this talk Big data storage: Current trends Issues with current storage options Evolution of storage to support big

More information

CSE 590: Special Topics Course ( Supercomputing ) Lecture 10 ( MapReduce& Hadoop)

CSE 590: Special Topics Course ( Supercomputing ) Lecture 10 ( MapReduce& Hadoop) CSE 590: Special Topics Course ( Supercomputing ) Lecture 10 ( MapReduce& Hadoop) Rezaul A. Chowdhury Department of Computer Science SUNY Stony Brook Spring 2016 MapReduce MapReduce is a programming model

More information

Networking in the Hadoop Cluster

Networking in the Hadoop Cluster Hadoop and other distributed systems are increasingly the solution of choice for next generation data volumes. A high capacity, any to any, easily manageable networking layer is critical for peak Hadoop

More information

TECHNICAL PAPER. Veeam Backup & Replication with Nimble Storage

TECHNICAL PAPER. Veeam Backup & Replication with Nimble Storage TECHNICAL PAPER Veeam Backup & Replication with Nimble Storage Document Revision Date Revision Description (author) 11/26/2014 1. 0 Draft release (Bill Roth) 12/23/2014 1.1 Draft update (Bill Roth) 2/20/2015

More information

I/O Considerations in Big Data Analytics

I/O Considerations in Big Data Analytics Library of Congress I/O Considerations in Big Data Analytics 26 September 2011 Marshall Presser Federal Field CTO EMC, Data Computing Division 1 Paradigms in Big Data Structured (relational) data Very

More information

A Performance Analysis of Distributed Indexing using Terrier

A Performance Analysis of Distributed Indexing using Terrier A Performance Analysis of Distributed Indexing using Terrier Amaury Couste Jakub Kozłowski William Martin Indexing Indexing Used by search

More information

Analysing Large Web Log Files in a Hadoop Distributed Cluster Environment

Analysing Large Web Log Files in a Hadoop Distributed Cluster Environment Analysing Large Files in a Hadoop Distributed Cluster Environment S Saravanan, B Uma Maheswari Department of Computer Science and Engineering, Amrita School of Engineering, Amrita Vishwa Vidyapeetham,

More information

Keywords: Big Data, HDFS, Map Reduce, Hadoop

Keywords: Big Data, HDFS, Map Reduce, Hadoop Volume 5, Issue 7, July 2015 ISSN: 2277 128X International Journal of Advanced Research in Computer Science and Software Engineering Research Paper Available online at: www.ijarcsse.com Configuration Tuning

More information

How To Scale Out Of A Nosql Database

How To Scale Out Of A Nosql Database Firebird meets NoSQL (Apache HBase) Case Study Firebird Conference 2011 Luxembourg 25.11.2011 26.11.2011 Thomas Steinmaurer DI +43 7236 3343 896 [email protected] www.scch.at Michael Zwick DI

More information

Leveraging BlobSeer to boost up the deployment and execution of Hadoop applications in Nimbus cloud environments on Grid 5000

Leveraging BlobSeer to boost up the deployment and execution of Hadoop applications in Nimbus cloud environments on Grid 5000 Leveraging BlobSeer to boost up the deployment and execution of Hadoop applications in Nimbus cloud environments on Grid 5000 Alexandra Carpen-Amarie Diana Moise Bogdan Nicolae KerData Team, INRIA Outline

More information

Weekly Report. Hadoop Introduction. submitted By Anurag Sharma. Department of Computer Science and Engineering. Indian Institute of Technology Bombay

Weekly Report. Hadoop Introduction. submitted By Anurag Sharma. Department of Computer Science and Engineering. Indian Institute of Technology Bombay Weekly Report Hadoop Introduction submitted By Anurag Sharma Department of Computer Science and Engineering Indian Institute of Technology Bombay Chapter 1 What is Hadoop? Apache Hadoop (High-availability

More information

marlabs driving digital agility WHITEPAPER Big Data and Hadoop

marlabs driving digital agility WHITEPAPER Big Data and Hadoop marlabs driving digital agility WHITEPAPER Big Data and Hadoop Abstract This paper explains the significance of Hadoop, an emerging yet rapidly growing technology. The prime goal of this paper is to unveil

More information

Introduction to Hadoop. New York Oracle User Group Vikas Sawhney

Introduction to Hadoop. New York Oracle User Group Vikas Sawhney Introduction to Hadoop New York Oracle User Group Vikas Sawhney GENERAL AGENDA Driving Factors behind BIG-DATA NOSQL Database 2014 Database Landscape Hadoop Architecture Map/Reduce Hadoop Eco-system Hadoop

More information

Introduction to Cloud : Cloud and Cloud Storage. Lecture 2. Dr. Dalit Naor IBM Haifa Research Storage Systems. Dalit Naor, IBM Haifa Research

Introduction to Cloud : Cloud and Cloud Storage. Lecture 2. Dr. Dalit Naor IBM Haifa Research Storage Systems. Dalit Naor, IBM Haifa Research Introduction to Cloud : Cloud and Cloud Storage Lecture 2 Dr. Dalit Naor IBM Haifa Research Storage Systems 1 Advanced Topics in Storage Systems for Big Data - Spring 2014, Tel-Aviv University http://www.eng.tau.ac.il/semcom

More information

Introduction to Cloud Computing

Introduction to Cloud Computing Introduction to Cloud Computing Cloud Computing I (intro) 15 319, spring 2010 2 nd Lecture, Jan 14 th Majd F. Sakr Lecture Motivation General overview on cloud computing What is cloud computing Services

More information

Hadoop Distributed File System. T-111.5550 Seminar On Multimedia 2009-11-11 Eero Kurkela

Hadoop Distributed File System. T-111.5550 Seminar On Multimedia 2009-11-11 Eero Kurkela Hadoop Distributed File System T-111.5550 Seminar On Multimedia 2009-11-11 Eero Kurkela Agenda Introduction Flesh and bones of HDFS Architecture Accessing data Data replication strategy Fault tolerance

More information

Nutanix NOS 4.0 vs. Scale Computing HC3

Nutanix NOS 4.0 vs. Scale Computing HC3 Nutanix NOS 4.0 vs. Scale Computing HC3 HC3 Nutanix Integrated / Included Hypervisor Software! requires separate hypervisor licensing, install, configuration, support, updates Shared Storage benefits w/o

More information

W H I T E P A P E R. Deriving Intelligence from Large Data Using Hadoop and Applying Analytics. Abstract

W H I T E P A P E R. Deriving Intelligence from Large Data Using Hadoop and Applying Analytics. Abstract W H I T E P A P E R Deriving Intelligence from Large Data Using Hadoop and Applying Analytics Abstract This white paper is focused on discussing the challenges facing large scale data processing and the

More information

Architecting for the next generation of Big Data Hortonworks HDP 2.0 on Red Hat Enterprise Linux 6 with OpenJDK 7

Architecting for the next generation of Big Data Hortonworks HDP 2.0 on Red Hat Enterprise Linux 6 with OpenJDK 7 Architecting for the next generation of Big Data Hortonworks HDP 2.0 on Red Hat Enterprise Linux 6 with OpenJDK 7 Yan Fisher Senior Principal Product Marketing Manager, Red Hat Rohit Bakhshi Product Manager,

More information

The Methodology Behind the Dell SQL Server Advisor Tool

The Methodology Behind the Dell SQL Server Advisor Tool The Methodology Behind the Dell SQL Server Advisor Tool Database Solutions Engineering By Phani MV Dell Product Group October 2009 Executive Summary The Dell SQL Server Advisor is intended to perform capacity

More information

Big data management with IBM General Parallel File System

Big data management with IBM General Parallel File System Big data management with IBM General Parallel File System Optimize storage management and boost your return on investment Highlights Handles the explosive growth of structured and unstructured data Offers

More information

Hadoop IST 734 SS CHUNG

Hadoop IST 734 SS CHUNG Hadoop IST 734 SS CHUNG Introduction What is Big Data?? Bulk Amount Unstructured Lots of Applications which need to handle huge amount of data (in terms of 500+ TB per day) If a regular machine need to

More information

1. Comments on reviews a. Need to avoid just summarizing web page asks you for:

1. Comments on reviews a. Need to avoid just summarizing web page asks you for: 1. Comments on reviews a. Need to avoid just summarizing web page asks you for: i. A one or two sentence summary of the paper ii. A description of the problem they were trying to solve iii. A summary of

More information

Hadoop & its Usage at Facebook

Hadoop & its Usage at Facebook Hadoop & its Usage at Facebook Dhruba Borthakur Project Lead, Hadoop Distributed File System [email protected] Presented at the The Israeli Association of Grid Technologies July 15, 2009 Outline Architecture

More information

Understanding Hadoop Performance on Lustre

Understanding Hadoop Performance on Lustre Understanding Hadoop Performance on Lustre Stephen Skory, PhD Seagate Technology Collaborators Kelsie Betsch, Daniel Kaslovsky, Daniel Lingenfelter, Dimitar Vlassarev, and Zhenzhen Yan LUG Conference 15

More information

DATA MINING WITH HADOOP AND HIVE Introduction to Architecture

DATA MINING WITH HADOOP AND HIVE Introduction to Architecture DATA MINING WITH HADOOP AND HIVE Introduction to Architecture Dr. Wlodek Zadrozny (Most slides come from Prof. Akella s class in 2014) 2015-2025. Reproduction or usage prohibited without permission of

More information

Task Scheduling in Hadoop

Task Scheduling in Hadoop Task Scheduling in Hadoop Sagar Mamdapure Munira Ginwala Neha Papat SAE,Kondhwa SAE,Kondhwa SAE,Kondhwa Abstract Hadoop is widely used for storing large datasets and processing them efficiently under distributed

More information

Towards MapReduce Performance Optimization: A Look into the Optimization Techniques in Apache Hadoop for BigData Analytics

Towards MapReduce Performance Optimization: A Look into the Optimization Techniques in Apache Hadoop for BigData Analytics Towards MapReduce Performance Optimization: A Look into the Optimization Techniques in Apache Hadoop for BigData Analytics Kudakwashe Zvarevashe 1, Dr. A Vinaya Babu 2 1 M Tech Student, Dept of CSE, Jawaharlal

More information