Shared Disk Big Data Analytics with Apache Hadoop
Anirban Mukherjee, Joydip Datta, Raghavendra Jorapur, Ravi Singhvi, Saurav Haloi, Wasim Akram
{Anirban_Mukherjee, Joydip_Datta, Raghavendra_Jorapur, Ravi_Singhvi, Saurav_Haloi,
Symantec Corporation
ICON, Baner Road, Pune, India

Abstract: Big Data is a term applied to data sets whose size is beyond the ability of traditional software technologies to capture, store, manage and process within a tolerable elapsed time. The popular assumption around Big Data analytics is that it requires internet-scale scalability: hundreds of compute nodes with attached storage. In this paper, we debate the need for a massively scalable distributed computing platform for Big Data analytics in traditional businesses. For organizations which do not need horizontal, internet-order scalability in their analytics platform, Big Data analytics can be built on top of a traditional POSIX cluster file system employing a shared storage model. In this study, we compared a widely used clustered file system, VERITAS Cluster File System (SF-CFS), with the Hadoop Distributed File System (HDFS) using popular Map-Reduce benchmarks like TeraSort, DFS-IO and GridMix on top of Apache Hadoop. In our experiments SF-CFS could not only match the performance of HDFS but also outperformed it in many cases. This way, enterprises can fulfill their Big Data analytics needs with a traditional and existing shared storage model, without migrating to a different storage model in their data centers. This also brings other benefits like stability and robustness, a rich set of features and compatibility with traditional analytics applications.

Keywords: Big Data; Hadoop; Clustered File Systems; Analytics; Cloud

I. INTRODUCTION

The exponential growth of data over the last decade has introduced a new domain in the field of information technology called Big Data. Datasets that stretch the limits of traditional data processing and storage systems are often referred to as Big Data. The need to process and analyze such massive datasets has introduced a new form of data analytics called Big Data analytics. Big Data analytics involves analyzing large amounts of data of a variety of types to uncover hidden patterns, unknown correlations and other useful information. Many organizations are increasingly using Big Data analytics to get better insights into their businesses, increase their revenue and profitability and gain competitive advantages over rival organizations. The characteristics of Big Data can be broadly divided into four Vs: Volume, Velocity, Variety and Variability. Volume refers to the size of the data; Velocity refers to the pace at which data is generated; Variety and Variability describe the complexity and structure of the data and the different ways of interpreting it. A common notion about the applications which consume or analyze Big Data is that they require a massively scalable and parallel infrastructure. This notion is correct and makes sense for internet-scale organizations like Facebook or Google. However, for traditional enterprise businesses this is typically not the case. As per the Apache Hadoop wiki [3], a significant number of enterprise Hadoop deployments do not exceed 16 nodes. In such scenarios, the traditional storage model, with shared storage and a clustered file system on top of it, cannot be ruled out as a way to serve the needs of traditional as well as Big Data analytics.
The Big Data analytics platform in today's world often refers to the Map-Reduce framework, developed by Google [4], and the tools and ecosystem built around it. The Map-Reduce framework provides a programming model using map and reduce functions over key-value pairs that can be executed in parallel on a large cluster of compute nodes. Apache Hadoop [1] is an open source implementation of Google's Map-Reduce model, and has become extremely popular over the years for building Big Data analytics platforms. The other key aspect of Big Data analytics is to push the computation near the data. Generally, in a Map-Reduce environment, the compute and storage nodes are the same, i.e. the computational tasks run on the same set of nodes that hold the data required for the computations. By default, Apache Hadoop uses the Hadoop Distributed File System (HDFS) [2] as the underlying storage backend, but it is designed to work with other file systems as well. HDFS is not a POSIX-compliant file system, and once data is written it is not modifiable (a write-once, read-many access model). HDFS protects data by replicating data blocks across multiple nodes, with a default replication factor of 3.
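To make the programming model concrete, the sketch below shows the canonical word-count job written against Hadoop's org.apache.hadoop.mapreduce API. It is a standard illustration, not code from the paper: the map function emits (word, 1) pairs and the reduce function sums the counts per word.

```java
// Canonical word-count sketch of the Map-Reduce programming model,
// written against the org.apache.hadoop.mapreduce API of Hadoop 1.x.
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {
  // map: (offset, line) -> (word, 1) for every word in the line
  public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();
    @Override
    protected void map(Object key, Text value, Context ctx)
        throws IOException, InterruptedException {
      StringTokenizer it = new StringTokenizer(value.toString());
      while (it.hasMoreTokens()) {
        word.set(it.nextToken());
        ctx.write(word, ONE);
      }
    }
  }

  // reduce: (word, [1, 1, ...]) -> (word, count)
  public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context ctx)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable v : values) sum += v.get();
      ctx.write(key, new IntWritable(sum));
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = new Job(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```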
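To illustrate what "setting a few parameters" amounts to: in Hadoop 1.x the backend file system is selected by the fs.default.name key, and a URI scheme is mapped to an implementation class via fs.<scheme>.impl. The sketch below is ours, not the shipped product configuration; the cfs:// scheme and the connector class name are hypothetical placeholders.

```java
// Minimal sketch: routing Hadoop's FileSystem API traffic to a connector
// instead of HDFS. The "cfs" scheme and com.example.CFSFileSystem are
// hypothetical placeholders for the actual connector.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ConnectorConfigExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    conf.set("fs.default.name", "cfs:///");               // default FS URI (Hadoop 1.x key)
    conf.set("fs.cfs.impl", "com.example.CFSFileSystem"); // maps cfs:// to the connector class
    FileSystem fs = FileSystem.get(conf);
    System.out.println(fs.getUri());                      // cfs:///
    fs.mkdirs(new Path("/benchmarks"));                   // regular FileSystem calls now hit the cluster FS
  }
}
```

The same two properties can equally be set in core-site.xml, so existing Map/Reduce jobs run unmodified.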
III. ARCHITECTURE

The clustered file system connector module we developed for the Apache Hadoop platform has a very simple architecture. It removes the HDFS functionality from the Hadoop stack and replaces it with VERITAS Cluster File System. It introduces SF-CFS to the Hadoop stack by implementing the APIs which are used for communication between the Map/Reduce framework and the file system. This is possible because the Map/Reduce framework always talks in terms of a well-defined FileSystem [6] API for each data access. The FileSystem API is an abstract class which the file serving technology underneath Hadoop must implement. Both HDFS and our clustered file system connector module implement this FileSystem class, as shown in Figure 1.

Figure 1. Architecture of the SF-CFS Hadoop Connector

VERITAS Cluster File System being a parallel shared-data file system, the file system namespace and the data are available to all the nodes in the cluster at any given point of time. Unlike HDFS, where a NameNode maintains the metadata of the whole file system namespace, with SF-CFS all the nodes in the cluster can serve the metadata. Hence a query from the Map/Reduce framework pertaining to data locality can always be resolved by the compute node itself. The benefit of such a resolution is the elimination of the extra hops traversed with HDFS in scenarios where data is not locally available. Also, the data need not be replicated across data nodes in the case of a clustered file system. Since all the nodes have access to the data, the replication factor in SF-CFS is equivalent to that of HDFS with a replication factor equal to the number of nodes in the cluster. This architecture does away with the risk of losing data when a data node dies before minimum replication has been achieved for a chunk of data. The use of RAID technologies and vendor SLAs for the storage arrays used in a SAN environment contributes to the overall reliability of the data.
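To make the data-locality point concrete, the sketch below shows one way such a connector could answer locality queries. It is our illustration, not Symantec's actual module, and the class, scheme and node names are placeholders: because every node can read every block from the shared disk, the connector can report all hosts for any byte range, so the JobTracker can schedule any map task "node-local".

```java
// Illustrative sketch only: expose a POSIX cluster mount to Hadoop 1.x by
// delegating to the local file system and answering locality queries with
// every node in the cluster. Names below are hypothetical.
import java.io.IOException;
import java.net.URI;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.RawLocalFileSystem;

public class SharedDiskFileSystem extends RawLocalFileSystem {
  // Hypothetical node list; a real connector would discover cluster membership.
  private static final String[] NODES = {"node1", "node2", "node3", "node4"};

  @Override
  public URI getUri() {
    return URI.create("cfs:///"); // the scheme registered via fs.cfs.impl
  }

  @Override
  public BlockLocation[] getFileBlockLocations(FileStatus file, long start, long len)
      throws IOException {
    if (file == null || start < 0 || len <= 0) {
      return new BlockLocation[0];
    }
    // Shared disk: every byte range is readable from every node,
    // so report all hosts and let the scheduler treat any node as local.
    return new BlockLocation[] {new BlockLocation(NODES, NODES, start, len)};
  }
}
```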
IV. EXPERIMENTAL SETUP

In this study on shared disk Big Data analytics, we have compared HDFS (Apache Hadoop 1.0.2), the default file system of Apache Hadoop, and Symantec Corporation's VERITAS Cluster File System (SFCFSHA 6.0), which is widely deployed by enterprises and organizations in banking, financial, telecom, aviation and various other sectors. The hardware configuration for our evaluation comprises an 8-node cluster of VMware virtual machines on ESX 4.1. Each VM is hosted on an individual ESX host and has 8 vCPUs of 2.67GHz and 16GB physical memory. The cluster nodes are interconnected with a 1Gbps network link dedicated to Map/Reduce traffic through a D-Link switch. Shared storage for the clustered file system is carved from SAS disks of a mid-range Hitachi HUS130 array, and direct attached storage for HDFS is made available from the local SAS disks of the ESX hosts. Each compute node virtual machine runs Linux (RHEL 6.2). Performance of HDFS has been measured both with SAN and with DAS. The HDFS-SAN setup consists of the same storage LUNs used for SF-CFS, but configured such that no two nodes see the same storage, so as to emulate a local disk scenario. The HDFS-Local setup uses the DAS of each of the compute nodes. In both cases, we used ext4 as the primary file system. The following table summarizes the various scenarios we compared:

Scenario | Description
SF-CFS | Our solution in SAN
HDFS-SAN (1) | HDFS in SAN with replication factor 1
HDFS-SAN (3) | HDFS in SAN with replication factor 3
HDFS-Local (1) | HDFS in local disks (DAS) with replication factor 1
HDFS-Local (3) | HDFS in local disks (DAS) with replication factor 3

V. EXPERIMENTS

We have used TeraSort, TestDFSIO, MRBench and GridMix3 [7] for comparing the performance of SF-CFS and HDFS. These are widely used Map/Reduce benchmarks and come pre-packaged inside the Apache Hadoop distribution. In our performance evaluation, for TestDFSIO and TeraSort, we have done the comparison for replication factors of 1 as well as 3. We have used a block size of 64MB for both HDFS (dfs.blocksize) and SF-CFS (fs.local.block.size) in our experiments.

TeraSort: TeraSort is a Map/Reduce application that does a parallel merge sort on the keys in the data set generated by TeraGen. It is a benchmark that combines testing of the HDFS and Map/Reduce layers of a Hadoop cluster. A full TeraSort benchmark run consists of the following three steps (see the sketch after Table 1):
1. Generate the input data with TeraGen
2. Run TeraSort on the input data
3. Validate the sorted output data using TeraValidate

Hadoop TeraSort is a Map/Reduce job with a custom partitioner that uses a sorted list of n-1 sampled keys to define the key range for each reduce.

Figure 2: TeraSort

Figure 2 above illustrates the behavior of the TeraSort benchmark for dataset sizes of 10GB and 100GB. As observed, SF-CFS performs better than HDFS in all the different scenarios.

TABLE 1. TIME TAKEN FOR TERASORT (LOWER IS BETTER)
Dataset (GB) | SF-CFS | HDFS-SAN(1) | HDFS-SAN(3) | HDFS-Local(1) | HDFS-Local(3)
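The three TeraSort steps listed above are normally launched via the hadoop-examples jar; as a rough sketch, assuming the Apache Hadoop 1.x examples classes are on the classpath, the same run can be driven programmatically. The row count of 100,000,000 (100-byte rows) corresponds to the 10GB dataset; paths are placeholders.

```java
// Sketch: driving a full TeraSort benchmark run (TeraGen -> TeraSort ->
// TeraValidate), assuming the Hadoop 1.x examples jar is on the classpath.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.examples.terasort.TeraGen;
import org.apache.hadoop.examples.terasort.TeraSort;
import org.apache.hadoop.examples.terasort.TeraValidate;
import org.apache.hadoop.util.ToolRunner;

public class TeraSortDriver {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Step 1: generate 100,000,000 rows of 100 bytes each (~10GB)
    ToolRunner.run(conf, new TeraGen(), new String[]{"100000000", "/tera/in"});
    // Step 2: sort the generated data
    ToolRunner.run(conf, new TeraSort(), new String[]{"/tera/in", "/tera/out"});
    // Step 3: validate that the output is globally sorted
    ToolRunner.run(conf, new TeraValidate(), new String[]{"/tera/out", "/tera/report"});
  }
}
```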
In the Map/Reduce framework, the number of map tasks for a job is proportional to the input dataset size for a constant file system block size. Hence, an increase in dataset size leads to higher concurrency and load at the file system as well as the storage layer in the case of a shared file system. Due to this, the performance gap between SF-CFS and HDFS is observed to decrease as the dataset size increases.

TestDFSIO: TestDFSIO is a distributed I/O benchmark which tests the I/O performance of the file system in a Hadoop cluster. It does this by using a Map/Reduce job as a convenient way to read or write files in parallel; each file is read or written in a separate map task [8].

Figure 3: TestDFSIO Read (higher is better)
Figure 4: TestDFSIO Write (higher is better)

In our study, the performance tests are done on 64MB files, varying the number of files for the different test scenarios, as illustrated in figures 3 and 4 above. It has been observed that HDFS significantly outperforms SF-CFS in read for both replication factors of 1 and 3. This is due to the fact that HDFS pre-fetches an entire chunk of data equal to the block size and does not suffer from any cache coherence issues, thanks to its write-once semantics. HDFS-Local(3) gives an added advantage of read parallelism equal to the number of compute nodes, assuming the blocks are evenly distributed/replicated across all nodes, which a shared file system lacks. In TestDFSIO write, it is observed that HDFS on DAS with a replication factor of 1 outperforms SF-CFS. This performance improvement, however, comes at the cost of data loss in the event of node failures. In all other cases, SF-CFS performs similar to or better than HDFS for the TestDFSIO write workload.

MRBench: MRBench benchmarks a Hadoop cluster by running small jobs repeated a number of times. It checks the responsiveness of the Map/Reduce framework running in a cluster for small jobs; its focus is on the Map/Reduce layer, and its impact on the file system layer of Hadoop is minimal. In our evaluation we ran MRBench jobs repeated 50 times for SF-CFS, HDFS in SAN and HDFS in local disks with a replication factor of 3. The average response time reported by MRBench in milliseconds was found to be best for SF-CFS:

TABLE 2. RESPONSE TIME OF MRBENCH (LOWER IS BETTER)
Use Case | Avg Time (msec)
SF-CFS |
HDFS-SAN(3) |
HDFS-Local(3) |

GridMix3: GridMix3 is used to simulate Map/Reduce load on a Hadoop cluster by emulating real load mined from production Hadoop clusters. The goal of GridMix3 is to generate a realistic workload on a cluster to validate cluster utilization and measure Map/Reduce as well as file system performance by replaying job traces from Hadoop clusters that automatically capture the essential ingredients of job executions. In our experiments, we used the job trace available from Apache Subversion, for dataset sizes of 64GB, 128GB and 256GB. We observed that SF-CFS performed better than HDFS in SAN as well as in local disks with replication factor of 3.

Figure 5: Time (s) Taken by GridMix3 (lower is better)
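As a usage note for the TestDFSIO runs described above, the write and read phases can be driven as in the sketch below, which assumes the hadoop-test jar shipped with Apache Hadoop 1.x is on the classpath (equivalent to launching it with hadoop jar); the -nrFiles and -fileSize flags select the number of files and the per-file size in MB.

```java
// Sketch: invoking the TestDFSIO benchmark from Java, assuming the
// hadoop-test jar of Apache Hadoop 1.x is on the classpath.
import org.apache.hadoop.fs.TestDFSIO;

public class DfsIoRun {
  public static void main(String[] args) throws Exception {
    // Write phase: 64 files of 64 MB each, one file per map task
    TestDFSIO.main(new String[]{"-write", "-nrFiles", "64", "-fileSize", "64"});
    // Read phase over the same files
    TestDFSIO.main(new String[]{"-read", "-nrFiles", "64", "-fileSize", "64"});
  }
}
```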
In the course of our study, we also compared the performance of SF-CFS with HDFS using the SWIM [9] benchmark by running Facebook job traces, and observed SF-CFS to perform better than or at par with HDFS. SWIM contains suites of workloads of thousands of jobs, with complex data, arrival and computation patterns, which enables rigorous performance measurement of Map/Reduce systems.

VI. ADDITIONAL CONSIDERATIONS

In addition to the comparable performance exhibited by SF-CFS for various Map/Reduce workloads and applications, SF-CFS provides the benefits of being a robust, stable and highly reliable file system. It gives the ability to run analytics on top of existing data using existing analytics tools and applications, which eliminates the need for copy-in and copy-out of data from a Hadoop cluster, saving a significant amount of time. SF-CFS also supports data ingestion over NFS. Along with all these, it brings in other standard features like snapshots, compression, file-level replication and de-duplication. For example, gzip compression of input splits is not possible with HDFS, as it is impossible to start reading at an arbitrary point in a gzip stream, so a map task cannot read its split independently of the others [8]. However, if compression is enabled in SF-CFS, the file system performs the decompression and returns the data to applications transparently. Data backups and disaster recovery are other built-in benefits of using SF-CFS for Big Data analytics. The SF-CFS solution for Hadoop, also known as Symantec Enterprise Solution for Hadoop, is available as a free download for SF-CFS customers of Symantec Corporation [10].

VII. FUTURE WORK

During our study of the performance exhibited by a commercial cluster file system under Map/Reduce workloads and its comparison with a distributed file system, we observed that a significant amount of time is spent in the copy phase of the Map/Reduce model after a map task finishes. In the Hadoop platform, the input and output data of Map/Reduce jobs are stored in HDFS, while the intermediate data generated by map tasks is stored in the local file system of the mapper nodes and copied (shuffled) via HTTP to the reducer nodes. The time taken to copy these intermediate map outputs increases proportionally with the size of the data. However, since in the case of a clustered file system all the nodes see all the data, this copy phase can be avoided by keeping the intermediate data in the clustered file system as well and having the reducer nodes read it directly from there. This would completely eliminate the copy phase after the map phase is over and is bound to give a significant boost to the overall performance of Map/Reduce jobs. It requires changes to the logic and code of the Map/Reduce framework implemented inside Apache Hadoop, and we keep it as future work.

VIII. CONCLUSIONS

From all the performance benchmark numbers and their analysis, it can be confidently reasoned that for Big Data analytics needs, the traditional shared storage model cannot be totally ruled out. While, due to architectural and design constraints, a cluster file system may not scale at the same rate as a shared-nothing model does, for use cases where internet-order scalability is not required, a clustered file system can do a decent job even in the Big Data analytics domain. A clustered file system like SF-CFS can provide numerous other benefits with its plethora of features.
This decade has seen the success of virtualization, which introduced the recent trends of server consolidation and green computing initiatives in enterprises. Big Data analytics with a clustered file system on the existing infrastructure aligns with this model and direction. A careful study of the needs and use cases is required before building a Big Data analytics platform, rather than going with the notion that the shared-nothing model is the only answer to Big Data needs.

ACKNOWLEDGEMENTS

We would like to thank the anonymous reviewers at Symantec for their feedback on this work. Niranjan Pendharkar, Mayuresh Kulkarni and Yatish Jain contributed to the early design of the clustered file system connector module for the Apache Hadoop platform.

REFERENCES

[1] Apache Hadoop. hadoop.apache.org
[2] K. Shvachko, H. Kuang, S. Radia and R. Chansler, "The Hadoop Distributed File System," Mass Storage Systems and Technologies (MSST), 2010 IEEE 26th Symposium, 3-7 May 2010.
[3] Powered by Hadoop.
[4] J. Dean and S. Ghemawat, "MapReduce: Simplified Data Processing on Large Clusters," Sixth Symposium on Operating System Design and Implementation, December 2004.
[5] R. Ananthanarayanan, K. Gupta, P. Pandey, H. Pucha, P. Sarkar, M. Shah and R. Tewari (IBM Research), "Cloud Analytics: Do We Really Need to Reinvent the Storage Stack?" USENIX HotCloud '09.
[6] Apache Hadoop FileSystem API. hadoop.apache.org/common/docs/current/api/org/apache/hadoop/fs/filesystem.html
[7] GridMix3. developer.yahoo.com/blogs/hadoop/posts/2010/04/gridmix3_emulating_production/
[8] T. White, Hadoop: The Definitive Guide, Third Edition, O'Reilly.
[9] SWIMProjectUCB. github.com/SWIMProjectUCB/SWIM/wiki
[10] Symantec Enterprise Solution for Hadoop. symantec.com/enterprise-solution-for-hadoop