Introduction

BIG DATA is a term that's been buzzing around a lot lately, and its use is a trend that's been increasing at a steady pace over the past few years. It's quite likely you've also encountered the term Hadoop when hearing about Big Data. Apache Hadoop is an open source framework for storing and processing extremely large amounts of data. In the last 2-3 years, many big players in the industry have developed their own Hadoop distributions, including Intel, Microsoft, IBM and EMC. Startups that focus only on Hadoop, such as Cloudera and Hortonworks, have now grown to be big players, too.

The beauty of Hadoop distributions is that they can be customized with a wide range of feature sets that address the specific needs of different sets of users. When choosing where to spend its money, it's essential that a company find a distribution that's flexible enough to meet both current and future needs.

Various user groups requiring Hadoop, each with its own diverse needs, include:

- Upper management in large companies wanting to adopt Big Data solutions.
- Developers wanting to build tools for the Hadoop Ecosystem.
- Newbies learning Hadoop for the first time and wanting a temporary or non-serious Hadoop deployment.
For these and other types of users, we've studied, analyzed and thoroughly compared the following distributions:

1. Intel Distribution for Apache Hadoop (IDH)
2. Cloudera Distribution Including Apache Hadoop (CDH)
3. Hortonworks Data Platform (HDP)
4. MapR

In this paper, we'll share our experiences with each distribution, and provide both objective and subjective assessments of their features and their performance. This should help you select the distribution that best serves your specific user requirements. Visit this link for a detailed comparison table of the Hadoop Distributions.
1.1 Methodology

For features comparison, each of these distributions was installed on AWS EC2 Instances (5-node cluster). We used Intel's HiBench benchmarking utility to make performance comparisons.

Intel HiBench

HiBench, developed and open sourced by Intel, is a benchmarking suite specifically developed for Hadoop deployments. Read more about Intel HiBench and each of its benchmark tests here. The HiBench benchmarks used for this study include:

1. Sort
This workload sorts input data generated by the Apache Hadoop RandomTextWriter. It's representative of real-world MapReduce jobs that transform data from one format to another.

2. WordCount
This workload counts each word's occurrence within input data generated by the Apache Hadoop RandomTextWriter. It's representative of real-world MapReduce jobs that extract small amounts of key data from large data sets.

3. PageRank
This workload is an open-source implementation of the page-rank algorithm, a link-analysis algorithm used widely in web search engines.

4. Mahout Bayesian Classification
Large-scale data mining and machine learning, e.g., in Google and Facebook platforms, is a typical application area of MapReduce. Naive Bayes is a well-known classification algorithm for knowledge discovery and data mining, and this workload tests the naive Bayesian trainer in the Mahout open-source machine-learning library from Apache.

5. Hive Join
This workload models complex analytic queries of structured (relational) tables by computing both the average and the sum for each group by joining two different tables.

6. Hive Aggregate
This workload models complex analytic queries of structured (relational) tables by computing the sum of each group over a single read-only table.

7. Enhanced DFS IO
This workload tests the HDFS throughput of a Hadoop cluster. It computes the aggregated bandwidth by sampling the number of bytes read or written at fixed time intervals in each map task.
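To make the WordCount workload described above more concrete, here is a minimal single-process sketch of the map, shuffle and reduce steps such a job performs. It is only an illustration in plain Python, not HiBench's or Hadoop's actual code; the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_phase(line: str) -> Iterator[Tuple[str, int]]:
    """Emit a (word, 1) pair for every whitespace-separated word in a line."""
    for word in line.split():
        yield word, 1


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Group emitted values by key, mimicking what happens between map and reduce."""
    grouped: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(word: str, counts: List[int]) -> Tuple[str, int]:
    """Sum the partial counts collected for one word."""
    return word, sum(counts)


def word_count(lines: Iterable[str]) -> Dict[str, int]:
    """Run the three steps in sequence over an in-memory input."""
    pairs = (pair for line in lines for pair in map_phase(line))
    return dict(reduce_phase(word, counts) for word, counts in shuffle(pairs).items())
```

On a real cluster the map and reduce steps run in parallel across TaskTrackers and the input comes from HDFS; the sketch only shows the data flow.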
Each of these benchmarks was run three times with each distribution. We used the average values to compile our performance comparison graphs and reports.

EC2 Hardware and Hadoop Configuration

The Amazon EC2 Instances used for this study had the following configuration:

EC2 H/W (columns: Instance Type, Arch., vCPU, ECU, Memory (GiB), Storage (GB), Physical Cores (Per Instance)):
1 Master: m1.xlarge, 64-bit, x x Intel(R) Xeon(R) CPU E GHz
4 Slaves: m1.large, 64-bit, x x Intel(R) Xeon(R) CPU 2.66GHz

Each Hadoop Cluster had the following configuration:

Hadoop Cluster Configuration:
NameNode: 1
Secondary NameNode: 1
DataNodes: 5
JobTracker: 1
TaskTracker: 5
HDFS Capacity: ~5 TB

Five EC2 instances of this configuration were procured once. A 5-node cluster for each installed Hadoop Distribution was formed, studied and benchmarked. Each distribution was then uninstalled, followed by installation of a new distribution to be studied and benchmarked.

Because EC2 instances may share the same physical hardware with other AWS users, execution times may be affected at peak hours of load on AWS EC2. To minimize these errors, we benchmarked each distribution three times. The final result is the average of the three tests. All distributions were benchmarked with out-of-the-box default Hadoop configuration settings.
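The averaging step described above can be sketched as follows. This is a hypothetical helper, not part of HiBench; the function name, the `expected_runs` parameter and any timings passed to it are assumptions for illustration only, not measurements from this study.

```python
from statistics import mean
from typing import Dict, List


def average_runs(runs: Dict[str, List[float]], expected_runs: int = 3) -> Dict[str, float]:
    """Return the mean duration per benchmark, checking the number of repetitions."""
    averages: Dict[str, float] = {}
    for benchmark, durations in runs.items():
        if len(durations) != expected_runs:
            raise ValueError(
                f"{benchmark}: expected {expected_runs} runs, got {len(durations)}"
            )
        averages[benchmark] = mean(durations)
    return averages
```

Averaging over repeated runs dampens, but does not remove, the noise introduced by shared hardware, so small differences between distributions should be read with that in mind.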
1.2 Terminologies Used

The following terminologies will be used throughout this paper.

IDH: Intel Distribution for Apache Hadoop
CDH: Cloudera Distribution Including Apache Hadoop
HDP: Hortonworks Data Platform

2.0 Intel Distribution for Apache Hadoop

Of the Hadoop distributors mentioned here, Intel entered the market last. Based on its general features, Intel is trying hard to stay competitive and surpass other distributions by providing more and more features. However, IDH seems a bit embryonic, especially the Intel Manager: the user experience isn't very fluid, especially compared to Cloudera Manager, and it includes some glitches. The glitches are, however, negligible since it performs better than the others.

IDH's interface is very easy to navigate, and its search facility is also quite easy to use. The interface is definitely better than Ambari, the management console used by HDP. IDH supports most of the tools and components in the Apache Hadoop ecosystem, such as Hive and HBase.

Figure 1 Intel Manager Dashboard

2.1 Key Features

Intel appears to be focusing more heavily on security aspects than on performance. Its distribution is rich with security features. This is the only distribution that supports HDFS data encryption. Intel claims that encryption and decryption have been accelerated by ~14x using AES-NI instructions on certain processors from the Xeon series.
Other notable features include:

- MapReduce Application Management.
- Active Tuner, a tool that finds the best set of configurations for a MapReduce application.
- Node and service-level memory management directly from the manager. Other distros also support service-level memory management, but the services supported are limited.

The standard version of IDH comes with limited features when compared to its commercial version.

2.2 Benchmarks

Despite the performance optimizations claimed by Intel, IDH finished last in each benchmark. HDFS read/write throughputs were also a bit lower than those of the other distributions. Its low performance may be related to the fact that we didn't use the same hardware upon which Intel claimed performance optimizations. Nonetheless, it was the slowest on the same hardware configuration we used to benchmark the other distributions. We hope Intel will implement improvements in the future.

3.0 Cloudera Distribution Including Apache Hadoop

Cloudera is the oldest and most mature of all of the distributions we compared. It tops the list when it comes to introducing innovative new tools into the Hadoop ecosystem. Cloudera quickly adapts to new Hadoop releases and implements bug fixes in its distribution, which is why it's one of the leaders in Apache Hadoop project implementation.

The Cloudera distribution's management console, Cloudera Manager, is smooth to implement, intuitive to use and features a rich user interface that uses all available screen space to display useful administrator information in a clean, uncluttered manner.
Figure 2 Cloudera Manager Dashboard

3.1 Key Features

This is the only distribution that supports:

- Multi Cluster management.
- Node Templates. Groups of nodes can be created in a cluster with different sets of configurations for different groups, rather than having to use the same configuration throughout the cluster. This is particularly useful in the case of heterogeneous nodes.
- Adding new services to an already running cluster.

3.2 Benchmarks

CDH performed well in our benchmarks, standing between IDH and MapR. However, the performance gap between CDH and MapR is pretty wide. The results for CDH and HDP were nearly identical.

4.0 Hortonworks Data Platform

Though HDP's core tools, such as Hadoop and Hive, appear stable, its Ambari management console is more primitive than the other distro management consoles. Ambari, Apache's project for Hadoop cluster management, lacks many of the features found in other distros. Hortonworks/Apache needs to make considerable improvements on this front.
Figure 3 Ambari Dashboard

4.1 Key Features

HDP is the only distribution that supports the Windows platform, utilizing Windows Server on-premises and Windows Azure in the cloud. With implementation of the Stinger project that's now under development, Hortonworks plans to make Hive 100x faster.

4.2 Benchmarks

As mentioned before, performance for HDP and CDH was nearly the same in our benchmarks. They performed better than IDH, but worse than MapR.

5.0 MapR

The groundbreaking MapR is the most innovative of the four distributions that we've tested. The company has significantly modified the MapReduce framework architecture and implemented its own file system, called MapR-FS. The performance gains of these modifications aren't only theoretical, as MapR performed ~2x faster than the other three distros in several of our benchmarks.
Figure 4 MapR-FS Simplified Architecture

Figure 5 MapR Control Center Dashboard

MapR's management console isn't as feature rich as Cloudera Manager.

5.1 Key Features

- No NameNode architecture.
- Multi Node Direct Access NFS, which allows users to mount MapR-FS over NFS and access it just like a normal file system. This allows any application to access Hadoop data in the traditional manner.
- Amazon EMR and Google Cloud Platform use MapR for their Hadoop services.

5.2 Benchmarks

MapR performed fastest of the four distributions in our benchmark tests. The file system throughput, both read and write, was very high, so if performance is one's primary requirement, then MapR is the clear distribution choice.

6 Summary

Take a look at the detailed comparison report here.

CDH
Pros:
1. Cloudera Manager offers the most feature-rich interface of all.
2. Bundles many useful tools, such as Cloudera Impala.
Cons:
1. Slower than MapR.

IDH
Pros:
1. Rich in security features.
Cons:
1. Slowest performer in our benchmarks.

HDP
Pros:
1. The only distribution supporting the Windows platform.
Cons:
1. The Ambari management interface is primitive and lacks many useful features.

MapR
Pros:
1. Fastest of the four tested distros.
2. Multi-node direct access NFS allows mounting Hadoop FS over NFS, so applications can directly access data like a normal file system.
Cons:
1. The MapR management interface isn't as rich as Cloudera Manager.
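The practical effect of the multi-node direct access NFS feature noted for MapR is that ordinary file APIs work against cluster data once the file system is mounted. The sketch below assumes the cluster is already mounted at some local directory; the function name and both parameters are hypothetical, and nothing in it is specific to MapR's own tooling.

```python
from pathlib import Path
from typing import Iterator


def read_records(mount_point: Path, relative_path: str) -> Iterator[str]:
    """Yield lines from a file under an NFS mount using only standard file I/O.

    mount_point is assumed to be the local directory where the cluster file
    system has been mounted; no Hadoop client library is involved.
    """
    with (mount_point / relative_path).open("r", encoding="utf-8") as handle:
        for line in handle:
            yield line.rstrip("\n")
```

Because the function only depends on a directory path, it can be exercised against any local folder, which is exactly the point of exposing the data through NFS.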
Figure 6 Micro, Web Search, M/C Learning and Data Analytics Benchmarks

Figure 7 File System Benchmarks
APPENDIX - I
HiBench Installation and Configuration

The following are the steps to install and run a benchmark on a Hadoop Distribution using HiBench:

Configuration

1. Setup HiBench
Download HiBench-2.2 benchmark suite from

2. Setup Hadoop
Before running any workload from HiBench, verify that your Hadoop framework is running correctly.

3. Setup Hive (for running Hive benchmarks)
Be sure to set up Hive properly in your cluster if you want to test hivebench. If you don't, the benchmark will use the default Hive-0.9 release that's included in the package.

4. Configuration for all workloads
Set some global environment variables in the bin/hibench-config.sh file located in the root dir. For example:
HADOOP_HOME <The Hadoop installation location>
HADOOP_CONF_DIR <The Hadoop configuration DIR, default is $HADOOP_HOME/conf>
COMPRESS_GLOBAL <Whether to enable the in/out compression for all workloads, 0 is disable, 1 is enable>
COMPRESS_CODEC_GLOBAL <The default codec used for in/out data compression>

5. Configure each workload
You can modify the conf/configure.sh file under each workload folder if it exists. All data sizes and options related to the workload are defined in this file.

6. Synchronize the time on all nodes
This is required for dfsioe, and optional for the others.

Running

1. Run several workloads together
The conf/benchmarks.lst file under the package folder defines the workloads to run when you execute the bin/run-all.sh script under the package folder. Each line in the list file specifies one workload. You can use # at the beginning of a line to skip the corresponding benchmark, if necessary.
2. Run each workload separately
Each workload can be run separately. There are 3 different files under one workload folder:
conf/configure.sh: Configuration file containing all parameters, such as data size and test options.
bin/prepare*.sh: Generates or copies the job input data into HDFS.
bin/run*.sh: Executes the workload.

3. Configure the benchmark
Set your own configurations by modifying configure.sh if necessary.

4. Prepare data
Run bin/prepare.sh (bin/prepare-read.sh for dfsioe) to prepare input data in HDFS for running the benchmark.

5. Run the benchmark
Run bin/run*.sh to execute the corresponding benchmark.
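The running steps above can be sketched in a few lines of Python. This is not part of HiBench: both function names are hypothetical, skipping blank lines is an assumption on top of the # rule described above, and the second function only lists scripts in the order they would be run without executing anything.

```python
from pathlib import Path
from typing import Iterable, List


def enabled_workloads(list_lines: Iterable[str]) -> List[str]:
    """Return the workloads named in a benchmarks.lst-style listing.

    Lines starting with '#' are treated as skipped benchmarks; blank lines
    are ignored as well (an assumption, not documented behaviour).
    """
    workloads: List[str] = []
    for raw in list_lines:
        entry = raw.strip()
        if not entry or entry.startswith("#"):
            continue
        workloads.append(entry)
    return workloads


def workload_steps(workload_dir: Path) -> List[Path]:
    """List a workload's scripts in execution order: prepare*.sh, then run*.sh."""
    bin_dir = workload_dir / "bin"
    prepare_scripts = sorted(bin_dir.glob("prepare*.sh"))
    run_scripts = sorted(bin_dir.glob("run*.sh"))
    return prepare_scripts + run_scripts
```

Keeping discovery separate from execution makes it easy to print and review the planned sequence before launching a long benchmark run on the cluster.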
APPENDIX - II
Intel Installation

Installation is a piece of cake with IDH, once all the prerequisites are met. Most of the configuration steps were automatic when adding new nodes. Regarding uninstallation, IDH is the only distribution that provides an uninstaller. With other distros, uninstallation and cleanup must be done manually.
APPENDIX - III
CDH Installation

Installation is very easy. A particularly useful feature of CDH is special support for installation on the AWS EC2 service. The installer detects that it's being run on an EC2 node and provides an option to follow different installation steps. These steps automatically create other EC2 nodes. All one then needs to do is to specify the count, instance type, and OS. This saves a lot of effort on EC2 when creating a large cluster, since only one installation node needs to be created. The other nodes are created and configured automatically. With other distros, one must configure each node individually and manually. More information on this feature may be found here.
APPENDIX - IV
HDP Installation

Installation is a bit cumbersome with HDP. One needs to perform some steps manually on each node. While these steps are not many and are not difficult, they require more effort than IDH or CDH.
APPENDIX - V
MapR Installation

Like HDP, installation is a bit cumbersome, but the installation guide provides clear, detailed steps to follow.

For further details contact:
Aater Suleman, CEO [email protected]
Ali Hussain, CTO [email protected]

About Flux7 Labs

Flux7 Labs offers expert advisory consulting and implementation services for optimization and management of computer systems. The principals of Flux7 have over two decades of combined experience with complex computer systems. We are based in Austin, Texas, and comprise a healthy mix of passionate geeks and dedicated experts in advanced computing. We have presented our consulting and training services to Bank of America, Sony and M.D. Anderson Cancer Center, among others.

Headquarters
Rosethorn Dr, Austin, Texas, 78758, United States
HiBench Installation. Sunil Raiyani, Jayam Modi
HiBench Installation Sunil Raiyani, Jayam Modi Last Updated: May 23, 2014 CONTENTS Contents 1 Introduction 1 2 Installation 1 3 HiBench Benchmarks[3] 1 3.1 Micro Benchmarks..............................
HiBench Introduction. Carson Wang ([email protected]) Software & Services Group
HiBench Introduction Carson Wang ([email protected]) Agenda Background Workloads Configurations Benchmark Report Tuning Guide Background WHY Why we need big data benchmarking systems? WHAT What is
Architecting for the next generation of Big Data Hortonworks HDP 2.0 on Red Hat Enterprise Linux 6 with OpenJDK 7
Architecting for the next generation of Big Data Hortonworks HDP 2.0 on Red Hat Enterprise Linux 6 with OpenJDK 7 Yan Fisher Senior Principal Product Marketing Manager, Red Hat Rohit Bakhshi Product Manager,
Hadoop on OpenStack Cloud. Dmitry Mescheryakov Software Engineer, @MirantisIT
Hadoop on OpenStack Cloud Dmitry Mescheryakov Software Engineer, @MirantisIT Agenda OpenStack Sahara Demo Hadoop Performance on Cloud Conclusion OpenStack Open source cloud computing platform 17,209 commits
Dell Reference Configuration for Hortonworks Data Platform
Dell Reference Configuration for Hortonworks Data Platform A Quick Reference Configuration Guide Armando Acosta Hadoop Product Manager Dell Revolutionary Cloud and Big Data Group Kris Applegate Solution
Intel Distribution for Apache Hadoop* Software: Optimization and Tuning Guide
OPTIMIZATION AND TUNING GUIDE Intel Distribution for Apache Hadoop* Software Intel Distribution for Apache Hadoop* Software: Optimization and Tuning Guide Configuring and managing your Hadoop* environment
Open source Google-style large scale data analysis with Hadoop
Open source Google-style large scale data analysis with Hadoop Ioannis Konstantinou Email: [email protected] Web: http://www.cslab.ntua.gr/~ikons Computing Systems Laboratory School of Electrical
Infomatics. Big-Data and Hadoop Developer Training with Oracle WDP
Big-Data and Hadoop Developer Training with Oracle WDP What is this course about? Big Data is a collection of large and complex data sets that cannot be processed using regular database management tools
H2O on Hadoop. September 30, 2014. www.0xdata.com
H2O on Hadoop September 30, 2014 www.0xdata.com H2O on Hadoop Introduction H2O is the open source math & machine learning engine for big data that brings distribution and parallelism to powerful algorithms
Deploying Hadoop with Manager
Deploying Hadoop with Manager SUSE Big Data Made Easier Peter Linnell / Sales Engineer [email protected] Alejandro Bonilla / Sales Engineer [email protected] 2 Hadoop Core Components 3 Typical Hadoop Distribution
Scaling the Deployment of Multiple Hadoop Workloads on a Virtualized Infrastructure
Scaling the Deployment of Multiple Hadoop Workloads on a Virtualized Infrastructure The Intel Distribution for Apache Hadoop* software running on 808 VMs using VMware vsphere Big Data Extensions and Dell
DATA MINING WITH HADOOP AND HIVE Introduction to Architecture
DATA MINING WITH HADOOP AND HIVE Introduction to Architecture Dr. Wlodek Zadrozny (Most slides come from Prof. Akella s class in 2014) 2015-2025. Reproduction or usage prohibited without permission of
The Inside Scoop on Hadoop
The Inside Scoop on Hadoop Orion Gebremedhin National Solutions Director BI & Big Data, Neudesic LLC. VTSP Microsoft Corp. [email protected] [email protected] @OrionGM The Inside Scoop
Hadoop & Spark Using Amazon EMR
Hadoop & Spark Using Amazon EMR Michael Hanisch, AWS Solutions Architecture 2015, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Agenda Why did we build Amazon EMR? What is Amazon EMR?
... ... PEPPERDATA OVERVIEW AND DIFFERENTIATORS ... ... ... ... ...
..................................... WHITEPAPER PEPPERDATA OVERVIEW AND DIFFERENTIATORS INTRODUCTION Prospective customers will often pose the question, How is Pepperdata different from tools like Ganglia,
Pro Apache Hadoop. Second Edition. Sameer Wadkar. Madhu Siddalingaiah
Pro Apache Hadoop Second Edition Sameer Wadkar Madhu Siddalingaiah Contents J About the Authors About the Technical Reviewer Acknowledgments Introduction xix xxi xxiii xxv Chapter 1: Motivation for Big
Hadoop: A Framework for Data- Intensive Distributed Computing. CS561-Spring 2012 WPI, Mohamed Y. Eltabakh
1 Hadoop: A Framework for Data- Intensive Distributed Computing CS561-Spring 2012 WPI, Mohamed Y. Eltabakh 2 What is Hadoop? Hadoop is a software framework for distributed processing of large datasets
Survey of the Benchmark Systems and Testing Frameworks For Tachyon-Perf
Survey of the Benchmark Systems and Testing Frameworks For Tachyon-Perf Rong Gu,Qianhao Dong 2014/09/05 0. Introduction As we want to have a performance framework for Tachyon, we need to consider two aspects
Hadoop Architecture. Part 1
Hadoop Architecture Part 1 Node, Rack and Cluster: A node is simply a computer, typically non-enterprise, commodity hardware for nodes that contain data. Consider we have Node 1.Then we can add more nodes,
Big Data and Industrial Internet
Big Data and Industrial Internet Keijo Heljanko Department of Computer Science and Helsinki Institute for Information Technology HIIT School of Science, Aalto University [email protected] 16.6-2015
Savanna Hadoop on. OpenStack. Savanna Technical Lead
Savanna Hadoop on OpenStack Sergey Lukjanov Savanna Technical Lead Mirantis, 2013 Agenda Savanna Overview Savanna Use Cases Roadmap & Current Status Architecture & Features Overview Hadoop vs. Virtualization
Open source software framework designed for storage and processing of large scale data on clusters of commodity hardware
Open source software framework designed for storage and processing of large scale data on clusters of commodity hardware Created by Doug Cutting and Mike Carafella in 2005. Cutting named the program after
Case Study : 3 different hadoop cluster deployments
Case Study : 3 different hadoop cluster deployments Lee moon soo [email protected] HDFS as a Storage Last 4 years, our HDFS clusters, stored Customer 1500 TB+ data safely served 375,000 TB+ data to customer
Performance and Energy Efficiency of. Hadoop deployment models
Performance and Energy Efficiency of Hadoop deployment models Contents Review: What is MapReduce Review: What is Hadoop Hadoop Deployment Models Metrics Experiment Results Summary MapReduce Introduced
MapReduce, Hadoop and Amazon AWS
MapReduce, Hadoop and Amazon AWS Yasser Ganjisaffar http://www.ics.uci.edu/~yganjisa February 2011 What is Hadoop? A software framework that supports data-intensive distributed applications. It enables
Parallel Data Mining and Assurance Service Model Using Hadoop in Cloud
Parallel Data Mining and Assurance Service Model Using Hadoop in Cloud Aditya Jadhav, Mahesh Kukreja E-mail: [email protected] & [email protected] Abstract : In the information industry,
An Experimental Approach Towards Big Data for Analyzing Memory Utilization on a Hadoop cluster using HDFS and MapReduce.
An Experimental Approach Towards Big Data for Analyzing Memory Utilization on a Hadoop cluster using HDFS and MapReduce. Amrit Pal Stdt, Dept of Computer Engineering and Application, National Institute
<Insert Picture Here> Big Data
Big Data Kevin Kalmbach Principal Sales Consultant, Public Sector Engineered Systems Program Agenda What is Big Data and why it is important? What is your Big
Department of Computer Science University of Cyprus EPL646 Advanced Topics in Databases. Lecture 14
Department of Computer Science University of Cyprus EPL646 Advanced Topics in Databases Lecture 14 Big Data Management IV: Big-data Infrastructures (Background, IO, From NFS to HFDS) Chapter 14-15: Abideboul
Ubuntu and Hadoop: the perfect match
WHITE PAPER Ubuntu and Hadoop: the perfect match February 2012 Copyright Canonical 2012 www.canonical.com Executive introduction In many fields of IT, there are always stand-out technologies. This is definitely
Apache Hadoop new way for the company to store and analyze big data
Apache Hadoop new way for the company to store and analyze big data Reyna Ulaque Software Engineer Agenda What is Big Data? What is Hadoop? Who uses Hadoop? Hadoop Architecture Hadoop Distributed File
Modernizing Your Data Warehouse for Hadoop
Modernizing Your Data Warehouse for Hadoop Big data. Small data. All data. Audie Wright, DW & Big Data Specialist [email protected] O 425-538-0044, C 303-324-2860 Unlock Insights on Any Data Taking
Lessons Learned: Building a Big Data Research and Education Infrastructure
Lessons Learned: Building a Big Data Research and Education Infrastructure G. Hsieh, R. Sye, S. Vincent and W. Hendricks Department of Computer Science, Norfolk State University, Norfolk, Virginia, USA
Big Data Too Big To Ignore
Big Data Too Big To Ignore Geert! Big Data Consultant and Manager! Currently finishing a 3 rd Big Data project! IBM & Cloudera Certified! IBM & Microsoft Big Data Partner 2 Agenda! Defining Big Data! Introduction
Hadoop Ecosystem Overview. CMSC 491 Hadoop-Based Distributed Computing Spring 2015 Adam Shook
Hadoop Ecosystem Overview CMSC 491 Hadoop-Based Distributed Computing Spring 2015 Adam Shook Agenda Introduce Hadoop projects to prepare you for your group work Intimate detail will be provided in future
Integrating SAP BusinessObjects with Hadoop. Using a multi-node Hadoop Cluster
Integrating SAP BusinessObjects with Hadoop Using a multi-node Hadoop Cluster May 17, 2013 SAP BO HADOOP INTEGRATION Contents 1. Installing a Single Node Hadoop Server... 2 2. Configuring a Multi-Node
Dell Cloudera Syncsort Data Warehouse Optimization ETL Offload
Dell Cloudera Syncsort Data Warehouse Optimization ETL Offload Drive operational efficiency and lower data transformation costs with a Reference Architecture for an end-to-end optimization and offload
Building & Optimizing Enterprise-class Hadoop with Open Architectures Prem Jain NetApp
Building & Optimizing Enterprise-class Hadoop with Open Architectures Prem Jain NetApp Introduction to Hadoop Comes from Internet companies Emerging big data storage and analytics platform HDFS and MapReduce
CSE-E5430 Scalable Cloud Computing Lecture 2
CSE-E5430 Scalable Cloud Computing Lecture 2 Keijo Heljanko Department of Computer Science School of Science Aalto University [email protected] 14.9-2015 1/36 Google MapReduce A scalable batch processing
A Performance Analysis of Distributed Indexing using Terrier
A Performance Analysis of Distributed Indexing using Terrier Amaury Couste Jakub Kozłowski William Martin Indexing Indexing Used by search
Hadoop Distributed File System. Dhruba Borthakur Apache Hadoop Project Management Committee [email protected] [email protected]
Hadoop Distributed File System Dhruba Borthakur Apache Hadoop Project Management Committee [email protected] [email protected] Hadoop, Why? Need to process huge datasets on large clusters of computers
Hadoop on Windows Azure: Hive vs. JavaScript for Processing Big Data
Hive vs. JavaScript for Processing Big Data For some time Microsoft didn t offer a solution for processing big data in cloud environments. SQL Server is good for storage, but its ability to analyze terabytes
Prepared By : Manoj Kumar Joshi & Vikas Sawhney
Prepared By : Manoj Kumar Joshi & Vikas Sawhney General Agenda Introduction to Hadoop Architecture Acknowledgement Thanks to all the authors who left their selfexplanatory images on the internet. Thanks
Adobe Deploys Hadoop as a Service on VMware vsphere
Adobe Deploys Hadoop as a Service A TECHNICAL CASE STUDY APRIL 2015 Table of Contents A Technical Case Study.... 3 Background... 3 Why Virtualize Hadoop on vsphere?.... 3 The Adobe Marketing Cloud and
Application Development. A Paradigm Shift
Application Development for the Cloud: A Paradigm Shift Ramesh Rangachar Intelsat t 2012 by Intelsat. t Published by The Aerospace Corporation with permission. New 2007 Template - 1 Motivation for the
Role of Cloud Computing in Big Data Analytics Using MapReduce Component of Hadoop
Role of Cloud Computing in Big Data Analytics Using MapReduce Component of Hadoop Kanchan A. Khedikar Department of Computer Science & Engineering Walchand Institute of Technoloy, Solapur, Maharashtra,
Introduction to Hadoop HDFS and Ecosystems. Slides credits: Cloudera Academic Partners Program & Prof. De Liu, MSBA 6330 Harvesting Big Data
Introduction to Hadoop HDFS and Ecosystems ANSHUL MITTAL Slides credits: Cloudera Academic Partners Program & Prof. De Liu, MSBA 6330 Harvesting Big Data Topics The goal of this presentation is to give
A Brief Introduction to Apache Tez
A Brief Introduction to Apache Tez Introduction It is a fact that data is basically the new currency of the modern business world. Companies that effectively maximize the value of their data (extract value
t] open source Hadoop Beginner's Guide ij$ data avalanche Garry Turkington Learn how to crunch big data to extract meaning from
Hadoop Beginner's Guide Learn how to crunch big data to extract meaning from data avalanche Garry Turkington [ PUBLISHING t] open source I I community experience distilled ftu\ ij$ BIRMINGHAMMUMBAI ')
Enabling High performance Big Data platform with RDMA
Enabling High performance Big Data platform with RDMA Tong Liu HPC Advisory Council Oct 7 th, 2014 Shortcomings of Hadoop Administration tooling Performance Reliability SQL support Backup and recovery
Benchmarking Sahara-based Big-Data-as-a-Service Solutions. Zhidong Yu, Weiting Chen (Intel) Matthew Farrellee (Red Hat) May 2015
Benchmarking Sahara-based Big-Data-as-a-Service Solutions Zhidong Yu, Weiting Chen (Intel) Matthew Farrellee (Red Hat) May 2015 Agenda o Why Sahara o Sahara introduction o Deployment considerations o Performance
Programming Hadoop 5-day, instructor-led BD-106. MapReduce Overview. Hadoop Overview
Programming Hadoop 5-day, instructor-led BD-106 MapReduce Overview The Client Server Processing Pattern Distributed Computing Challenges MapReduce Defined Google's MapReduce The Map Phase of MapReduce
Hadoop & its Usage at Facebook
Hadoop & its Usage at Facebook Dhruba Borthakur Project Lead, Hadoop Distributed File System [email protected] Presented at the The Israeli Association of Grid Technologies July 15, 2009 Outline Architecture
CSE 590: Special Topics Course ( Supercomputing ) Lecture 10 ( MapReduce& Hadoop)
CSE 590: Special Topics Course ( Supercomputing ) Lecture 10 ( MapReduce& Hadoop) Rezaul A. Chowdhury Department of Computer Science SUNY Stony Brook Spring 2016 MapReduce MapReduce is a programming model
MapReduce and Hadoop Distributed File System
MapReduce and Hadoop Distributed File System 1 B. RAMAMURTHY Contact: Dr. Bina Ramamurthy CSE Department University at Buffalo (SUNY) [email protected] http://www.cse.buffalo.edu/faculty/bina Partially
IT@Intel How Intel IT Successfully Migrated to Cloudera Apache Hadoop*
White Paper April 2015 IT@Intel How Intel IT Successfully Migrated to Cloudera Apache Hadoop* From our original experience with Apache Hadoop software, Intel IT identified new opportunities to reduce IT
Welcome to the unit of Hadoop Fundamentals on Hadoop architecture. I will begin with a terminology review and then cover the major components
Welcome to the unit of Hadoop Fundamentals on Hadoop architecture. I will begin with a terminology review and then cover the major components of Hadoop. We will see what types of nodes can exist in a Hadoop
Copyright 2012, Oracle and/or its affiliates. All rights reserved.
1 Oracle Big Data Appliance Releases 2.5 and 3.0 Ralf Lange Global ISV & OEM Sales Agenda Quick Overview on BDA and its Positioning Product Details and Updates Security and Encryption New Hadoop Versions
What We Can Do in the Cloud (2) -Tutorial for Cloud Computing Course- Mikael Fernandus Simalango WISE Research Lab Ajou University, South Korea
What We Can Do in the Cloud (2) -Tutorial for Cloud Computing Course- Mikael Fernandus Simalango WISE Research Lab Ajou University, South Korea Overview Riding Google App Engine Taming Hadoop Summary Riding
Big Data With Hadoop
With Saurabh Singh [email protected] The Ohio State University February 11, 2016 Overview 1 2 3 Requirements Ecosystem Resilient Distributed Datasets (RDDs) Example Code vs Mapreduce 4 5 Source: [Tutorials
Organizations are flooded with data. Not only that, but in an era of incredibly
Chapter 1 Introducing Hadoop and Seeing What It's Good For In This Chapter Seeing how Hadoop fills a need Digging (a bit) into Hadoop's history Getting Hadoop for yourself Looking at Hadoop application
Performance Evaluation for BlobSeer and Hadoop using Machine Learning Algorithms
Performance Evaluation for BlobSeer and Hadoop using Machine Learning Algorithms Elena Burceanu, Irina Presa Automatic Control and Computers Faculty Politehnica University of Bucharest Emails: {elena.burceanu,
IBM Platform Computing Cloud Service Ready to use Platform LSF & Symphony clusters in the SoftLayer cloud
IBM Platform Computing Cloud Service Ready to use Platform LSF & Symphony clusters in the SoftLayer cloud February 25, 2014 1 Agenda Mapping clients' needs to cloud technologies Addressing your pain
Setup Guide for HDP Developer: Storm. Revision 1 Hortonworks University
Setup Guide for HDP Developer: Storm Revision 1 Hortonworks University Overview The Hortonworks Training Course that you are attending is taught using a virtual machine (VM) for the lab environment. Before
Big Data and Natural Language: Extracting Insight From Text
An Oracle White Paper October 2012 Big Data and Natural Language: Extracting Insight From Text Table of Contents Executive Overview... 3 Introduction... 3 Oracle Big Data Appliance... 4 Synthesys... 5
Getting Started with Hadoop. Raanan Dagan Paul Tibaldi
Getting Started with Hadoop Raanan Dagan Paul Tibaldi What is Apache Hadoop? Hadoop is a platform for data storage and processing that is Scalable Fault tolerant Open source CORE HADOOP COMPONENTS Hadoop
Accelerating Enterprise Big Data Success. Tim Stevens, VP of Business and Corporate Development Cloudera
Accelerating Enterprise Big Data Success Tim Stevens, VP of Business and Corporate Development Cloudera 1 Big Opportunity: Extract value from data Revenue Growth x = 50 Billion 35 ZB Cost Savings Margin
Hadoop Scheduler with Deadline Constraint
Hadoop Scheduler with Deadline Constraint Geetha J 1, N UdayBhaskar 2, P ChennaReddy 3, Neha Sniha 4 1,4 Department of Computer Science and Engineering, M S Ramaiah Institute of Technology, Bangalore,
Benchmarking Hadoop & HBase on Violin
Technical White Paper Report Technical Report Benchmarking Hadoop & HBase on Violin Harnessing Big Data Analytics at the Speed of Memory Version 1.0 Abstract The purpose of benchmarking is to show advantages
Hadoop MapReduce and Spark. Giorgio Pedrazzi, CINECA-SCAI School of Data Analytics and Visualisation Milan, 10/06/2015
Hadoop MapReduce and Spark Giorgio Pedrazzi, CINECA-SCAI School of Data Analytics and Visualisation Milan, 10/06/2015 Outline Hadoop Hadoop Import data on Hadoop Spark Spark features Scala MLlib MLlib
Oracle Big Data SQL Technical Update
Oracle Big Data SQL Technical Update Jean-Pierre Dijcks Oracle Redwood City, CA, USA Keywords: Big Data, Hadoop, NoSQL Databases, Relational Databases, SQL, Security, Performance Introduction This technical
Apache Hadoop. Alexandru Costan
1 Apache Hadoop Alexandru Costan Big Data Landscape No one-size-fits-all solution: SQL, NoSQL, MapReduce, No standard, except Hadoop 2 Outline What is Hadoop? Who uses it? Architecture HDFS MapReduce Open
Cloudera Manager Training: Hands-On Exercises
201408 Cloudera Manager Training: Hands-On Exercises General Notes... 2 In-Class Preparation: Accessing Your Cluster... 3 Self-Study Preparation: Creating Your Cluster... 4 Hands-On Exercise: Working
Data Analyst Program- 0 to 100
Development Data Analyst Program- 0 to 100 Master the Data Analysis tools like Pig and Hive Data Science Build a recommendation engine 1 Data Analyst Program- 0 to 100 HADOOP SCHOOL OF TRAINING Basics
Big Data. Value, use cases and architectures. Petar Torre Lead Architect Service Provider Group. Dubrovnik, Croatia, South East Europe 20-22 May, 2013
Dubrovnik, Croatia, South East Europe 20-22 May, 2013 Big Data Value, use cases and architectures Petar Torre Lead Architect Service Provider Group 2011 2013 Cisco and/or its affiliates. All rights reserved.
HadoopRDF : A Scalable RDF Data Analysis System
HadoopRDF : A Scalable RDF Data Analysis System Yuan Tian 1, Jinhang DU 1, Haofen Wang 1, Yuan Ni 2, and Yong Yu 1 1 Shanghai Jiao Tong University, Shanghai, China {tian,dujh,whfcarter}@apex.sjtu.edu.cn
The Greenplum Analytics Workbench
The Greenplum Analytics Workbench External Overview 1 The Greenplum Analytics Workbench Definition Is a 1000-node Hadoop Cluster. Pre-configured with publicly available data sets. Contains the entire Hadoop
Apache Hadoop 1.0 High Availability Solution on VMware vSphere: Reference Architecture
Apache Hadoop 1.0 High Availability Solution on VMware vSphere Reference Architecture TECHNICAL WHITE PAPER v 1.0 June 2012 Table of Contents Executive Summary... 3 Introduction... 3 Terminology...
Lecture 32 Big Data. 1. Big Data problem 2. Why the excitement about big data 3. What is MapReduce 4. What is Hadoop 5. Get started with Hadoop
Lecture 32 Big Data 1. Big Data problem 2. Why the excitement about big data 3. What is MapReduce 4. What is Hadoop 5. Get started with Hadoop 1 2 Big Data Problems Data explosion Data from users on social
Introduction to Hadoop. New York Oracle User Group Vikas Sawhney
Introduction to Hadoop New York Oracle User Group Vikas Sawhney GENERAL AGENDA Driving Factors behind BIG-DATA NOSQL Database 2014 Database Landscape Hadoop Architecture Map/Reduce Hadoop Eco-system Hadoop
CS380 Final Project Evaluating the Scalability of Hadoop in a Real and Virtual Environment
CS380 Final Project Evaluating the Scalability of Hadoop in a Real and Virtual Environment James Devine December 15, 2008 Abstract MapReduce has been a very successful computational technique that has
Apache Hadoop: Past, Present, and Future
The 4 th China Cloud Computing Conference May 25 th, 2012. Apache Hadoop: Past, Present, and Future Dr. Amr Awadallah Founder, Chief Technical Officer [email protected], twitter: @awadallah Hadoop Past
MapR, Hive, and Pig on Google Compute Engine
White Paper MapR, Hive, and Pig on Google Compute Engine Bring Your MapR-based Infrastructure to Google Compute Engine MapR Technologies, Inc. www.mapr.com MapR, Hive, and Pig on Google Compute Engine
Moving From Hadoop to Spark
Moving From Hadoop to Spark Sujee Maniyam Founder / Principal @ www.elephantscale.com [email protected] Bay Area ACM meetup (2015-02-23) HI, Featured in Hadoop Weekly #109 About Me : Sujee
Data processing goes big
Test report: Integration Big Data Edition Data processing goes big Dr. Götz Güttich Integration is a powerful set of tools to access, transform, move and synchronize data. With more than 450 connectors,
Data Analytics. CloudSuite1.0 Benchmark Suite Copyright (c) 2011, Parallel Systems Architecture Lab, EPFL. All rights reserved.
Data Analytics CloudSuite1.0 Benchmark Suite Copyright (c) 2011, Parallel Systems Architecture Lab, EPFL All rights reserved. The data analytics benchmark relies on using the Hadoop MapReduce framework
Hadoop-based Open Source eDiscovery: FreeEed. (Easy as popcorn)
Hadoop-based Open Source eDiscovery: FreeEed (Easy as popcorn) Hello! 2 Sujee Maniyam & Mark Kerzner Founders @ Elephant Scale consulting and training around Hadoop, Big Data technologies Enterprise
QlikView, Creating Business Discovery Application using HDP V1.0 March 13, 2014
QlikView, Creating Business Discovery Application using HDP V1.0 March 13, 2014 Introduction Summary Welcome to the QlikView (Business Discovery Tools) tutorials developed by Qlik. The tutorials will is
Hadoop. Apache Hadoop is an open-source software framework for storage and large scale processing of data-sets on clusters of commodity hardware.
Hadoop Source Alessandro Rezzani, Big Data - Architettura, tecnologie e metodi per l'utilizzo di grandi basi di dati, Apogeo Education, ottobre 2013 Wikipedia Hadoop Apache Hadoop is an open-source software
Hadoop Installation MapReduce Examples Jake Karnes
Big Data Management Hadoop Installation MapReduce Examples Jake Karnes These slides are based on materials / slides from Cloudera.com Amazon.com Prof. P. Zadrozny's Slides Prerequisites You must have an
Fast, Low-Overhead Encryption for Apache Hadoop*
Fast, Low-Overhead Encryption for Apache Hadoop* Solution Brief Intel Xeon Processors Intel Advanced Encryption Standard New Instructions (Intel AES-NI) The Intel Distribution for Apache Hadoop* software
Platfora Deployment Planning Guide
Platfora Deployment Planning Guide Version 5.3 Copyright Platfora 2016 Last Updated: 5:30 p.m. June 27, 2016 Contents Document Conventions... 4 Contact Platfora Support...5 Copyright Notices... 5 Chapter
Bright Cluster Manager
Bright Cluster Manager A Unified Management Solution for HPC and Hadoop Martijn de Vries CTO Introduction Architecture Bright Cluster CMDaemon Cluster Management GUI Cluster Management Shell SOAP/ JSONAPI
Can High-Performance Interconnects Benefit Memcached and Hadoop?
Can High-Performance Interconnects Benefit Memcached and Hadoop? D. K. Panda and Sayantan Sur Network-Based Computing Laboratory Department of Computer Science and Engineering The Ohio State University,
Monitis Project Proposals for AUA. September 2014, Yerevan, Armenia
Monitis Project Proposals for AUA September 2014, Yerevan, Armenia Distributed Log Collecting and Analysing Platform Project Specifications Category: Big Data and NoSQL Software Requirements: Apache Hadoop
Data movement for globally deployed Big Data Hadoop architectures
Data movement for globally deployed Big Data Hadoop architectures Scott Rudenstein VP Technical Services November 2015 WANdisco Background WANdisco: Wide Area Network Distributed Computing " Enterprise
StACC: St Andrews Cloud Computing Co-laboratory. A Performance Comparison of Clouds. Amazon EC2 and Ubuntu Enterprise Cloud
StACC: St Andrews Cloud Computing Co-laboratory A Performance Comparison of Clouds Amazon EC2 and Ubuntu Enterprise Cloud Jonathan S Ward StACC (pronounced like 'stack') is a research collaboration launched
BITKOM & NIK - Big Data Wo liegen die Chancen für den Mittelstand?
BITKOM & NIK - Big Data Wo liegen die Chancen für den Mittelstand? The Big Data Buzz big data is a collection of data sets so large and complex that it becomes difficult to process using on-hand database
The Performance Characteristics of MapReduce Applications on Scalable Clusters
The Performance Characteristics of MapReduce Applications on Scalable Clusters Kenneth Wottrich Denison University Granville, OH 43023 [email protected] ABSTRACT Many cluster owners and operators have
