The Greenplum Analytics Workbench




The Greenplum Analytics Workbench: External Overview

The Greenplum Analytics Workbench: Definition
- A 1,000-node Hadoop cluster.
- Pre-configured with publicly available data sets.
- Contains the entire Hadoop stack: HDFS, Pig, Hive, HBase, and Mahout.
- Provides a mixed-mode environment.

The Greenplum Analytics Workbench: A Collaborative Project
- Developed as a partnership with multiple industry leaders:
  - Intel
  - Mellanox
  - Micron
  - Seagate
  - Super Micro
  - Switch
  - VMware
- Built to provide a large-scale test bed for driving Hadoop innovation.

Use Cases: Data Scientists and Analysts
- Capability to run groundbreaking analytics on large-scale data sets.
- Mixed-mode environment: structured and unstructured data.
- Analytics tools (HBase, Hive, Pig) pre-installed.
- Example: analyzing the IMDB dataset with HiveQL and Pig Latin.
  - Question: Which genre of movies is most common?
  - Answer: Short movies, followed by dramas and comedies.
  - Hive query: SELECT genres.genre, COUNT(genres.title) AS total FROM genres GROUP BY genres.genre ORDER BY total;
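The grouping and counting done by the Hive query can be sketched in plain Python. This is only an illustration: the rows below are made-up stand-ins for the `genres` table (they are not IMDB data), and `genre_totals` is a hypothetical helper name.

```python
from collections import Counter

# Hypothetical sample rows standing in for the `genres` table: (title, genre).
# These are illustrative values, not real IMDB data.
rows = [
    ("Film A", "Short"),
    ("Film B", "Short"),
    ("Film C", "Short"),
    ("Film D", "Drama"),
    ("Film E", "Drama"),
    ("Film F", "Comedy"),
]


def genre_totals(rows):
    """Count titles per genre and sort ascending by count.

    Mirrors GROUP BY genre / COUNT(title) / ORDER BY total. Rows whose
    title is None are skipped, as COUNT(column) ignores NULL values.
    """
    counts = Counter(genre for title, genre in rows if title is not None)
    return sorted(counts.items(), key=lambda item: item[1])


totals = genre_totals(rows)
for genre, total in totals:
    print(genre, total)
```

Because the sort is ascending, as with a plain ORDER BY, the most common genre is the last row of the result.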

Use Cases: University Collaboration
- Problem solving on a large-scale cluster.
- Find solutions to unsolvable problems.
- Mahout machine learning library advancing the field of bioinformatics.
- Climate simulation.
- High-resolution imaging.
- Medical research.

Use Cases: Non-Academia
- Test bed driving innovation.
- Platform for large-scale validation.
- Accelerating development of Hadoop.

Use Cases: Partners
- Provides a collaboration platform to share cutting-edge technology.
- Large-scale benchmarking platform.
- Capability to switch Hadoop versions within hours.

Partners
- Intel: contributed 2,000 6-core CPUs.
- Mellanox: contributed more than 1,000 network cards and 72 switches.
- Micron: contributed 6,000 8 GB DRAM modules.
- Seagate: contributed 12,000 2 TB drives.

Partners
- Super Micro: contributed 1,010 servers (chassis with motherboard).
- Switch: contributed the hosting facilities in its state-of-the-art data center.
- VMware: contributed the operational support (from Mozy/Rubicon).

Infrastructure

Cluster Size
- Physical hosts: more than 1,000 nodes (more than 10,000 with the use of VMs).
- Racks: 54 (50 just for the DataNodes).
- Processors: over 24,000 CPUs.
- RAM: over 48 TB of memory.
- Disk capacity: more than 24 PB of raw storage, equivalent to nearly half of the entire written works of mankind from the beginning of recorded history.
- The largest test bed cluster for Apache Hadoop validation.
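As a rough cross-check, the RAM and disk totals can be recomputed from the partner contributions listed in the Partners slides. The sketch below assumes decimal units (1 TB = 1,000 GB, 1 PB = 1,000 TB); the variable names are illustrative. The slides do not say how the "over 24,000 CPUs" figure is counted: 2,000 six-core CPUs give 12,000 physical cores, so the larger number may include hardware threads, but that is a guess and is not computed here.

```python
# Partner contributions as stated in the Partners slides.
cpus, cores_per_cpu = 2_000, 6            # Intel
dram_modules, gb_per_module = 6_000, 8    # Micron
drives, tb_per_drive = 12_000, 2          # Seagate

# Assumption: decimal units (1 TB = 1,000 GB, 1 PB = 1,000 TB).
total_cores = cpus * cores_per_cpu
total_ram_tb = dram_modules * gb_per_module / 1_000
total_disk_pb = drives * tb_per_drive / 1_000

print(f"physical cores: {total_cores}")
print(f"RAM: {total_ram_tb} TB")
print(f"raw disk: {total_disk_pb} PB")
```

Under these assumptions the contributed parts alone account for 48 TB of RAM and 24 PB of raw disk, in line with the cluster figures quoted above.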

InfiniBand Network
- 5:1 network configuration (56 Gbps in-rack or 11.2 Gbps end-to-end).
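The two bandwidth figures are consistent if the 5:1 figure is read as an oversubscription ratio between in-rack and cross-rack bandwidth. That reading is an assumption, since the slide does not define the ratio; the arithmetic is sketched below with illustrative variable names.

```python
# Figures from the slide.
in_rack_gbps = 56.0

# Assumption: "5:1" is the oversubscription ratio of in-rack
# bandwidth to end-to-end (cross-rack) bandwidth.
oversubscription = 5

end_to_end_gbps = in_rack_gbps / oversubscription
print(f"end-to-end bandwidth: {end_to_end_gbps:.1f} Gbps")
```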

Other Hadoop Servers in the Cluster
- NameNode and SecondaryNameNode: master node for HDFS; maintains the FS image.
- JobTracker: master node for MapReduce.
- HBase Master: used by HBase to monitor all RegionServer instances in the cluster and manage all metadata changes.
- ZooKeeper hosts: used for centralized configuration maintenance (needed by HBase).
- Hive Master: runs the servers and metastore required by Hive.

Other Servers
- Data ingestion hosts: used mainly as a staging area for loading data into the Hadoop cluster.
- Access or gateway hosts: the servers from which users will be able to run their jobs.
- NAS: shared storage where users can store their home directories.
- Workbench management: server to host the UI.
- Jenkins: the continuous build environment.
- Other management hosts:
  - YUM repo
  - Puppet server
  - Kerberos
  - Ganglia
  - Nagios
  - DNS
  - DHCP
  - NTP
  - Kickstart
  - Firewall
  - SOCKS proxy

Customer Onboarding Phases
- Discovery
- Selection Review
- Provisioning
- Operations
- Closing a Project

Customer Onboarding: Discovery
- Project overview
- Scope
- Functional and technical requirements
- Objectives
- Timeline

Customer Onboarding: Selection Review
- Reviewed by the JEDI Council.
- Technical feasibility.
- Alignment with the Greenplum vision.
- Resource availability.

Customer Onboarding: Provisioning
- Users must register on the AWB portal.
- Accounts are provided on access and dil servers only.
- No shell access to the 1,000 data nodes or the master node.
- No sudo permission.
- The cluster is multi-tenant.

Customer Onboarding: Operations
- The cluster is not supported 24x7.
- Custom deployments require packages in RPM format.
- Custom deployments take time, so start early.

Customer Onboarding: Closing a Project
- Default project duration is 90 days.
- Accounts will be revoked upon project completion.
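For planning purposes, the default 90-day window can be turned into an expected end date. This is a minimal sketch: the start date is a made-up example, `project_end` is a hypothetical helper, and it assumes the 90 days are counted as calendar days from the start date, which the slides do not specify.

```python
from datetime import date, timedelta

# Default project duration stated on the slide.
DEFAULT_PROJECT_DAYS = 90


def project_end(start, duration_days=DEFAULT_PROJECT_DAYS):
    """Return the date on which the project window closes.

    Assumption: the duration is counted in calendar days from `start`.
    """
    return start + timedelta(days=duration_days)


# Hypothetical start date, for illustration only.
end = project_end(date(2012, 6, 1))
print(end.isoformat())
```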

How to Submit Your Project Proposal
- Review the Customer Onboarding documentation and fill out the EULA at www.analyticsworkbench.com.
- Fill out the Project Consideration request form at www.analyticsworkbench.com.
- A decision on project selection will be made within 2 weeks of submitting the consideration request.

THANK YOU