The Greenplum Analytics Workbench External Overview 1
The Greenplum Analytics Workbench Definition Is a 1000-node Hadoop Cluster. Pre-configured with publicly available data sets. Contains the entire Hadoop stack consisting of HDFS, PIG, HIVE, HBase, Mahout Provides a mixed mode environment 2
The Greenplum Analytics Workbench A Collaborative Project Developed as a partnership with multiple industry leaders - Intel - Mellanox - Micron - Seagate - Super Micro - Switch - VMWare Built to provide large scale test bed for driving Hadoop innovation 3
Use Cases Data Scientist and Analysts Capability to run ground breaking analytics on large scale data sets. Mixed mode environment - Structured and unstructured data. Analytics tools, HBase, Hive, PIG pre-installed. Analyzing IMDB dataset with Hive QL and Pig Latin Which genre of movies is most common??? Answer: Short movies, followed by Drama s and Comedies Hive Query: Select genres.genre,count (genres.title) as total from genres group by genres.genre order by total; 4
Use Cases University Collaboration Problem solving on large scale cluster. Find solutions to unsolvable problems. Mahout Machine learning library advancing the field of bioinformatics Climate Simulation High Resolution Imaging Medical Research 5
Use Cases Non Academia Test bed driving innovation. Platform for large scale validation. Accelerating development of Hadoop. 6
Use Cases Partners Provides a collaboration platform to share cutting edge technology. Large scale benchmarking platform. Capability to switch Hadoop version within hours. 7
Partners Intel Contributed 2,000 6-core CPUs. Mellanox Contributed >1,000 network cards and 72 switches. Micron Contributed 6,000 8GB DRAM modules. Seagate Contributed 12,000 2TB Drives 8
Partners Super Micro Contributed 1,010 servers (chassis w/ motherboard). Switch Contributed the hosting facilities in its state-of-the-art data center. Vmware Contributed the operational support (from Mozy/ Rubicon). 9
Infrastructure 10
Cluster size Physical Hosts - More than 1,000 nodes With the use of VMs : > 10,000 Racks - 54 (50 just for the DataNodes) Processors - Over 24,000 CPU s RAM Over 48TB for memory Disk capacity More than 24PB of raw storage. Equivalent to nearly half of the entire written works of mankind from the beginning of recorded history The largest testbed cluster for Apache Hadoop validation! 11
Infiniband Network 5:1 Network Configuration (56 Gbps in-rack or 11.2 Gbps e2e) 12
Other Hadoop Servers in the cluster NameNode & SecondaryNameNode Master node for HDFS; maintains the FS Image JobTracker Master node for Map-Reduce HBase Master Used for Hbase to monitor all RegionServer instances in the cluster and manage all metadata changes. Zookeeper hosts Used for centralized configuration maintenance (needed by HBase) Hive Master Runs the servers and metastore required by Hive 13
Other Servers Data Ingestion Hosts Used mainly as a staging area for loading the data into the Hadoop cluster Access or Gateway Hosts The servers from which users will be able to run their jobs NAS Shared storage where users can store their home directories Workbench Management Server to host the UI Jenkins This is our continuous build environment Other management hosts YUM Repo Puppet server Kerberos Ganglia Nagios DNS DHCP NTP Kickstart Firewall SOCKs Proxy 14
Customer Onboarding Phases Discovery Selection Review Provisioning Operations Closing a Project 15
Customer Onboarding - Discovery Project Overview Scope Functional and Technical Requirements Objectives Timeline 16
Customer Onboarding Selection Review Reviewed by JEDI Council. Technical feasibility. Alignment with Greenplum vision. Resource availability. 17
Customer Onboarding Provisioning Users must register on AWB portal. Accounts on access and dil servers only. No shell access to 1000 data nodes or master node. No SUDO permission. Cluster is multi tenant. 18
Customer Onboarding Operations Cluster is not supported 24x7. Custom deployments require packages in RPM format. Custom deployments will take time, start early. 19
Customer Onboarding Closing a Project Default project duration is 90 days. Accounts will be revoked upon project completion. 20
How to submit your project proposal Review Customer Onboarding documentation and fill out EULA @ www.analyticsworkbench.com Fill out Project Consideration request form @ www.analyticsworkbench.com Determination of project selection will be made within 2 weeks of submitting consideration request 21
THANK YOU 22