Greenplum Analytics Workbench
Abstract

This white paper details how the Greenplum Analytics Workbench was designed and built to validate Apache Hadoop code at scale, and to provide a large-scale experimentation environment for mixed-mode development that includes various SQL and non-SQL execution environments. It describes the core architectural components involved and highlights the benefits an enterprise can leverage to quickly and efficiently analyze Big Data.

Table of Contents

- Introduction
- Partners
  - Supermicro
  - Micron
  - Intel
  - Seagate
  - Mellanox
    - Mellanox ConnectX-3 VPI Network Adapter Card
    - Mellanox ConnectX-3 VPI Card Specifications
    - Mellanox SwitchX VPI Switches
    - Mellanox SwitchX VPI Switch Specifications
    - Mellanox Cables
    - Mellanox FDR Passive Copper and Optical Cable
    - Mellanox Unstructured Data Accelerator (UDA)
  - Switch
  - Rubicon
- Hadoop Software Overview
  - Hadoop MapReduce
  - Hadoop Distributed File System
  - Hadoop Ecosystem
- Hadoop on the Greenplum Analytics Workbench
- TeraSort Example
- About Greenplum
Introduction

Enterprises have been dealing with storing rapidly growing amounts of data, truly big data, arriving from traditional sources such as ERP and CRM systems and now from social media, blogs, and other channels. The initial focus for the enterprise was on efficient storage of big data, but the focus has now shifted to analytics on those big data sets. Hadoop has emerged as the platform of choice for processing big data, especially unstructured data. The out-of-box experience with Hadoop still leaves much to be desired, as it lacks the tools needed for easy deployment and management of such an infrastructure, especially for larger deployments. Enterprises are also quickly realizing that in order to maximize their analytics capabilities, a mixed-mode environment is imperative: one in which structured data sets (using traditional SQL) and unstructured data sets (using Hadoop/NoSQL) can be combined easily without reworking existing processes.

The Greenplum Analytics Workbench is built to provide an environment that supports mixed-mode development and validation at scale. The workbench is pre-configured with openly and freely available data sets and has analysis software built in for quick turnaround and rapid productivity. It contains the entire Hadoop stack, consisting of HDFS, MapReduce, Pig, Hive, HBase, and Mahout, and augments it with the SQL capabilities of Greenplum Database, the industry's leading MPP database, all deployed on the same nodes. The Analytics Workbench provides an ideal experimentation platform for Greenplum's thought leadership in the Unified Analytics Platform. It also provides a tremendous learning opportunity for organizations that wish to build and operate a large Hadoop/mixed-mode cluster.

Partners

The Analytics Workbench was assembled with the help of strong partnerships with some of the industry's leading vendors of hardware and services.
The Greenplum team forged close alliances with these partners to carefully assemble the hardware nodes and to rack, stack, and cable them in a state-of-the-art datacenter. An extremely fast network backbone and switching layer was designed to provide the cluster with blazing-fast throughput for intra-cluster communication.
Supermicro

Supermicro contributed 1,000 enterprise-ready server systems for all data-processing nodes, powering the 24 petabytes of storage available on the cluster. Supermicro fully assembled and tested the data-processing nodes in its Silicon Valley production center. The assembly process included integrating processors, memory, disk drives, and network cards from the other contributing partners. Supermicro's design team optimized the system configuration to address both datacenter space and power challenges. Supermicro servers maintain peak power efficiency with platinum-level, 94%-efficient power supplies. Supermicro then maximized node count per rack, reducing power and cooling overhead while delivering high performance and meeting the 24-terabyte storage requirement per data-processing node.

The specifications of the Supermicro systems are as follows:

2U Greenplum Hadoop OEM Server - Model # PIO-626T-6RF-EM09B

Supermicro dual-processor motherboard supporting:
- Dual Intel Xeon X5500/X5600 series processors
- Up to 192GB RAM with 12 DDR3 RDIMMs
- 5 expansion slots: 3 PCI-E 2.0 x8, 1 PCI-E 2.0 x4, 1 PCI-E x4
- Onboard LSI Gbps disk controller (IR mode)
- Dual LAN with Intel Gigabit Ethernet controller
- Dedicated IPMI remote management port

Supermicro 2U server chassis supporting:
- 12 hot-swap 3.5" drive trays
- Redundant 500-watt platinum-level power supplies with 94+% efficiency rating
- Active backplane with 6.0Gbps SAS/SATA expander
- 7 low-profile PCI-E expansion slots

Micron

Micron contributed 6,000 DDR3 RDIMM memory modules of 8GB each, a total of 48TB of memory distributed evenly across the 1,000 nodes so that each node has 48GB of RAM.
DDR3 functionality and operations supported:
- 240-pin registered dual in-line memory module (RDIMM)
- Fast data transfer rates: PC , PC3-8500, or PC GB (512 Meg x 72)
- VDD = 1.5V ±0.075V
- VDDSPD = V
- Supports ECC error detection and correction
- Nominal and dynamic on-die termination (ODT) for data, strobe, and mask signals
- Quad rank
- On-board I2C temperature sensor with integrated serial presence-detect (SPD) EEPROM
- Fixed burst chop (BC) of 4 and burst length (BL) of 8 via the mode register set (MRS)
- Selectable BC4 or BL8 on-the-fly (OTF)
- Gold edge contacts
Intel

Intel contributed 2,000 Westmere processors with the following specifications:

Processor Number: X5670
# of Cores: 6
# of Threads: 12
Clock Speed: 2.93 GHz
Max Turbo Frequency: 3.33 GHz
Intel Smart Cache: 12 MB
Bus/Core Ratio: 22
Intel QPI Speed: 6.4 GT/s
# of QPI Links: 2
Instruction Set: 64-bit
Instruction Set Extensions: SSE4.2
Embedded Options Available: No
Lithography: 32 nm
Max TDP: 95 W
Max Memory Size (dependent on memory type): 288 GB
Memory Types: DDR3-800/1066/1333
# of Memory Channels: 3
Max Memory Bandwidth: 32 GB/s
Physical Address Extensions: 40-bit
Supported technologies: ECC memory, Intel Turbo Boost Technology, Intel Hyper-Threading Technology, Intel Virtualization Technology (VT-x), Intel Virtualization Technology for Directed I/O (VT-d), Intel Trusted Execution Technology, AES New Instructions, Intel 64, Idle States, Enhanced Intel SpeedStep Technology, Intel Demand Based Switching, Thermal Monitoring Technologies, Execute Disable Bit

Table 1: (courtesy:
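The 32 GB/s maximum memory bandwidth quoted above follows directly from the other figures in the table; a quick sketch of the arithmetic (channel count and DDR3-1333 speed taken from the table, the 64-bit bus width being standard for DDR3):

```python
# Max memory bandwidth = channels x transfer rate x bus width.
CHANNELS = 3                 # memory channels per socket (from the table)
TRANSFERS_PER_SEC = 1333e6   # DDR3-1333, the fastest supported speed
BYTES_PER_TRANSFER = 8       # 64-bit data bus per channel

bandwidth_gb_s = CHANNELS * TRANSFERS_PER_SEC * BYTES_PER_TRANSFER / 1e9
print(f"{bandwidth_gb_s:.0f} GB/s")  # ~32 GB/s, matching the spec table
```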
Seagate

Seagate contributed 12,000 2TB drives, for a total of 24TB per node and 24PB of raw storage across the cluster. The specification of the Seagate drives is as follows:

Product Name: 2TB Constellation ES 7200RPM 3.5" SATA 6Gb/s 64MB Cache Hard Drive
Product Type: Hard Drive
Buffer: 64 MB
Hard Drive Interface: SATA/600
Compatible Drive Bay Width: 3.5"
SATA Pins: 7-pin
Height: 1.0 in
Width: 4.0 in
Depth: 5.8 in
Product Series: ES
Form Factor: Internal
Product Model: ST2000NM0011
Product Line: Constellation
Storage Capacity: 2 TB
Rotational Speed: 7200 rpm
Maximum External Data Transfer Rate: 600 MBps (4.7 Gbps)
Average Latency: 4.16 ms
Average Seek Time: 9.50 ms

Table 2: (courtesy:

Mellanox

Mellanox ConnectX-3 VPI Network Adapter Card

Mellanox ConnectX-3 VPI card specifications:

Part Number: MCX354A-FCBT
Supported Data Rates, InfiniBand: FDR; QDR; DDR
Supported Data Rates, Ethernet: 40GbE; 10GbE
PCI Express generations supported: 3.0; 2.0; 1.1
RDMA Support: InfiniBand; RoCE
Supported Media Types: Direct Attached Copper; Active Optical Cables; Optical Modules
Number of Ports and Type: 2 ports, QSFP+
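The per-node and cluster-wide sizing figures quoted so far are internally consistent; a quick cross-check using only the counts stated in this paper:

```python
# Sanity-check of the cluster sizing figures quoted in the paper.
NODES = 1000       # data-processing nodes
DRIVES = 12_000    # Seagate drives contributed
DRIVE_TB = 2       # capacity per drive
DIMMS = 6000       # Micron RDIMMs contributed
DIMM_GB = 8        # capacity per DIMM

total_storage_tb = DRIVES * DRIVE_TB              # 24,000 TB = 24 PB raw
storage_per_node_tb = total_storage_tb / NODES    # 24 TB/node (12 x 2TB drives)

total_ram_tb = DIMMS * DIMM_GB / 1000             # 48 TB of RAM cluster-wide
ram_per_node_gb = DIMMS * DIMM_GB / NODES         # 48 GB/node (6 DIMMs each)

print(total_storage_tb, storage_per_node_tb, total_ram_tb, ram_per_node_gb)
```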
Mellanox SwitchX VPI Switches

Mellanox SwitchX VPI switch specifications:

Part Number: MSX6036F-1SFR
Supported Data Rates, InfiniBand: FDR; QDR; DDR
Supported Data Rates, Ethernet: 40GbE; 10GbE
Port-to-Port Latency (InfiniBand): 170ns
Port-to-Port Latency (Ethernet): 230ns
Blocking Ratio: 1:1 (fully non-blocking)
Number of Ports and Type: 36 ports, QSFP+
Typical Power Consumption: 126W
Supported Media Types: Direct Attached Copper; Active Optical Cables; Optical Modules

Mellanox Cables

Mellanox FDR Passive Copper and Optical Cable

Greenplum Analytics Workbench connectivity is enabled by Mellanox's FDR cables. Both passive copper and active optical cables are used to provide a state-of-the-art cluster cabling solution as well as durability and ease of installation.

Mellanox Unstructured Data Accelerator (UDA)

Mellanox UDA, a software plugin, accelerates Hadoop networking and improves the scaling of Hadoop clusters running data-analytics-intensive applications. A novel data-moving protocol, which uses RDMA in combination with an efficient merge-sort algorithm, enables Hadoop clusters based on Mellanox InfiniBand and 40/10GbE RoCE (RDMA over Converged Ethernet) adapter cards to move data efficiently between servers, accelerating the Hadoop framework.

The 1,000-node Hadoop cluster is connected via a blazing-fast FDR 56Gbps InfiniBand interconnect, using ConnectX-3 VPI cards and SwitchX VPI switches, as described above. The cluster uses three layers of switching:

- Node-level switches that connect the 20 servers in each rack to the aggregation layer, using 4 FDR uplinks from each Top-of-Rack (ToR) switch.
- Aggregation-layer switches that connect to the core layer using 18 uplinks, delivering a fully non-blocking InfiniBand network between the aggregation and core levels.
- Core-level switches.

The cluster uses the IP-over-IB protocol to enable a more efficient connection to the socket-based portions of the framework.
UDA gives the MapReduce portion of the framework the ability to use RDMA connectivity between nodes, reducing CPU overhead and enabling lower-latency connections. The outcome of RDMA usage is a significant reduction in processing time for data sets of the same size.
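The rack-level figures above imply a modest oversubscription at the top-of-rack layer; a back-of-the-envelope check, assuming a single FDR port per server (the paper does not state the per-server port count explicitly):

```python
# Oversubscription at the ToR layer, from the figures quoted in the paper.
FDR_GBPS = 56          # FDR InfiniBand link speed
SERVERS_PER_RACK = 20  # from the rack layout
TOR_UPLINKS = 4        # FDR uplinks from each ToR switch

downlink_gbps = SERVERS_PER_RACK * FDR_GBPS  # 1120 Gb/s into the rack
uplink_gbps = TOR_UPLINKS * FDR_GBPS         # 224 Gb/s toward aggregation

oversubscription = downlink_gbps / uplink_gbps
print(f"ToR oversubscription: {oversubscription:.0f}:1")  # 5:1
```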
The interconnect layout of the network is as follows:

Switch

Switch is the state-of-the-art datacenter in Las Vegas, NV where the Analytics Workbench is hosted. The cluster occupies almost three full SCIFs consisting of 54 racks in all. Each rack holds 20 servers. A few racks are not completely full, leaving some room for expansion. The Switch datacenter will also be able to accommodate future growth in other SCIFs with no apparent impact on the overall cluster. The racks are divided into data racks, core racks, and infrastructure racks. Infrastructure racks hold servers for Puppet, Nagios, Ganglia, DNS, DHCP, etc., whereas the core racks hold the servers for the name node, job tracker node, ZooKeeper, HBase, etc. The rack layout is as follows:
Rubicon

The Rubicon team, part of VMware, provides Tier-1 and Tier-2 support for the cluster. This includes monitoring the network and hardware and running various system-level monitoring checks. The team uses Zabbix for systems management and has developed sophisticated plug-ins and dashboards on top of it. The Rubicon team has a local presence in Las Vegas and can provide rapid response to critical issues within the cluster.

Hadoop Software Overview

Hadoop is an industry-leading open source distributed data-processing framework that is designed to scale with the growing data storage and compute needs of an organization. By using the same nodes for both compute and storage, a cluster can scale in both dimensions simultaneously and avoid the traditional bottlenecks of NAS/SAN-type architectures. Some of the key components of Hadoop:

- Hadoop MapReduce: the parallel task-processing mechanism that takes a query (job) and runs it in parallel on multiple nodes. The parallelism provides much better throughput for unstructured data sets that can be processed independently.
- Hadoop Distributed File System (HDFS): the base file system layer that stores data across all of the nodes in the cluster.

MapReduce as a computing paradigm was popularized by Google, and Hadoop was written and open sourced by Yahoo as an implementation of that paradigm.

Hadoop MapReduce

Hadoop MapReduce is a software framework for easily writing applications that process large amounts of data in parallel on large clusters of commodity compute nodes. The diagram below depicts the basics of the MapReduce workflow. A MapReduce job (query) usually splits the input data set into independent chunks; the size of each chunk depends on a system-wide setting (typically 64MB). Each block is processed by the map tasks in a completely parallel manner. The framework sorts the outputs of the maps, which are then used as input to the reduce tasks.
Typically, both the input and the output of a job are stored in HDFS. The framework takes care of scheduling tasks, monitoring them, and re-executing failed tasks.
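The split/map/sort/reduce flow described above can be sketched in a few lines. This is a local, single-process analogue of a word-count job, not the actual Hadoop API; the input strings and their chunking are made up for illustration:

```python
from itertools import groupby

def map_phase(chunk):
    # Map: emit a (word, 1) pair for every word in the chunk.
    return [(word, 1) for word in chunk.split()]

def reduce_phase(key, values):
    # Reduce: sum the counts for one key.
    return (key, sum(values))

# Pretend each string is one input split (a ~64MB block in a real cluster).
chunks = ["big data big", "data analytics", "big analytics"]

# Map every chunk independently (in Hadoop, in parallel across nodes).
mapped = [pair for chunk in chunks for pair in map_phase(chunk)]

# Shuffle/sort: group intermediate pairs by key, as the framework does.
mapped.sort(key=lambda kv: kv[0])
reduced = dict(
    reduce_phase(key, (v for _, v in group))
    for key, group in groupby(mapped, key=lambda kv: kv[0])
)
print(reduced)  # {'analytics': 2, 'big': 3, 'data': 2}
```

In a real job the map tasks run on the nodes holding the input blocks, and the sorted intermediate output is shuffled across the network to the reduce tasks.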
Typically, in a Hadoop cluster, the MapReduce compute nodes and the storage layer (HDFS) reside on the same set of nodes. The system is configured to be rack-aware, making it possible for the framework to schedule tasks on the nodes where the data is already present, minimizing data movement within the cluster. This is the compute layer that derives key insights from the data residing in the HDFS layer. Hadoop is written entirely in Java, but MapReduce applications do not need to be: applications can use the Hadoop Streaming interface to specify any executable as the mapper or reducer for a particular job.

The MapReduce framework consists of the following:

- JobTracker: a single master per cluster that schedules, monitors, and manages jobs as well as their component tasks.
- TaskTracker: one slave TaskTracker per cluster node, executing the task components of a job as directed by the JobTracker.

In the upcoming release of Hadoop, the resource management module will undergo a drastic rework. It will maintain backwards compatibility while splitting the resource management capabilities into a standalone module.

Hadoop Distributed File System

The Hadoop Distributed File System (HDFS) is a block-based file system that allows user data to be stored in files. It retains the look and feel of a Linux file system, so users and applications can create or remove files and directories as well as move or rename them. HDFS does not support hard or soft links. All HDFS communication is layered on top of the TCP/IP protocol. The key components of HDFS are:

- NameNode: a single master metadata server that holds in-memory maps of every file, the blocks within each file, and the locations of those blocks within the HDFS namespace. In the upcoming release of Hadoop, a NameNode HA feature will be introduced to relieve some stress points of the existing design (such as the single point of failure).
- DataNode: one slave DataNode per cluster node, serving read/write requests and performing block creation, deletion, and replication as directed by the NameNode.

This is the storage layer where all the data resides before a MapReduce job can run on it. HDFS uses block replication to spread the data around the Hadoop cluster, both for protection and for data locality, so that MapReduce jobs can run against the same data on multiple compute nodes. The default block size is 64 MB and the default replication factor is 3x. The copies are written in a rack-aware manner so that all three copies do not reside on the same rack; the central idea behind replication is that if a rack goes down, the system still has access to as much of the full data set as possible.
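The rack-aware placement rule can be sketched as follows. This is a simplified stand-in for HDFS's actual block placement policy, not its implementation; the rack names and node layout are invented for illustration:

```python
import random

def place_replicas(racks, replication=3):
    """Pick `replication` nodes such that not all copies share a rack.

    `racks` maps rack name -> list of node names. Mirrors the spirit of
    the default HDFS policy: one replica on one rack, and the remaining
    replicas together on a second rack.
    """
    rack_a, rack_b = random.sample(sorted(racks), 2)
    first = random.choice(racks[rack_a])
    others = random.sample(racks[rack_b], replication - 1)
    return [first] + others

# Hypothetical layout: 20 servers per rack, as in the Workbench.
cluster = {f"rack{r}": [f"node{r}-{n}" for n in range(20)] for r in range(3)}
replicas = place_replicas(cluster)
print(replicas)  # e.g. ['node2-7', 'node0-3', 'node0-11'], spanning 2 racks
```

A rack failure then costs at most two of the three copies of any block, which is why the policy never lands all replicas on one rack.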
Hadoop Ecosystem

The Hadoop ecosystem consists of the following main building blocks:

- Hive: a SQL-like ad hoc querying interface for data stored in HDFS
- HBase: a column-oriented structured storage system on top of HDFS
- Pig: a high-level data flow language and execution framework for parallel computation
- Mahout: scalable machine learning algorithms on Hadoop

The above is not an exhaustive list of all Hadoop ecosystem components.

[Stack diagram: Analysis (Mahout); Workflow Mgmt. (Spring Batch, Oozie); Languages (Hive, M/R, Pig); Execution Env. (HBase); File System (HDFS)]

Hadoop on the Greenplum Analytics Workbench

For most users, a typical Hadoop cluster consists of a name node, a few other master nodes, and a large number of data nodes. The diagram below shows the Hadoop data nodes and the corresponding master nodes. A few of the master roles are hosted on the same machine to begin with; this may change depending on the load on the system.
In reality, a typical Hadoop cluster is supported by a number of additional roles, as shown in the diagram below. The table below provides a brief description of each server role:

- Access: these nodes are used to access the cluster; there is no direct access to the data nodes from outside. Typically these nodes support ssh-based connectivity.
- Data ingestion: data ingestion nodes are used for bulk upload of data into the cluster. These nodes can be used as a staging area for further processing prior to loading into HDFS.
- Web-based management: these nodes are used for accessing the cluster via HTTP.
- Jenkins: Jenkins is an open source continuous integration framework. The server is used to build Hadoop code on demand or on a predefined trigger, and provides a dashboard to view the results.
- Ganglia, Nagios, and Zabbix: these systems are used to monitor the cluster. For the Analytics Workbench, Zabbix is currently used to monitor system-level statistics, whereas the Nagios and Ganglia combination is used for application-level monitoring.
- Plato server: deployed to monitor the health of the disks; actively monitored by the Rubicon team.
- Kickstart, DNS, DHCP, and NTP: the Kickstart server is used to load the base OS onto the nodes, whereas DNS, DHCP, and NTP are used for network management.
- YUM repo, Puppet master, Kerberos: the YUM repo serves as a repository for RHEL packages. It is used by the Puppet master and the Puppet agent running on each data node to obtain the packages the slaves need for deployment. Kerberos is used as an authentication mechanism (needed as part of a secure Hadoop implementation).
- UFM: the unified fabric manager for the Mellanox network.
TeraSort Example

The industry-standard TeraSort benchmark was run on the cluster. The cluster configuration was not tuned to the best possible settings; the intent of the run was simply to validate the general health of the cluster and to measure TeraSort run characteristics. The first run was against 1TB of data and the second against 10TB. There are plans to run 100TB and even 1PB sorts in the near future. The results are shown below:

ABOUT GREENPLUM

Greenplum, a division of EMC, is driving the future of Big Data analytics with breakthrough products that harness the skills of data science teams to help global organizations realize the full promise of business agility and become data-driven, predictive enterprises. The division's products include Greenplum Unified Analytics Platform, Greenplum Data Computing Appliance, Greenplum Database, Greenplum Analytics Lab, Greenplum HD, and Greenplum Chorus. They embody the power of open systems, cloud computing, virtualization, and social collaboration, enabling global organizations to gain greater insight and value from their data than ever before possible. Learn more at

CONTACT US

To learn more about how Greenplum products, services, and solutions can help you realize Big Data analytics opportunities, visit us at

Greenplum, a Division of EMC, 1900 South Norfolk Street, San Mateo, CA Tel:

EMC2, EMC, the EMC logo, and Greenplum are registered trademarks or trademarks of EMC Corporation in the United States and other countries. All other trademarks used herein are the property of their respective owners. Copyright 2012 EMC Corporation. All rights reserved. Published in the USA. 05/12 White Paper
IT@Intel White Paper Intel IT IT Best Practices: Data Center Solutions Server Virtualization August 2010 Analyzing the Virtualization Deployment Advantages of Two- and Four-Socket Server Platforms Executive
More informationCost-Effective Business Intelligence with Red Hat and Open Source
Cost-Effective Business Intelligence with Red Hat and Open Source Sherman Wood Director, Business Intelligence, Jaspersoft September 3, 2009 1 Agenda Introductions Quick survey What is BI?: reporting,
More informationIntroduction to Hadoop. New York Oracle User Group Vikas Sawhney
Introduction to Hadoop New York Oracle User Group Vikas Sawhney GENERAL AGENDA Driving Factors behind BIG-DATA NOSQL Database 2014 Database Landscape Hadoop Architecture Map/Reduce Hadoop Eco-system Hadoop
More informationPrepared By : Manoj Kumar Joshi & Vikas Sawhney
Prepared By : Manoj Kumar Joshi & Vikas Sawhney General Agenda Introduction to Hadoop Architecture Acknowledgement Thanks to all the authors who left their selfexplanatory images on the internet. Thanks
More informationCisco UCS B440 M2 High-Performance Blade Server
Data Sheet Cisco UCS B440 M2 High-Performance Blade Server Product Overview The Cisco UCS B440 M2 High-Performance Blade Server delivers the performance, scalability and reliability to power computation-intensive,
More informationFLOW-3D Performance Benchmark and Profiling. September 2012
FLOW-3D Performance Benchmark and Profiling September 2012 Note The following research was performed under the HPC Advisory Council activities Participating vendors: FLOW-3D, Dell, Intel, Mellanox Compute
More informationExploiting Remote Memory Operations to Design Efficient Reconfiguration for Shared Data-Centers over InfiniBand
Exploiting Remote Memory Operations to Design Efficient Reconfiguration for Shared Data-Centers over InfiniBand P. Balaji, K. Vaidyanathan, S. Narravula, K. Savitha, H. W. Jin D. K. Panda Network Based
More informationReference Design: Scalable Object Storage with Seagate Kinetic, Supermicro, and SwiftStack
Reference Design: Scalable Object Storage with Seagate Kinetic, Supermicro, and SwiftStack May 2015 Copyright 2015 SwiftStack, Inc. swiftstack.com Page 1 of 19 Table of Contents INTRODUCTION... 3 OpenStack
More informationA very short Intro to Hadoop
4 Overview A very short Intro to Hadoop photo by: exfordy, flickr 5 How to Crunch a Petabyte? Lots of disks, spinning all the time Redundancy, since disks die Lots of CPU cores, working all the time Retry,
More informationLecture 5: GFS & HDFS! Claudia Hauff (Web Information Systems)! ti2736b-ewi@tudelft.nl
Big Data Processing, 2014/15 Lecture 5: GFS & HDFS!! Claudia Hauff (Web Information Systems)! ti2736b-ewi@tudelft.nl 1 Course content Introduction Data streams 1 & 2 The MapReduce paradigm Looking behind
More informationAn Oracle White Paper June 2012. High Performance Connectors for Load and Access of Data from Hadoop to Oracle Database
An Oracle White Paper June 2012 High Performance Connectors for Load and Access of Data from Hadoop to Oracle Database Executive Overview... 1 Introduction... 1 Oracle Loader for Hadoop... 2 Oracle Direct
More informationMellanox Accelerated Storage Solutions
Mellanox Accelerated Storage Solutions Moving Data Efficiently In an era of exponential data growth, storage infrastructures are being pushed to the limits of their capacity and data delivery capabilities.
More informationHADOOP ON ORACLE ZFS STORAGE A TECHNICAL OVERVIEW
HADOOP ON ORACLE ZFS STORAGE A TECHNICAL OVERVIEW 757 Maleta Lane, Suite 201 Castle Rock, CO 80108 Brett Weninger, Managing Director brett.weninger@adurant.com Dave Smelker, Managing Principal dave.smelker@adurant.com
More informationHortonworks Data Platform Reference Architecture
Hortonworks Data Platform Reference Architecture A PSSC Labs Reference Architecture Guide December 2014 Introduction PSSC Labs continues to bring innovative compute server and cluster platforms to market.
More informationIntel Core i3-2310m Processor (3M Cache, 2.10 GHz)
Intel Core i3-2310m Processor All Essentials Memory Specifications Essentials Status Launched Compare w (0) Graphics Specifications Launch Date Q1'11 Expansion Options Package Specifications Advanced Technologies
More informationScalable. Reliable. Flexible. High Performance Architecture. Fault Tolerant System Design. Expansion Options for Unique Business Needs
Protecting the Data That Drives Business SecureSphere Appliances Scalable. Reliable. Flexible. Imperva SecureSphere appliances provide superior performance and resiliency for demanding network environments.
More informationHadoop implementation of MapReduce computational model. Ján Vaňo
Hadoop implementation of MapReduce computational model Ján Vaňo What is MapReduce? A computational model published in a paper by Google in 2004 Based on distributed computation Complements Google s distributed
More informationCisco Unified Computing System and EMC VNXe3300 Unified Storage System
Cisco Unified Computing System and EMC VNXe3300 Unified Storage System An Ideal Solution for SMB Server Consolidation White Paper January 2011, Revision 1.0 Contents Cisco UCS C250 M2 Extended-Memory Rack-Mount
More informationHadoop: A Framework for Data- Intensive Distributed Computing. CS561-Spring 2012 WPI, Mohamed Y. Eltabakh
1 Hadoop: A Framework for Data- Intensive Distributed Computing CS561-Spring 2012 WPI, Mohamed Y. Eltabakh 2 What is Hadoop? Hadoop is a software framework for distributed processing of large datasets
More informationSun Constellation System: The Open Petascale Computing Architecture
CAS2K7 13 September, 2007 Sun Constellation System: The Open Petascale Computing Architecture John Fragalla Senior HPC Technical Specialist Global Systems Practice Sun Microsystems, Inc. 25 Years of Technical
More informationCisco 7816-I5 Media Convergence Server
Cisco 7816-I5 Media Convergence Server Cisco Unified Communications Solutions unify voice, video, data, and mobile applications on fixed and mobile networks, enabling easy collaboration every time from
More informationUnderstanding Hadoop Performance on Lustre
Understanding Hadoop Performance on Lustre Stephen Skory, PhD Seagate Technology Collaborators Kelsie Betsch, Daniel Kaslovsky, Daniel Lingenfelter, Dimitar Vlassarev, and Zhenzhen Yan LUG Conference 15
More informationSUN HARDWARE FROM ORACLE: PRICING FOR EDUCATION
SUN HARDWARE FROM ORACLE: PRICING FOR EDUCATION AFFORDABLE, RELIABLE, AND GREAT PRICES FOR EDUCATION Optimized Sun systems run Oracle and other leading operating and virtualization platforms with greater
More informationInfiniBand Switch System Family. Highest Levels of Scalability, Simplified Network Manageability, Maximum System Productivity
InfiniBand Switch System Family Highest Levels of Scalability, Simplified Network Manageability, Maximum System Productivity Mellanox continues its leadership by providing InfiniBand SDN Switch Systems
More informationCisco UCS and Fusion- io take Big Data workloads to extreme performance in a small footprint: A case study with Oracle NoSQL database
Cisco UCS and Fusion- io take Big Data workloads to extreme performance in a small footprint: A case study with Oracle NoSQL database Built up on Cisco s big data common platform architecture (CPA), a
More informationHADOOP AT NOKIA JOSH DEVINS, NOKIA HADOOP MEETUP, JANUARY 2011 BERLIN
HADOOP AT NOKIA JOSH DEVINS, NOKIA HADOOP MEETUP, JANUARY 2011 BERLIN Two parts: * technical setup * applications before starting Question: Hadoop experience levels from none to some to lots, and what
More informationBig Data - Infrastructure Considerations
April 2014, HAPPIEST MINDS TECHNOLOGIES Big Data - Infrastructure Considerations Author Anand Veeramani / Deepak Shivamurthy SHARING. MINDFUL. INTEGRITY. LEARNING. EXCELLENCE. SOCIAL RESPONSIBILITY. Copyright
More informationOracle Database Scalability in VMware ESX VMware ESX 3.5
Performance Study Oracle Database Scalability in VMware ESX VMware ESX 3.5 Database applications running on individual physical servers represent a large consolidation opportunity. However enterprises
More informationScalable. Reliable. Flexible. High Performance Architecture. Fault Tolerant System Design. Expansion Options for Unique Business Needs
Protecting the Data That Drives Business SecureSphere Appliances Scalable. Reliable. Flexible. Imperva SecureSphere appliances provide superior performance and resiliency for demanding network environments.
More informationRemoving Performance Bottlenecks in Databases with Red Hat Enterprise Linux and Violin Memory Flash Storage Arrays. Red Hat Performance Engineering
Removing Performance Bottlenecks in Databases with Red Hat Enterprise Linux and Violin Memory Flash Storage Arrays Red Hat Performance Engineering Version 1.0 August 2013 1801 Varsity Drive Raleigh NC
More informationBig Data Analytics. with EMC Greenplum and Hadoop. Big Data Analytics. Ofir Manor Pre Sales Technical Architect EMC Greenplum
Big Data Analytics with EMC Greenplum and Hadoop Big Data Analytics with EMC Greenplum and Hadoop Ofir Manor Pre Sales Technical Architect EMC Greenplum 1 Big Data and the Data Warehouse Potential All
More informationORACLE BIG DATA APPLIANCE X3-2
ORACLE BIG DATA APPLIANCE X3-2 BIG DATA FOR THE ENTERPRISE KEY FEATURES Massively scalable infrastructure to store and manage big data Big Data Connectors delivers load rates of up to 12TB per hour between
More informationHadoop Distributed File System. Dhruba Borthakur Apache Hadoop Project Management Committee dhruba@apache.org dhruba@facebook.com
Hadoop Distributed File System Dhruba Borthakur Apache Hadoop Project Management Committee dhruba@apache.org dhruba@facebook.com Hadoop, Why? Need to process huge datasets on large clusters of computers
More informationPCI Express and Storage. Ron Emerick, Sun Microsystems
Ron Emerick, Sun Microsystems SNIA Legal Notice The material contained in this tutorial is copyrighted by the SNIA. Member companies and individuals may use this material in presentations and literature
More informationConverged storage architecture for Oracle RAC based on NVMe SSDs and standard x86 servers
Converged storage architecture for Oracle RAC based on NVMe SSDs and standard x86 servers White Paper rev. 2015-11-27 2015 FlashGrid Inc. 1 www.flashgrid.io Abstract Oracle Real Application Clusters (RAC)
More informationCDH AND BUSINESS CONTINUITY:
WHITE PAPER CDH AND BUSINESS CONTINUITY: An overview of the availability, data protection and disaster recovery features in Hadoop Abstract Using the sophisticated built-in capabilities of CDH for tunable
More informationHP ProLiant BL660c Gen9 and Microsoft SQL Server 2014 technical brief
Technical white paper HP ProLiant BL660c Gen9 and Microsoft SQL Server 2014 technical brief Scale-up your Microsoft SQL Server environment to new heights Table of contents Executive summary... 2 Introduction...
More informationCloudera Enterprise Reference Architecture for Google Cloud Platform Deployments
Cloudera Enterprise Reference Architecture for Google Cloud Platform Deployments Important Notice 2010-2015 Cloudera, Inc. All rights reserved. Cloudera, the Cloudera logo, Cloudera Impala, Impala, and
More informationAccelerating Enterprise Applications and Reducing TCO with SanDisk ZetaScale Software
WHITEPAPER Accelerating Enterprise Applications and Reducing TCO with SanDisk ZetaScale Software SanDisk ZetaScale software unlocks the full benefits of flash for In-Memory Compute and NoSQL applications
More informationIBM System x family brochure
IBM Systems and Technology System x IBM System x family brochure IBM System x rack and tower servers 2 IBM System x family brochure IBM System x servers Highlights IBM System x and BladeCenter servers
More informationIntel RAID SSD Cache Controller RCS25ZB040
SOLUTION Brief Intel RAID SSD Cache Controller RCS25ZB040 When Faster Matters Cost-Effective Intelligent RAID with Embedded High Performance Flash Intel RAID SSD Cache Controller RCS25ZB040 When Faster
More informationBig Data With Hadoop
With Saurabh Singh singh.903@osu.edu The Ohio State University February 11, 2016 Overview 1 2 3 Requirements Ecosystem Resilient Distributed Datasets (RDDs) Example Code vs Mapreduce 4 5 Source: [Tutorials
More informationWhite Paper Solarflare High-Performance Computing (HPC) Applications
Solarflare High-Performance Computing (HPC) Applications 10G Ethernet: Now Ready for Low-Latency HPC Applications Solarflare extends the benefits of its low-latency, high-bandwidth 10GbE server adapters
More informationDEPLOYING AND MONITORING HADOOP MAP-REDUCE ANALYTICS ON SINGLE-CHIP CLOUD COMPUTER
DEPLOYING AND MONITORING HADOOP MAP-REDUCE ANALYTICS ON SINGLE-CHIP CLOUD COMPUTER ANDREAS-LAZAROS GEORGIADIS, SOTIRIOS XYDIS, DIMITRIOS SOUDRIS MICROPROCESSOR AND MICROSYSTEMS LABORATORY ELECTRICAL AND
More informationData-Intensive Computing with Map-Reduce and Hadoop
Data-Intensive Computing with Map-Reduce and Hadoop Shamil Humbetov Department of Computer Engineering Qafqaz University Baku, Azerbaijan humbetov@gmail.com Abstract Every day, we create 2.5 quintillion
More informationOracle Exadata: The World s Fastest Database Machine Exadata Database Machine Architecture
Oracle Exadata: The World s Fastest Database Machine Exadata Database Machine Architecture Ron Weiss, Exadata Product Management Exadata Database Machine Best Platform to Run the
More informationAccelerate Big Data Analysis with Intel Technologies
White Paper Intel Xeon processor E7 v2 Big Data Analysis Accelerate Big Data Analysis with Intel Technologies Executive Summary It s not very often that a disruptive technology changes the way enterprises
More informationVirtual Compute Appliance Frequently Asked Questions
General Overview What is Oracle s Virtual Compute Appliance? Oracle s Virtual Compute Appliance is an integrated, wire once, software-defined infrastructure system designed for rapid deployment of both
More informationPSAM, NEC PCIe SSD Appliance for Microsoft SQL Server (Reference Architecture) September 11 th, 2014 NEC Corporation
PSAM, NEC PCIe SSD Appliance for Microsoft SQL Server (Reference Architecture) September 11 th, 2014 NEC Corporation 1. Overview of NEC PCIe SSD Appliance for Microsoft SQL Server Page 2 NEC Corporation
More information