Greenplum Analytics Workbench
Abstract

This white paper details how the Greenplum Analytics Workbench was designed and built to validate Apache Hadoop code at scale, and to provide a large-scale experimentation environment for mixed-mode development that includes various SQL and non-SQL execution environments. It describes the core architectural components involved and highlights the benefits an enterprise can leverage to quickly and efficiently analyze Big Data.

Table of Contents

- Introduction
- Partners
  - Supermicro
  - Micron
  - Intel
  - Seagate
  - Mellanox
    - Mellanox ConnectX-3 VPI Network Adapter Card
    - Mellanox ConnectX-3 VPI Card Specifications
    - Mellanox SwitchX VPI Switches
    - Mellanox SwitchX VPI Switch Specifications
    - Mellanox Cables
    - Mellanox FDR Passive Copper and Optical Cable
    - Mellanox Unstructured Data Accelerator (UDA)
  - Switch
  - Rubicon
- Hadoop Software Overview
  - Hadoop MapReduce
  - Hadoop Distributed File System
  - Hadoop Ecosystem
- Hadoop on the Greenplum Analytics Workbench
- TeraSort Example
- About Greenplum
Introduction

Enterprises have been dealing with storing rapidly growing amounts of data, truly big data, arriving from traditional sources such as ERP and CRM systems and now from social media, blogs, and other channels. The initial focus for the enterprise was on efficient storage of big data, but the focus has now shifted to analytics on those big data sets. Hadoop has emerged as the platform of choice for processing big data, especially unstructured data. The out-of-box experience with Hadoop still leaves much to be desired, as it lacks the tools needed for easy deployment and management of such an infrastructure, especially for larger deployments. Enterprises are also quickly realizing that in order to maximize their analytics capabilities, a mixed-mode environment is imperative: one in which structured data sets (using traditional SQL) and unstructured data sets (using Hadoop/NoSQL) can be combined easily without reworking existing processes.

The Greenplum Analytics Workbench is built to provide an environment that supports mixed-mode development and validation at scale. The workbench is pre-configured with openly and freely available data sets and has analysis software built in for quick turnaround and rapid productivity. It contains the entire Hadoop stack, consisting of HDFS, MapReduce, Pig, Hive, HBase, and Mahout, and augments it with the SQL capabilities of Greenplum Database, the industry's leading MPP database, all deployed on the same nodes. The Analytics Workbench provides an ideal experimentation platform for Greenplum's thought leadership in the Unified Analytics Platform. It also provides a tremendous learning opportunity for organizations that wish to build and operate a large Hadoop/mixed-mode cluster.

Partners

The Analytics Workbench was assembled with the help of strong partnerships with some of the industry's leading vendors of hardware and services.
The Greenplum team forged close alliances with these partners to carefully assemble the hardware nodes and to rack, stack, and cable them in a state-of-the-art datacenter. An extremely fast network backbone and switching layer was designed to provide the cluster with blazing-fast throughput for intra-cluster communication.
Supermicro

Supermicro contributed 1,000 enterprise-ready server systems for all data-processing nodes, powering the 24 petabytes of storage available on the cluster. Supermicro fully assembled and tested the data-processing nodes in its Silicon Valley production center. The assembly process included integrating processors, memory, disk drives, and network cards from the other contributing partners. Supermicro's design team optimized the system configuration to address both datacenter space and power challenges. Supermicro servers maintain peak power efficiency with platinum-level, 94%-efficient power supplies. Supermicro then maximized node count per rack, reducing power and cooling overhead while delivering high performance and meeting the 24-terabyte storage requirement per data-processing node.

The specifications of the Supermicro systems are as follows:

2U Greenplum Hadoop OEM Server - Model # PIO-626T-6RF-EM09B

Supermicro dual-processor motherboard supporting:
- Dual Intel Xeon X5500/X5600 series processors
- Up to 192GB RAM with 12 DDR3 RDIMMs
- 5 expansion slots: 3 PCI-E 2.0 x8, 1 PCI-E 2.0 x4, 1 PCI-E x4
- Onboard LSI Gbps disk controller (IR mode)
- Dual LAN with Intel Gigabit Ethernet controller
- Dedicated IPMI remote management port

Supermicro 2U server chassis supporting:
- 12 hot-swap 3.5" drive trays
- Redundant 500-watt platinum-level power supplies with 94+% efficiency rating
- Active backplane with 6.0Gbps SAS/SATA expander
- 7 low-profile PCI-E expansion slots

Micron

Micron contributed 6,000 DDR3 RDIMM memory modules of 8GB each, a total of 48TB of memory distributed evenly across the 1,000 nodes so that each node has 48GB of RAM.
DDR3 functionality and operations supported:
- 240-pin registered dual in-line memory module (RDIMM)
- Fast data transfer rates: PC , PC3-8500, or PC GB (512 Meg x 72)
- VDD = 1.5V ±0.075V
- VDDSPD = V
- Supports ECC error detection and correction
- Nominal and dynamic on-die termination (ODT) for data, strobe, and mask signals
- Quad rank
- On-board I2C temperature sensor with integrated serial presence-detect (SPD) EEPROM
- Fixed burst chop (BC) of 4 and burst length (BL) of 8 via the mode register set (MRS)
- Selectable BC4 or BL8 on-the-fly (OTF)
- Gold edge contacts
Intel

Intel contributed 2,000 Westmere processors with the following specifications:

Processor Number: X5670
# of Cores: 6
# of Threads: 12
Clock Speed: 2.93 GHz
Max Turbo Frequency: 3.33 GHz
Intel Smart Cache: 12 MB
Bus/Core Ratio: 22
Intel QPI Speed: 6.4 GT/s
# of QPI Links: 2
Instruction Set: 64-bit
Instruction Set Extensions: SSE4.2
Embedded Options Available: No
Lithography: 32 nm
Max TDP: 95 W
Max Memory Size (dependent on memory type): 288 GB
Memory Types: DDR3-800/1066/1333
# of Memory Channels: 3
Max Memory Bandwidth: 32 GB/s
Physical Address Extensions: 40-bit
Supported technologies: ECC memory, Intel Turbo Boost Technology, Intel Hyper-Threading Technology, Intel Virtualization Technology (VT-x), Intel Virtualization Technology for Directed I/O (VT-d), Intel Trusted Execution Technology, AES New Instructions, Intel 64, Idle States, Enhanced Intel SpeedStep Technology, Intel Demand Based Switching, Thermal Monitoring Technologies, Execute Disable Bit

Table 1: (courtesy:
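The 32 GB/s maximum memory bandwidth quoted above follows directly from the other figures in the table; a quick sketch of the arithmetic (channel count and DDR3-1333 speed taken from the table, the 64-bit bus width being standard for DDR3):

```python
# Max memory bandwidth = channels x transfer rate x bus width.
CHANNELS = 3                 # memory channels per socket (from the table)
TRANSFERS_PER_SEC = 1333e6   # DDR3-1333, the fastest supported speed
BYTES_PER_TRANSFER = 8       # 64-bit data bus per channel

bandwidth_gb_s = CHANNELS * TRANSFERS_PER_SEC * BYTES_PER_TRANSFER / 1e9
print(f"{bandwidth_gb_s:.0f} GB/s")  # ~32 GB/s, matching the spec table
```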
Seagate

Seagate contributed 12,000 2TB drives, for a total of 24TB per node and 24PB of raw storage across the cluster. The specification of the Seagate drives is as follows:

Product Name: 2TB Constellation ES 7200RPM 3.5" SATA 6Gb/s 64MB Cache Hard Drive
Product Type: Hard Drive
Buffer: 64 MB
Hard Drive Interface: SATA/600
Compatible Drive Bay Width: 3.5"
SATA Pins: 7-pin
Height: 1.0 in
Width: 4.0 in
Depth: 5.8 in
Product Series: ES
Form Factor: Internal
Product Model: ST2000NM0011
Product Line: Constellation
Storage Capacity: 2 TB
Rotational Speed: 7200 rpm
Maximum External Data Transfer Rate: 600 MBps (4.7 Gbps)
Average Latency: 4.16 ms
Average Seek Time: 9.50 ms

Table 2: (courtesy:

Mellanox

Mellanox ConnectX-3 VPI Network Adapter Card

Mellanox ConnectX-3 VPI card specifications:

Part Number: MCX354A-FCBT
Supported Data Rates, InfiniBand: FDR; QDR; DDR
Supported Data Rates, Ethernet: 40GbE; 10GbE
PCI Express generations supported: 3.0; 2.0; 1.1
RDMA Support: InfiniBand; RoCE
Supported Media Types: Direct Attached Copper; Active Optical Cables; Optical Modules
Number of Ports and Type: 2 ports, QSFP+
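The per-node and cluster-wide sizing figures quoted so far are internally consistent; a quick cross-check using only the counts stated in this paper:

```python
# Sanity-check of the cluster sizing figures quoted in the paper.
NODES = 1000       # data-processing nodes
DRIVES = 12_000    # Seagate drives contributed
DRIVE_TB = 2       # capacity per drive
DIMMS = 6000       # Micron RDIMMs contributed
DIMM_GB = 8        # capacity per DIMM

total_storage_tb = DRIVES * DRIVE_TB              # 24,000 TB = 24 PB raw
storage_per_node_tb = total_storage_tb / NODES    # 24 TB/node (12 x 2TB drives)

total_ram_tb = DIMMS * DIMM_GB / 1000             # 48 TB of RAM cluster-wide
ram_per_node_gb = DIMMS * DIMM_GB / NODES         # 48 GB/node (6 DIMMs each)

print(total_storage_tb, storage_per_node_tb, total_ram_tb, ram_per_node_gb)
```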
Mellanox SwitchX VPI Switches

Mellanox SwitchX VPI switch specifications:

Part Number: MSX6036F-1SFR
Supported Data Rates, InfiniBand: FDR; QDR; DDR
Supported Data Rates, Ethernet: 40GbE; 10GbE
Port-to-Port Latency (InfiniBand): 170ns
Port-to-Port Latency (Ethernet): 230ns
Blocking Ratio: 1:1 (fully non-blocking)
Number of Ports and Type: 36 ports, QSFP+
Typical Power Consumption: 126W
Supported Media Types: Direct Attached Copper; Active Optical Cables; Optical Modules

Mellanox Cables

Mellanox FDR Passive Copper and Optical Cable

Greenplum Analytics Workbench connectivity is enabled by Mellanox's FDR cables. Both passive copper and active optical cables are used to provide a state-of-the-art cluster cabling solution as well as durability and ease of installation.

Mellanox Unstructured Data Accelerator (UDA)

Mellanox UDA, a software plugin, accelerates Hadoop networking and improves the scaling of Hadoop clusters running data-analytics-intensive applications. A novel data-moving protocol, which uses RDMA in combination with an efficient merge-sort algorithm, enables Hadoop clusters based on Mellanox InfiniBand and 40/10GbE RoCE (RDMA over Converged Ethernet) adapter cards to move data efficiently between servers, accelerating the Hadoop framework.

The 1,000-node Hadoop cluster is connected via a blazing-fast FDR 56Gbps InfiniBand interconnect, using ConnectX-3 VPI cards and SwitchX VPI switches, as described above. The cluster uses three layers of switching:

- Node-level switches that connect the 20 servers in each rack to the aggregation layer, using 4 FDR uplinks from each Top-of-Rack (ToR) switch.
- Aggregation-layer switches that connect to the core layer using 18 uplinks, delivering a fully non-blocking InfiniBand network between the aggregation and core levels.
- Core-level switches.

The cluster uses the IP-over-IB protocol to enable a more efficient connection to the socket-based portions of the framework.
UDA gives the MapReduce portion of the framework the ability to use RDMA connectivity between nodes, reducing CPU overhead and enabling lower-latency connections. The outcome of RDMA usage is a significant reduction in processing time for data sets of the same size.
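The rack-level figures above imply a modest oversubscription at the top-of-rack layer; a back-of-the-envelope check, assuming a single FDR port per server (the paper does not state the per-server port count explicitly):

```python
# Oversubscription at the ToR layer, from the figures quoted in the paper.
FDR_GBPS = 56          # FDR InfiniBand link speed
SERVERS_PER_RACK = 20  # from the rack layout
TOR_UPLINKS = 4        # FDR uplinks from each ToR switch

downlink_gbps = SERVERS_PER_RACK * FDR_GBPS  # 1120 Gb/s into the rack
uplink_gbps = TOR_UPLINKS * FDR_GBPS         # 224 Gb/s toward aggregation

oversubscription = downlink_gbps / uplink_gbps
print(f"ToR oversubscription: {oversubscription:.0f}:1")  # 5:1
```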
The interconnect layout of the network is as follows:

Switch

Switch is the state-of-the-art datacenter in Las Vegas, NV where the Analytics Workbench is hosted. The cluster occupies almost three full SCIFs consisting of 54 racks in all. Each rack holds 20 servers. A few racks are not completely full, leaving some room for expansion. The Switch datacenter will also be able to accommodate future growth in other SCIFs with no apparent impact on the overall cluster. The racks are divided into data racks, core racks, and infrastructure racks. Infrastructure racks hold servers for Puppet, Nagios, Ganglia, DNS, DHCP, etc., whereas the core racks hold the servers for the name node, job tracker node, ZooKeeper, HBase, etc. The rack layout is as follows:
Rubicon

The Rubicon team, part of VMware, provides Tier-1 and Tier-2 support for the cluster. This includes monitoring the network and hardware and running various system-level monitoring checks. The team uses Zabbix for systems management and has developed sophisticated plug-ins and dashboards on top of it. The Rubicon team has a local presence in Las Vegas and can provide rapid response to critical issues within the cluster.

Hadoop Software Overview

Hadoop is an industry-leading open source distributed data-processing framework that is designed to scale with the growing data storage and compute needs of an organization. By using the same nodes for both compute and storage, a cluster can scale in both dimensions simultaneously and avoid the traditional bottlenecks of NAS/SAN-type architectures. Some of the key components of Hadoop:

- Hadoop MapReduce: the parallel task-processing mechanism that takes a query (job) and runs it in parallel on multiple nodes. The parallelism provides much better throughput for unstructured data sets that can be processed independently.
- Hadoop Distributed File System (HDFS): the base file system layer that stores data across all of the nodes in the cluster.

MapReduce as a computing paradigm was popularized by Google, and Hadoop was written and open sourced by Yahoo as an implementation of that paradigm.

Hadoop MapReduce

Hadoop MapReduce is a software framework for easily writing applications that process large amounts of data in parallel on large clusters of commodity compute nodes. The diagram below depicts the basics of the MapReduce workflow. A MapReduce job (query) usually splits the input data set into independent chunks; the size of each chunk depends on a system-wide setting (typically 64MB). Each block is processed by the map tasks in a completely parallel manner. The framework sorts the outputs of the maps, which are then used as input to the reduce tasks.
Typically, both the input and the output of a job are stored in HDFS. The framework takes care of scheduling tasks, monitoring them, and re-executing failed tasks.
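The split/map/sort/reduce flow described above can be sketched in a few lines. This is a local, single-process analogue of a word-count job, not the actual Hadoop API; the input strings and their chunking are made up for illustration:

```python
from itertools import groupby

def map_phase(chunk):
    # Map: emit a (word, 1) pair for every word in the chunk.
    return [(word, 1) for word in chunk.split()]

def reduce_phase(key, values):
    # Reduce: sum the counts for one key.
    return (key, sum(values))

# Pretend each string is one input split (a ~64MB block in a real cluster).
chunks = ["big data big", "data analytics", "big analytics"]

# Map every chunk independently (in Hadoop, in parallel across nodes).
mapped = [pair for chunk in chunks for pair in map_phase(chunk)]

# Shuffle/sort: group intermediate pairs by key, as the framework does.
mapped.sort(key=lambda kv: kv[0])
reduced = dict(
    reduce_phase(key, (v for _, v in group))
    for key, group in groupby(mapped, key=lambda kv: kv[0])
)
print(reduced)  # {'analytics': 2, 'big': 3, 'data': 2}
```

In a real job the map tasks run on the nodes holding the input blocks, and the sorted intermediate output is shuffled across the network to the reduce tasks.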
Typically, in a Hadoop cluster, the MapReduce compute nodes and the storage layer (HDFS) reside on the same set of nodes. The system is configured to be rack-aware, making it possible for the framework to schedule tasks on the nodes where the data is already present, minimizing data movement within the cluster. This is the compute layer that derives key insights from the data residing in the HDFS layer. Hadoop is written entirely in Java, but MapReduce applications do not need to be: applications can use the Hadoop Streaming interface to specify any executable as the mapper or reducer for a particular job.

The MapReduce framework consists of the following:

- JobTracker: a single master per cluster that schedules, monitors, and manages jobs as well as their component tasks.
- TaskTracker: one slave TaskTracker per cluster node, executing the task components of a job as directed by the JobTracker.

In the upcoming release of Hadoop, the resource management module will undergo a drastic rework. It will maintain backwards compatibility while splitting the resource management capabilities into a standalone module.

Hadoop Distributed File System

The Hadoop Distributed File System (HDFS) is a block-based file system that allows user data to be stored in files. It retains the look and feel of a Linux file system, so users and applications can create or remove files and directories as well as move or rename them. HDFS does not support hard or soft links. All HDFS communication is layered on top of the TCP/IP protocol. The key components of HDFS are:

- NameNode: a single master metadata server that holds in-memory maps of every file, the blocks within each file, and the locations of those blocks within the HDFS namespace. In the upcoming release of Hadoop, a NameNode HA feature will be introduced to relieve some stress points of the existing design (such as the single point of failure).
- DataNode: one slave DataNode per cluster node, serving read/write requests and performing block creation, deletion, and replication as directed by the NameNode.

This is the storage layer where all the data resides before a MapReduce job can run on it. HDFS uses block replication to spread the data around the Hadoop cluster, both for protection and for data locality, so that MapReduce jobs can run against the same data on multiple compute nodes. The default block size is 64 MB and the default replication factor is 3x. The copies are written in a rack-aware manner so that all three copies do not reside on the same rack; the central idea behind replication is that if a rack goes down, the system still has access to as much of the full data set as possible.
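The rack-aware placement rule can be sketched as follows. This is a simplified stand-in for HDFS's actual block placement policy, not its implementation; the rack names and node layout are invented for illustration:

```python
import random

def place_replicas(racks, replication=3):
    """Pick `replication` nodes such that not all copies share a rack.

    `racks` maps rack name -> list of node names. Mirrors the spirit of
    the default HDFS policy: one replica on one rack, and the remaining
    replicas together on a second rack.
    """
    rack_a, rack_b = random.sample(sorted(racks), 2)
    first = random.choice(racks[rack_a])
    others = random.sample(racks[rack_b], replication - 1)
    return [first] + others

# Hypothetical layout: 20 servers per rack, as in the Workbench.
cluster = {f"rack{r}": [f"node{r}-{n}" for n in range(20)] for r in range(3)}
replicas = place_replicas(cluster)
print(replicas)  # e.g. ['node2-7', 'node0-3', 'node0-11'], spanning 2 racks
```

A rack failure then costs at most two of the three copies of any block, which is why the policy never lands all replicas on one rack.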
Hadoop Ecosystem

The Hadoop ecosystem consists of the following main building blocks:

- Hive: a SQL-like ad hoc querying interface for data stored in HDFS
- HBase: a column-oriented structured storage system on top of HDFS
- Pig: a high-level data flow language and execution framework for parallel computation
- Mahout: scalable machine learning algorithms on Hadoop

The above is not an exhaustive list of all Hadoop ecosystem components.

[Stack diagram: Analysis (Mahout); Workflow Mgmt. (Spring Batch, Oozie); Languages (Hive, M/R, Pig); Execution Env. (HBase); File System (HDFS)]

Hadoop on the Greenplum Analytics Workbench

For most users, a typical Hadoop cluster consists of a name node, a few other master nodes, and a large number of data nodes. The diagram below shows the Hadoop data nodes and the corresponding master nodes. A few of the master roles are hosted on the same machine to begin with; this may change depending on the load on the system.
In reality, a typical Hadoop cluster is supported by a number of additional roles, as shown in the diagram below. The table below provides a brief description of each server role:

- Access: these nodes are used to access the cluster; there is no direct access to the data nodes from outside. Typically these nodes support ssh-based connectivity.
- Data ingestion: data ingestion nodes are used for bulk upload of data into the cluster. These nodes can be used as a staging area for further processing prior to loading into HDFS.
- Web-based management: these nodes are used for accessing the cluster via HTTP.
- Jenkins: Jenkins is an open source continuous integration framework. The server is used to build Hadoop code on demand or on a predefined trigger, and provides a dashboard to view the results.
- Ganglia, Nagios, and Zabbix: these systems are used to monitor the cluster. For the Analytics Workbench, Zabbix is currently used to monitor system-level statistics, whereas the Nagios and Ganglia combination is used for application-level monitoring.
- Plato server: deployed to monitor the health of the disks; actively monitored by the Rubicon team.
- Kickstart, DNS, DHCP, and NTP: the Kickstart server is used to load the base OS onto the nodes, whereas DNS, DHCP, and NTP are used for network management.
- YUM repo, Puppet master, Kerberos: the YUM repo serves as a repository for RHEL packages. It is used by the Puppet master and the Puppet agent running on each data node to obtain the packages the slaves need for deployment. Kerberos is used as an authentication mechanism (needed as part of a secure Hadoop implementation).
- UFM: the unified fabric manager for the Mellanox network.
TeraSort Example

The industry-standard TeraSort benchmark was run on the cluster. The cluster configuration was not tuned to the best possible settings; the intent of the run was simply to validate the general health of the cluster and to measure TeraSort run characteristics. The first run was against 1TB of data and the second against 10TB. There are plans to run 100TB and even 1PB sorts in the near future. The results are shown below:

ABOUT GREENPLUM

Greenplum, a division of EMC, is driving the future of Big Data analytics with breakthrough products that harness the skills of data science teams to help global organizations realize the full promise of business agility and become data-driven, predictive enterprises. The division's products include Greenplum Unified Analytics Platform, Greenplum Data Computing Appliance, Greenplum Database, Greenplum Analytics Lab, Greenplum HD, and Greenplum Chorus. They embody the power of open systems, cloud computing, virtualization, and social collaboration, enabling global organizations to gain greater insight and value from their data than ever before possible. Learn more at

CONTACT US

To learn more about how Greenplum products, services, and solutions can help you realize Big Data analytics opportunities, visit us at

Greenplum, a Division of EMC, 1900 South Norfolk Street, San Mateo, CA Tel:

EMC2, EMC, the EMC logo, and Greenplum are registered trademarks or trademarks of EMC Corporation in the United States and other countries. All other trademarks used herein are the property of their respective owners. Copyright 2012 EMC Corporation. All rights reserved. Published in the USA. 05/12 White Paper
IT@Intel White Paper Intel IT IT Best Practices: Data Center Solutions Server Virtualization August 2010 Analyzing the Virtualization Deployment Advantages of Two- and Four-Socket Server Platforms Executive
More informationCost-Effective Business Intelligence with Red Hat and Open Source
Cost-Effective Business Intelligence with Red Hat and Open Source Sherman Wood Director, Business Intelligence, Jaspersoft September 3, 2009 1 Agenda Introductions Quick survey What is BI?: reporting,
More informationIntroduction to Hadoop. New York Oracle User Group Vikas Sawhney
Introduction to Hadoop New York Oracle User Group Vikas Sawhney GENERAL AGENDA Driving Factors behind BIG-DATA NOSQL Database 2014 Database Landscape Hadoop Architecture Map/Reduce Hadoop Eco-system Hadoop
More informationPrepared By : Manoj Kumar Joshi & Vikas Sawhney
Prepared By : Manoj Kumar Joshi & Vikas Sawhney General Agenda Introduction to Hadoop Architecture Acknowledgement Thanks to all the authors who left their selfexplanatory images on the internet. Thanks
More informationCisco UCS B440 M2 High-Performance Blade Server
Data Sheet Cisco UCS B440 M2 High-Performance Blade Server Product Overview The Cisco UCS B440 M2 High-Performance Blade Server delivers the performance, scalability and reliability to power computation-intensive,
More informationFLOW-3D Performance Benchmark and Profiling. September 2012
FLOW-3D Performance Benchmark and Profiling September 2012 Note The following research was performed under the HPC Advisory Council activities Participating vendors: FLOW-3D, Dell, Intel, Mellanox Compute
More informationExploiting Remote Memory Operations to Design Efficient Reconfiguration for Shared Data-Centers over InfiniBand
Exploiting Remote Memory Operations to Design Efficient Reconfiguration for Shared Data-Centers over InfiniBand P. Balaji, K. Vaidyanathan, S. Narravula, K. Savitha, H. W. Jin D. K. Panda Network Based
More informationReference Design: Scalable Object Storage with Seagate Kinetic, Supermicro, and SwiftStack
Reference Design: Scalable Object Storage with Seagate Kinetic, Supermicro, and SwiftStack May 2015 Copyright 2015 SwiftStack, Inc. swiftstack.com Page 1 of 19 Table of Contents INTRODUCTION... 3 OpenStack
More informationA very short Intro to Hadoop
4 Overview A very short Intro to Hadoop photo by: exfordy, flickr 5 How to Crunch a Petabyte? Lots of disks, spinning all the time Redundancy, since disks die Lots of CPU cores, working all the time Retry,
More informationLecture 5: GFS & HDFS! Claudia Hauff (Web Information Systems)! ti2736b-ewi@tudelft.nl
Big Data Processing, 2014/15 Lecture 5: GFS & HDFS!! Claudia Hauff (Web Information Systems)! ti2736b-ewi@tudelft.nl 1 Course content Introduction Data streams 1 & 2 The MapReduce paradigm Looking behind
More informationAn Oracle White Paper June 2012. High Performance Connectors for Load and Access of Data from Hadoop to Oracle Database
An Oracle White Paper June 2012 High Performance Connectors for Load and Access of Data from Hadoop to Oracle Database Executive Overview... 1 Introduction... 1 Oracle Loader for Hadoop... 2 Oracle Direct
More informationMellanox Accelerated Storage Solutions
Mellanox Accelerated Storage Solutions Moving Data Efficiently In an era of exponential data growth, storage infrastructures are being pushed to the limits of their capacity and data delivery capabilities.
More informationHADOOP ON ORACLE ZFS STORAGE A TECHNICAL OVERVIEW
HADOOP ON ORACLE ZFS STORAGE A TECHNICAL OVERVIEW 757 Maleta Lane, Suite 201 Castle Rock, CO 80108 Brett Weninger, Managing Director brett.weninger@adurant.com Dave Smelker, Managing Principal dave.smelker@adurant.com
More informationHortonworks Data Platform Reference Architecture
Hortonworks Data Platform Reference Architecture A PSSC Labs Reference Architecture Guide December 2014 Introduction PSSC Labs continues to bring innovative compute server and cluster platforms to market.
More informationIntel Core i3-2310m Processor (3M Cache, 2.10 GHz)
Intel Core i3-2310m Processor All Essentials Memory Specifications Essentials Status Launched Compare w (0) Graphics Specifications Launch Date Q1'11 Expansion Options Package Specifications Advanced Technologies
More informationScalable. Reliable. Flexible. High Performance Architecture. Fault Tolerant System Design. Expansion Options for Unique Business Needs
Protecting the Data That Drives Business SecureSphere Appliances Scalable. Reliable. Flexible. Imperva SecureSphere appliances provide superior performance and resiliency for demanding network environments.
More informationHadoop implementation of MapReduce computational model. Ján Vaňo
Hadoop implementation of MapReduce computational model Ján Vaňo What is MapReduce? A computational model published in a paper by Google in 2004 Based on distributed computation Complements Google s distributed
More informationCisco Unified Computing System and EMC VNXe3300 Unified Storage System
Cisco Unified Computing System and EMC VNXe3300 Unified Storage System An Ideal Solution for SMB Server Consolidation White Paper January 2011, Revision 1.0 Contents Cisco UCS C250 M2 Extended-Memory Rack-Mount
More informationHadoop: A Framework for Data- Intensive Distributed Computing. CS561-Spring 2012 WPI, Mohamed Y. Eltabakh
1 Hadoop: A Framework for Data- Intensive Distributed Computing CS561-Spring 2012 WPI, Mohamed Y. Eltabakh 2 What is Hadoop? Hadoop is a software framework for distributed processing of large datasets
More informationSun Constellation System: The Open Petascale Computing Architecture
CAS2K7 13 September, 2007 Sun Constellation System: The Open Petascale Computing Architecture John Fragalla Senior HPC Technical Specialist Global Systems Practice Sun Microsystems, Inc. 25 Years of Technical
More informationCisco 7816-I5 Media Convergence Server
Cisco 7816-I5 Media Convergence Server Cisco Unified Communications Solutions unify voice, video, data, and mobile applications on fixed and mobile networks, enabling easy collaboration every time from
More informationUnderstanding Hadoop Performance on Lustre
Understanding Hadoop Performance on Lustre Stephen Skory, PhD Seagate Technology Collaborators Kelsie Betsch, Daniel Kaslovsky, Daniel Lingenfelter, Dimitar Vlassarev, and Zhenzhen Yan LUG Conference 15
More informationSUN HARDWARE FROM ORACLE: PRICING FOR EDUCATION
SUN HARDWARE FROM ORACLE: PRICING FOR EDUCATION AFFORDABLE, RELIABLE, AND GREAT PRICES FOR EDUCATION Optimized Sun systems run Oracle and other leading operating and virtualization platforms with greater
More informationInfiniBand Switch System Family. Highest Levels of Scalability, Simplified Network Manageability, Maximum System Productivity
InfiniBand Switch System Family Highest Levels of Scalability, Simplified Network Manageability, Maximum System Productivity Mellanox continues its leadership by providing InfiniBand SDN Switch Systems
More informationCisco UCS and Fusion- io take Big Data workloads to extreme performance in a small footprint: A case study with Oracle NoSQL database
Cisco UCS and Fusion- io take Big Data workloads to extreme performance in a small footprint: A case study with Oracle NoSQL database Built up on Cisco s big data common platform architecture (CPA), a
More informationHADOOP AT NOKIA JOSH DEVINS, NOKIA HADOOP MEETUP, JANUARY 2011 BERLIN
HADOOP AT NOKIA JOSH DEVINS, NOKIA HADOOP MEETUP, JANUARY 2011 BERLIN Two parts: * technical setup * applications before starting Question: Hadoop experience levels from none to some to lots, and what
More informationBig Data - Infrastructure Considerations
April 2014, HAPPIEST MINDS TECHNOLOGIES Big Data - Infrastructure Considerations Author Anand Veeramani / Deepak Shivamurthy SHARING. MINDFUL. INTEGRITY. LEARNING. EXCELLENCE. SOCIAL RESPONSIBILITY. Copyright
More informationOracle Database Scalability in VMware ESX VMware ESX 3.5
Performance Study Oracle Database Scalability in VMware ESX VMware ESX 3.5 Database applications running on individual physical servers represent a large consolidation opportunity. However enterprises
More informationScalable. Reliable. Flexible. High Performance Architecture. Fault Tolerant System Design. Expansion Options for Unique Business Needs
Protecting the Data That Drives Business SecureSphere Appliances Scalable. Reliable. Flexible. Imperva SecureSphere appliances provide superior performance and resiliency for demanding network environments.
More informationRemoving Performance Bottlenecks in Databases with Red Hat Enterprise Linux and Violin Memory Flash Storage Arrays. Red Hat Performance Engineering
Removing Performance Bottlenecks in Databases with Red Hat Enterprise Linux and Violin Memory Flash Storage Arrays Red Hat Performance Engineering Version 1.0 August 2013 1801 Varsity Drive Raleigh NC
More informationBig Data Analytics. with EMC Greenplum and Hadoop. Big Data Analytics. Ofir Manor Pre Sales Technical Architect EMC Greenplum
Big Data Analytics with EMC Greenplum and Hadoop Big Data Analytics with EMC Greenplum and Hadoop Ofir Manor Pre Sales Technical Architect EMC Greenplum 1 Big Data and the Data Warehouse Potential All
More informationORACLE BIG DATA APPLIANCE X3-2
ORACLE BIG DATA APPLIANCE X3-2 BIG DATA FOR THE ENTERPRISE KEY FEATURES Massively scalable infrastructure to store and manage big data Big Data Connectors delivers load rates of up to 12TB per hour between
More informationHadoop Distributed File System. Dhruba Borthakur Apache Hadoop Project Management Committee dhruba@apache.org dhruba@facebook.com
Hadoop Distributed File System Dhruba Borthakur Apache Hadoop Project Management Committee dhruba@apache.org dhruba@facebook.com Hadoop, Why? Need to process huge datasets on large clusters of computers
More informationPCI Express and Storage. Ron Emerick, Sun Microsystems
Ron Emerick, Sun Microsystems SNIA Legal Notice The material contained in this tutorial is copyrighted by the SNIA. Member companies and individuals may use this material in presentations and literature
More informationConverged storage architecture for Oracle RAC based on NVMe SSDs and standard x86 servers
Converged storage architecture for Oracle RAC based on NVMe SSDs and standard x86 servers White Paper rev. 2015-11-27 2015 FlashGrid Inc. 1 www.flashgrid.io Abstract Oracle Real Application Clusters (RAC)
More informationCDH AND BUSINESS CONTINUITY:
WHITE PAPER CDH AND BUSINESS CONTINUITY: An overview of the availability, data protection and disaster recovery features in Hadoop Abstract Using the sophisticated built-in capabilities of CDH for tunable
More informationHP ProLiant BL660c Gen9 and Microsoft SQL Server 2014 technical brief
Technical white paper HP ProLiant BL660c Gen9 and Microsoft SQL Server 2014 technical brief Scale-up your Microsoft SQL Server environment to new heights Table of contents Executive summary... 2 Introduction...
More informationCloudera Enterprise Reference Architecture for Google Cloud Platform Deployments
Cloudera Enterprise Reference Architecture for Google Cloud Platform Deployments Important Notice 2010-2015 Cloudera, Inc. All rights reserved. Cloudera, the Cloudera logo, Cloudera Impala, Impala, and
More informationAccelerating Enterprise Applications and Reducing TCO with SanDisk ZetaScale Software
WHITEPAPER Accelerating Enterprise Applications and Reducing TCO with SanDisk ZetaScale Software SanDisk ZetaScale software unlocks the full benefits of flash for In-Memory Compute and NoSQL applications
More informationIBM System x family brochure
IBM Systems and Technology System x IBM System x family brochure IBM System x rack and tower servers 2 IBM System x family brochure IBM System x servers Highlights IBM System x and BladeCenter servers
More informationIntel RAID SSD Cache Controller RCS25ZB040
SOLUTION Brief Intel RAID SSD Cache Controller RCS25ZB040 When Faster Matters Cost-Effective Intelligent RAID with Embedded High Performance Flash Intel RAID SSD Cache Controller RCS25ZB040 When Faster
More informationBig Data With Hadoop
With Saurabh Singh singh.903@osu.edu The Ohio State University February 11, 2016 Overview 1 2 3 Requirements Ecosystem Resilient Distributed Datasets (RDDs) Example Code vs Mapreduce 4 5 Source: [Tutorials
More informationWhite Paper Solarflare High-Performance Computing (HPC) Applications
Solarflare High-Performance Computing (HPC) Applications 10G Ethernet: Now Ready for Low-Latency HPC Applications Solarflare extends the benefits of its low-latency, high-bandwidth 10GbE server adapters
More informationDEPLOYING AND MONITORING HADOOP MAP-REDUCE ANALYTICS ON SINGLE-CHIP CLOUD COMPUTER
DEPLOYING AND MONITORING HADOOP MAP-REDUCE ANALYTICS ON SINGLE-CHIP CLOUD COMPUTER ANDREAS-LAZAROS GEORGIADIS, SOTIRIOS XYDIS, DIMITRIOS SOUDRIS MICROPROCESSOR AND MICROSYSTEMS LABORATORY ELECTRICAL AND
More informationData-Intensive Computing with Map-Reduce and Hadoop
Data-Intensive Computing with Map-Reduce and Hadoop Shamil Humbetov Department of Computer Engineering Qafqaz University Baku, Azerbaijan humbetov@gmail.com Abstract Every day, we create 2.5 quintillion
More informationOracle Exadata: The World s Fastest Database Machine Exadata Database Machine Architecture
Oracle Exadata: The World s Fastest Database Machine Exadata Database Machine Architecture Ron Weiss, Exadata Product Management Exadata Database Machine Best Platform to Run the
More informationAccelerate Big Data Analysis with Intel Technologies
White Paper Intel Xeon processor E7 v2 Big Data Analysis Accelerate Big Data Analysis with Intel Technologies Executive Summary It s not very often that a disruptive technology changes the way enterprises
More informationVirtual Compute Appliance Frequently Asked Questions
General Overview What is Oracle s Virtual Compute Appliance? Oracle s Virtual Compute Appliance is an integrated, wire once, software-defined infrastructure system designed for rapid deployment of both
More informationPSAM, NEC PCIe SSD Appliance for Microsoft SQL Server (Reference Architecture) September 11 th, 2014 NEC Corporation
PSAM, NEC PCIe SSD Appliance for Microsoft SQL Server (Reference Architecture) September 11 th, 2014 NEC Corporation 1. Overview of NEC PCIe SSD Appliance for Microsoft SQL Server Page 2 NEC Corporation
More information