ARISTA DESIGN GUIDE Big Data Applications: Baseline Configurations and Recommendations




ARISTA DESIGN GUIDE Big Data Applications: Baseline Configurations and Recommendations
October 2014, Version 1.1

This document covers recommended components and tools to support Big Data applications in the data center, leveraging the Universal Cloud Network architecture. We will discuss the key elements of the design, the features that provide tight integration and visibility, and the configurations themselves. In the following parts of this series we will cover individual features in detail: their benefits, integrations, and configurations.

TABLE OF CONTENTS

INTRODUCTION
ARISTA UNIVERSAL CLOUD NETWORK OVERVIEW
  SCALABILITY AND PREDICTABILITY
  RESOURCE UTILIZATION
  RELIABILITY
  PROGRAMMABILITY AND AUTOMATION
  EXTENSIBILITY
  MOBILITY
COMPONENTS TO ENABLE BIG DATA APPLICATIONS
  OVERSUBSCRIPTION
  MONITORING
  LATENCY
  MOBILITY
  SCALABILITY
  MANAGEMENT
  AUTOMATION
BIG DATA DESIGN COMPONENTS AND CONSIDERATIONS
  SWITCH BUFFERS
  MONITORING & VISIBILITY
BEST PRACTICES
  ROUTING AT THE LEAF
  ROUTING PROTOCOL SELECTION
  HADOOP SPECIFIC CONSIDERATIONS
TOOLS, FEATURES AND INTEGRATION
  MULTI-CLI
  RAIL
  MRTRACER
  CLUSTER MONITORING INTEGRATION
SUMMARY
APPENDIX A: PRODUCT OVERVIEWS
  ARISTA SWITCHES - DESIGNED FOR BIG DATA APPLICATIONS

INTRODUCTION

Big Data and HPC applications place large demands on the network infrastructure, which in many cases is the key element that determines how well a cluster performs as a whole. It's no secret that clusters are growing in size along with the demand for analysis and processing of the data they ingest. It is imperative that the cluster, including the compute, storage and network, be properly designed to allow sufficient scale-out, automation, integration, telemetry and monitoring to support efficient operation and growth. The Arista Universal Cloud Network provides the foundation on which to build all of these capabilities.

ARISTA UNIVERSAL CLOUD NETWORK OVERVIEW

Network infrastructure is the underlying foundation of any cloud data center and must provide a platform that allows applications to operate easily, quickly and without issues. However, in recent years the nature of applications within the data center has been evolving, resulting in a significant change in network traffic profiles. Whereas in the past the majority of network traffic was in and out of the data center (so-called North-South traffic), today the majority (80%) of all information flows remain inside the data center, with network traffic passing between servers or between servers and storage systems. This traffic is often referred to as East-West flow, describing a lateral movement of information between devices in the same DC. This contrasts with North-South traffic entering or exiting the DC, which is now only 20% of the total information flow.
In addition to this shift in traffic profiles, the overall volume of application traffic has been rising exponentially, driven by the introduction of higher performance servers and storage systems, built on new generations of hardware made possible by faster processors, multi-processor/multi-core systems, bigger and faster memory, improved I/O performance and faster disk storage. The increased demand for network performance is being further exacerbated by the rapid adoption of Big Data and server virtualization, along with the increased use of network-based storage. Server virtualization brings with it new forms of network traffic as a result of workload balancing based on virtual machine live migration technologies such as VMware's vMotion. As a result, server network connections have been rapidly migrating from 1Gbps to 10Gbps to avoid server bottlenecks.

One of the quick fixes often used to try to solve application performance problems has been to simply add bandwidth to the network, often through the in-line replacement of old network equipment with faster devices and interconnects, but still deployed in a traditionally architected three-tier network. While this approach scales the bandwidth somewhat, it still assumes a traditional North-South bias to traffic flows and does nothing to address the fundamental shift in application workloads, the demands of server virtualization, or the increased use of network-based storage. At best, this approach is a stopgap solution and ultimately fails to address the network scaling and performance challenges created by these application architectures. What is required is a holistic approach to deploying the next generation Universal Cloud Network (UCN), taking into account all application requirements for today and the foreseeable future. Here are some of the key factors contributing to the rapid adoption of a new generation of networks based on the UCN architecture.

SCALABILITY AND PREDICTABILITY

The shift in requirements driven by new application architectures and increased server and storage system performance has dramatically changed the demands placed on the network infrastructure. Applications demand consistent and predictable performance across the entire width of the data center. While a large flat layer 2 network using three-tier switch topologies seemed to offer a potential solution, it has proved to be only a temporary fix and ultimately introduced many new issues and additional constraints. To solve these challenges, Arista uses a new approach that advocates a predictable two-tier leaf and spine architecture and the use of N-way (currently up to 64-way) ECMP (Equal-Cost Multi-Path) IP routing, while continuing to support and develop layer 2 based solutions using Multi-Chassis Link Aggregation (MLAG) where the scaling demands are less severe. These solutions provide the elasticity within a UCN to either grow or shrink depending on the demands of the business.

RESOURCE UTILIZATION

Legacy topologies often depend on Spanning Tree Protocol (STP) or other mechanisms that use active-passive links to ensure layer 2 loop prevention between switching tiers. The net result and main disadvantage of these traditional loop-free topologies is that, under normal operating conditions, 50% of resources remain idle as they are held in reserve in a hot standby condition. One solution is to use a Virtual Local Area Network (VLAN) distribution model to create a degree of load balancing.
With this approach, sets of VLANs use different spanning tree root bridges in order to allow all paths to be utilized; e.g. a spanning tree for odd VLAN IDs is configured with the root bridge on one switch (with the other switch acting as secondary root), while even VLAN IDs are set up with the root bridge assignments reversed. While this approach provides a level of load balancing, it increases complexity and the chance of configuration error. Arista provides standards-based Multi-Chassis Link Aggregation (MLAG) solutions or ECMP UCN designs, which provide active-active load balancing across all links in the network infrastructure, effectively doubling the Data Center (DC) capacity without deploying any more infrastructure or introducing additional configuration and management complexity.

RELIABILITY

Universal Cloud Networks by definition (they are Universal, after all) demand reliability as a core tenet of their design. Arista's UCN architecture for the data center provides redundancy at the link and device level, using standards-based protocols for interoperability and stability.

Reliability of the physical device is increased dramatically by using an operating system, Arista EOS (Extensible Operating System), that is fault tolerant, self-healing, and capable of accelerated software upgrades (ASU).

PROGRAMMABILITY AND AUTOMATION

The majority of downtime experienced in networks is a result of human configuration errors. The saying "if it isn't broke, don't fix it" has become commonplace following weekend changes or service enablement (especially among network operations managers). Coordination between different technical groups such as system, network, and security administrators is a time consuming, costly, and often error prone process which must be undertaken any time a new service is deployed in a network. The burgeoning network orchestration space promises to address the challenges faced in traditional operations models by creating well defined and tested workflows, which can enable network deployments at the touch of a button. However, programmability and automation are only possible with an open Linux-based architecture. Arista's EOS is the only proven network operating system which is truly open and programmable to enable automation, and it offers the potential to be tightly integrated into the wider server and storage DevOps workflow processes.

EXTENSIBILITY

The simple concept known as "Linus's Law," named after Linus Torvalds, the creator of Linux, rings true not only for networking protocols but for operating systems as well. The quote is simple and straightforward: "Given enough eyeballs, all bugs are shallow"; that is to say, the more people who see and test code, the more likely it is that flaws will be caught and fixed quickly. This fundamental concept, when applied to next generation networks, is a game changer. The network is no longer a collection of complex specialized devices that require their own management systems, but is now a collection of servers with very specific functions.
Once this approach is embraced, the true potential of the UCN can be exposed, in that switches can be operated in the same way, using the same principles and the very same tools as servers. By using Arista's EOS, a UCN can be tailored to the organization and customized for a particular business requirement. The possibilities of this type of extensibility are now limited only by the network operator's creativity.

MOBILITY

Due to the widespread adoption of server virtualization, all UCNs require inherent support for workload mobility within and, in some cases, between data centers. Simply put, the network infrastructure must inherently provide hardware and software support for workload mobility. Data center mobility can be described in various ways:

Facilities: Overcoming facilities challenges, for example when a rack has run out of space but there is still a business requirement for additional application elements to be layer 2 adjacent (i.e., within the same layer 3 IP subnet). Traditional mechanisms for stretching VLANs across the data center are fraught with challenges and risk. In this situation, network virtualization offers the option of placing servers in a physically disparate location and joining them over a virtual overlay network. Using Arista's hardware-based Virtual eXtensible Local Area Network (VXLAN) solution can overcome many facilities challenges.

Migration to New Physical Infrastructure: Traditionally, the move of applications from old to new infrastructure involves intrusive change controls to tear down the old, build the new infrastructure and then migrate applications. Network virtualization solutions allow simultaneous bridging of the new and old domains and moving the applications in real time, thereby providing flexibility.

Pooling of Resources: Rather than having to place firewalls or load balancers in each rack, resources can be consolidated and pooled together. Using Arista's VXLAN offering, resources can be pooled and consolidated, thereby reducing Capital Expenditure (CapEx) and Operational Expenditure (OpEx) costs.

Additionally, infrastructure should be scalable to provide overlay networks for VM mobility and VLAN scale, so that the limitations of a legacy design are never encountered. Enter VXLAN: a better way to do L2 at scale.

COMPONENTS TO ENABLE BIG DATA APPLICATIONS

Big Data applications are very demanding on the network as they often present many-to-one traffic flows, otherwise known as incast. This presents a challenge to the network and operators that demands an understanding of how the network is being utilized.

OVERSUBSCRIPTION

Oversubscription is the ratio of contention should all devices send traffic at the same time.
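To make the ratio concrete, it can be computed as host-facing bandwidth divided by uplink bandwidth. The sketch below uses hypothetical leaf port counts (48 x 10GbE down, 4 x 40GbE up), not a specific Arista platform:

```python
def oversubscription_ratio(down_ports, down_gbps, up_ports, up_gbps):
    """Contention ratio if every host-facing port transmits at line rate."""
    return (down_ports * down_gbps) / (up_ports * up_gbps)

# Hypothetical leaf: 48 x 10GbE to servers, 4 x 40GbE uplinks to the spine
print(oversubscription_ratio(48, 10, 4, 40))  # 3.0, i.e. a 3:1 ratio
```

A result of 1.0 would indicate a non-blocking leaf; as noted below, legacy multi-tier designs can reach 20:1 or worse.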
It can be measured in a North-South direction (traffic entering/leaving a data center) as well as East-West (traffic between devices in the data center). Many legacy data center designs have very large oversubscription ratios, upwards of 20:1 for both North-South and East-West, because of the large number of tiers and the limited density/ports in the switches. This architecture proved adequate for legacy client-server applications and dedicated storage architectures, and because of the historically much lower network performance of servers. However, this highly oversubscribed, multi-hop model does not meet the needs of distributed workload computing with large data sets, server virtualization, and the network-based storage systems that now form the backbone of many modern business applications.

MONITORING

Microbursts caused by traffic patterns such as those seen in TCP incast situations are invisible at the granularity afforded by traditional network monitoring methods such as SNMP polling. Low polling intervals, even as low as one second, do not work because they are not only detrimental to system performance (i.e., because the CPU needs to process the agent's requests) but also nowhere near the level of frequency required to detect a microburst that may be only a few milliseconds in duration. However, such microbursts can increase latency and ultimately result in packet discards that require retransmission, impacting application performance and data integrity. It is therefore extremely important to have tools that allow microbursts to be detected and their root cause to be identified.

LATENCY

Minimizing latency in traditional data center designs is not a practical goal. Where low latency is required, the number of switch hops must be reduced. Every network node and every link adds latency between devices. With six or more switch hops between servers being quite common in traditional data center designs, the corresponding end-to-end latency can be intolerably high. Furthermore, with heavy reliance on a core switching infrastructure with relatively low numbers of ports, and thus high oversubscription ratios, additional queuing latency caused by congestion becomes more likely, compounding the issues already intrinsic to designs with large numbers of hops between nodes.

MOBILITY

Virtual server mobility often requires the cluster members to reside in a common IP subnet, in turn requiring layer 2 connectivity between hosts. Many legacy switches lack the capability to extend and scale the number of layer 2 networks without resorting to proprietary, lock-in technologies.
The other approach is to stretch VLANs beyond the normally accepted boundaries. Every time a VLAN must be stretched beyond the confines of recommended best practice, the network architecture is compromised and, as a result, is placed at greater risk of an outage due to human error such as misconfiguration or a protocol issue (e.g. spanning tree). Furthermore, the VLAN numbering scheme imposed on the original, partitioned data center design can result in overlapping or duplicate VLAN IDs being present, causing further complications and requiring VLAN IDs to be re-assigned or VLAN translation to be configured. Simply put, legacy architectures built on legacy hardware are unable to support the mobility requirements of virtual server infrastructures.
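The VLAN ID exhaustion described above stems from the 12-bit VLAN ID field (4094 usable segments). VXLAN, mentioned earlier as the way to do L2 at scale, carries a 24-bit VXLAN Network Identifier (VNI) instead, giving roughly 16 million segments. A minimal sketch of packing and unpacking the 8-byte VXLAN header as defined in RFC 7348:

```python
import struct

VXLAN_FLAGS = 0x08  # "I" flag set: a valid VNI is present (RFC 7348)

def pack_vxlan_header(vni):
    """Build the 8-byte VXLAN header: flags byte, 3 reserved bytes,
    24-bit VNI, 1 reserved byte."""
    assert 0 <= vni < 2**24
    return struct.pack("!B3xI", VXLAN_FLAGS, vni << 8)

def unpack_vni(header):
    """Recover the VNI from a packed VXLAN header."""
    _, word = struct.unpack("!B3xI", header)
    return word >> 8

hdr = pack_vxlan_header(5000)
print(len(hdr), unpack_vni(hdr))  # 8 5000
```

The 24-bit VNI is what removes the 4094-segment ceiling; the header itself adds only 8 bytes on top of the outer UDP/IP encapsulation.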

SCALABILITY

In a legacy DC design, scale was typically considered to be a cascading function of oversubscription ratios from core to access layers. In order to scale the number of attached devices, ever increasing numbers of core, distribution and access branches were added, significantly increasing the capital cost and operational complexity of the network. With traffic patterns in the data center having migrated from the traditional North-South pattern to a prevalent East-West profile, data center traffic now flows predominantly between devices attached to the access tier. As a result, in traditional data center network architectures, traffic has to climb up and down the core, distribution and access trees multiple times. This legacy approach to network design is no longer practical or sustainable.

MANAGEMENT

Management of the data center is a mammoth task. The network components of a UCN must be equipped with the necessary tools to solve problems that legacy infrastructure could not. Low cost tap aggregation, microsecond buffer visibility, and automated deployment mechanisms that enable the data center to scale with minimal operational costs must be available to the engineering and operations teams. Without such tools, data centers will continue to be expensive and cumbersome to operate. Furthermore, the switching operating system should provide for extensibility so that customers can add the functionality required to handle their particular operational needs.

AUTOMATION

Heavy reliance on spanning tree requires extensive manual planning, from assigning VLAN IDs, to determining which ports can actually be used, to capacity planning for a platform or link. This extensive engineering is counter to the concept of automated deployments.
In the case of a virtualized data center, not only does the administrator need to determine whether a VLAN exists on the desired target switch, but also whether it can even be present at that point in the network. The rigidity resulting from legacy designs is not compatible with a heavily automated virtualized data center.

BIG DATA DESIGN COMPONENTS AND CONSIDERATIONS

SWITCH BUFFERS

Big Data applications, by their very nature, can generate large volumes of network traffic that can be both transient and asymmetric in nature. Even though overall volumes of data transfer may be moderate, short-term large transfers can take place during the key phases of MapReduce processing. Likewise, this traffic can have a profile with a large number of senders and a relatively small group of receivers, creating an incast traffic profile that can easily overwhelm the buffering capacity of normal data center switches, resulting in traffic being discarded and the need for retransmission. The net effect of lost data can be a longer time to complete MapReduce workloads and inconsistent workload performance, ultimately reducing the timeliness and effectiveness of any information generated.

For these reasons, and based on real world experience building networks for Big Data applications, Arista recommends using the Arista 7500E as a Spine switch for Big Data networks. The Arista 7500E modular chassis switch has exceptionally large buffers, capable of absorbing large traffic bursts without discarding packets. Each packet processor on the 7500E has 3GB of off-chip buffer memory (in addition to the 2MB on-board). Depending on the version of line card used, there is either 9GB (i.e., 3 packet processors) or 18GB (i.e., 6 packet processors) of packet buffer per module. The 7500E has a cell based switch fabric capable of delivering 3.84Tbps of switching capacity per line card and employs a Virtual Output Queuing (VoQ) architecture for lossless switching. Table 1 shows the buffer scale for the Arista 7500E switches.

Table 1: Default per-VoQ Output Port Limits

Output Port Characteristics    Maximum Packet Queue Depth    Maximum Packet Buffer Depth
VoQ for a 10G output port      50,000 packets                50 MB
VoQ for a 40G output port      200,000 packets               200 MB
VoQ for a 100G output port     500,000 packets               500 MB

In light of the growth in the adoption rate of Big Data applications across all organizations, and the increase in scale of these solutions, Arista has recently introduced the Arista 7280SE family of 1RU switches, which are based on the same switching silicon and buffering architecture as the 7500E. With up to 72 10Gbps ports and 9GB of buffering, the 7280SE is an ideal choice as a Spine switch for smaller scale environments, or as a leaf switch in very demanding situations where buffer scale is warranted at both leaf and spine.

MONITORING & VISIBILITY

Predictable time to completion is a key operational metric for any Big Data solution.
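When interpreting buffer and queue-depth figures like those in Table 1, it helps to convert megabytes of buffer into time at line rate, i.e. how long a port can absorb a full-rate incast burst before discarding. A rough back-of-the-envelope sketch (ignoring cell overhead and any concurrent drain on the port):

```python
def burst_absorption_ms(buffer_mb, port_gbps):
    """Milliseconds of line-rate excess traffic a buffer of buffer_mb (decimal MB)
    can absorb on a port of the given speed."""
    buffer_bits = buffer_mb * 8 * 10**6
    return buffer_bits / (port_gbps * 10**9) * 1000

# Per Table 1: a 10G port with a 50 MB VoQ can soak up roughly 40 ms of burst
print(burst_absorption_ms(50, 10))  # 40.0
```

Note that all three rows of Table 1 work out to about 40 ms of absorption at line rate, which is the useful intuition: the per-port buffer scales with port speed.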
Therefore, the ability to identify and promptly rectify any network performance issues that can affect job run times is key to maintaining the viability of any solution. Arista takes the instrumentation of the network very seriously and has built extensive network analysis and diagnostic tools into its switches. For example, Arista Latency ANalyZer (LANZ) on the Arista 7150S family of switches provides real time monitoring of switch queue occupancy levels and can be configured to provide warnings when given threshold levels are exceeded. As well as being logged, these warnings can be streamed to external tools to provide detailed, real-time alerting of a potential network congestion condition. The command-line interface can also be used to drill deeper into LANZ information, further aiding in the identification of the root cause of the nascent congestion. All this allows for potential issues to be identified and remedial action to be planned and performed well in advance of any direct impact on application performance.

There are many advanced features within the LANZ toolset designed to assist in diagnosing the root cause of network performance issues. For example, the 7150S advanced mirroring capabilities allow traffic that is contributing to high queue occupancy to be automatically copied to an external port or the switch CPU, where tools such as tcpdump and Wireshark can be used to perform in-depth protocol analysis of the traffic involved. In addition, the 7500E and 7280 series switches provide LANZ-Lite functionality to provide insight into buffer utilization and hot spots within the network/cluster. These hot spots may indicate an unbalanced cluster, which may prompt the reassignment of resources.

BEST PRACTICES

ROUTING AT THE LEAF

Hadoop is a layer 3 aware filesystem; there is no real benefit to the user in building a large layer 2 network infrastructure. A layer 3 solution at the top of rack provides several benefits. First, it limits the broadcast and fault domain to a single rack. It is customary to keep three copies of each data set in a Hadoop cluster; limiting your fault domain to a single rack ensures the third copy of a particular data set is unaffected by a problem that is affecting the primary two copies. Routing at the top of rack also allows for easy horizontal growth of your cluster. You may start out with just a handful of racks, but starting out with a subnet per rack allows you to easily and consistently add on to the cluster.
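The subnet-per-rack pattern lends itself to templated provisioning: each new rack gets the same configuration with only the rack number changing. A hedged sketch of such a generator is below; the addressing plan (10.0.N.0/24 per rack), private AS numbering, and interface naming are illustrative assumptions, not a recommended Arista design:

```python
def leaf_config(rack, base_as=65000):
    """Render an EOS-style routed top-of-rack config for rack N.

    Hypothetical plan: rack N hosts subnet 10.0.N.0/24 on Vlan100
    and runs eBGP in its own private AS toward the spine.
    """
    return "\n".join([
        f"hostname leaf-rack{rack}",
        "interface Vlan100",
        f"   ip address 10.0.{rack}.1/24",
        f"router bgp {base_as + rack}",
        f"   router-id 10.0.{rack}.1",
        f"   network 10.0.{rack}.0/24",
    ])

# Adding rack 3 to the cluster is just leaf_config(3)
print(leaf_config(3))
```

Because every rack is stamped from the same template, adding capacity never requires re-engineering the existing racks.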
Another important benefit of routing at the top of rack is the ease of troubleshooting. How many of us have spent too much time tracing a MAC address from switch to switch during a hot troubleshooting moment? There is no need to be in that position when supporting a Big Data cluster.

ROUTING PROTOCOL SELECTION

The choice of routing protocol has several considerations as well. If your cluster is not very large and you are confident that it won't grow to more than a few thousand nodes, then OSPF is a fine choice. Not many of us will complain about its ease of configuration. Utilizing stub networks at each leaf device is another benefit, as it decreases the chattiness of the protocol by limiting LSA advertisements. The other obvious choice of routing protocol is BGP. While a bit more complicated to configure, it has many benefits to consider. BGP can easily scale from a single rack to hundreds of racks and many thousands of nodes. Even if you start out with a small implementation, making the right decision in the beginning can save huge headaches later on as the environment grows. While BGP offers unprecedented scalability and performance, it also offers a large number of configuration options that allow the network engineer to influence traffic in any way they see fit. In any case, either of these solutions follows an open standard which allows for vendor interoperability, avoiding lock-in to a proprietary solution. Proprietary solutions can seem tempting at first, but they can certainly cause problems later if you are looking to get out of them, or even just looking for leverage when negotiating pricing for expansion.

HADOOP SPECIFIC CONSIDERATIONS

For large Hadoop/Big Data implementations that span many racks, it is vital to ensure that copies of each particular data set exist on different racks. Hadoop rack awareness provides the ability to dynamically map the Hadoop cluster to the network topology. This feature provides automatic block replication and balancing. Furthermore, rack awareness can help keep most bursty traffic local to the rack rather than unnecessarily consuming bandwidth to the spine, on which you may or may not have oversubscription depending on your architecture. Customers of Arista have the added benefit of loading an EOS extension onto their switches that will automatically update the Hadoop rack awareness configuration. It is very important to keep the rack awareness configuration up to date, as it ensures data is distributed properly and no single point of failure exists.

TOOLS, FEATURES AND INTEGRATION

MULTI-CLI

Some of the challenges in large data center networks used for Big Data applications are maintenance, troubleshooting and availability. This requires tools to manage the operational status and performance of the network. It also requires the ability to find issues and their root cause quickly, and to resolve them.
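The Hadoop rack awareness mapping discussed above is typically supplied to Hadoop via a topology script (configured with net.topology.script.file.name) that maps each DataNode address to a rack path. With a subnet-per-rack design this mapping can be computed directly from the address; the third-octet-equals-rack convention below is an assumption for illustration, not a Hadoop or Arista default:

```python
#!/usr/bin/env python3
import sys

def rack_for(host_ip):
    """Map 10.0.<rack>.<host> to Hadoop's /rack-<rack> path; anything
    outside the assumed scheme falls back to a default rack."""
    octets = host_ip.split(".")
    if len(octets) == 4 and octets[0] == "10":
        return f"/rack-{octets[2]}"
    return "/default-rack"

if __name__ == "__main__":
    # Hadoop invokes the script with one or more addresses as arguments
    print("\n".join(rack_for(arg) for arg in sys.argv[1:]))
```

Because the script derives racks from the addressing plan itself, it stays correct as racks are added, which addresses the requirement that the rack awareness configuration be kept up to date.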
One way that Arista has approached managing the network is to provide new ways to interact with the switches through the Multi-CLI capability, called CloudVision. Arista pioneered the use of XMPP as a messaging bus across Arista switches running EOS. Multi-CLI is the ability to use an open XMPP client, supported on the native Linux operating system, to quickly access and interact with Arista switches individually or in groups. In large HPC and Big Data data centers, the collection of nodes runs as one large collective supercomputer. Multi-CLI can help keep all of the elements working and increase overall uptime and availability by reducing the time it takes to access the network infrastructure, get the right information to find issues, and tune the network for top performance. The first component necessary to use the Multi-CLI feature is an XMPP server. For most applications we recommend the open-source ejabberd server on Linux, which scales to very high message rates and total connection counts. It also

provides a very flexible security model. The next component is to enable the XMPP client on the switches. The switches will then appear as individual XMPP contacts that can be sent messages. This feature is usually used over the management Ethernet interface; however, it is also supported in-band across the switch. AAA, management-plane ACLs and policing can be used to secure the feature and expose it only to authorized networks and administrators. Once enabled, administrators can point the instant messaging application of their choice at the XMPP server and will be able to see all of the switches in their contact list. Switches can also belong to switch groups, grouping them by location, switch type, or role in the network.

Figure 1: Arista switches showing up as contacts in Adium, with the output of show version echoed back.

Figure 2: Multi-CLI example on the CLI of an Arista switch, seeing other XMPP-enabled switches and the output of show version echoed back.

To enable Multi-CLI (CloudVision), perform the following steps. This assumes you already have an XMPP server set up.

veos>enable
veos#config t
veos(config)#management xmpp
veos(config-mgmt-xmpp)#username <switch_username>@<domain_name> password <password>
veos(config-mgmt-xmpp)#server <hostname IP_addr>
veos(config-mgmt-xmpp)#domain <domain_name>

veos(config-mgmt-xmpp)#no shutdown

The switch should now have the Multi-CLI capability and be connected to the XMPP server. This can be verified by entering the following command.

veos1(config-mgmt-xmpp)#show xmpp status
XMPP Server: 10.100.1.100 port 5222
Client username: veos1@aristademo.com
Default domain: aristademo.com
Default privilege level for received commands: 15
Connection status: connected

Commands can now be sent to the switch, or from the CLI of this switch to another. To set up switch groups, which enable the switch to receive messages sent to groups of switches, enter the following command. To configure additional switch groups, re-enter the command with each group name the switch will be a part of.

veos1(config-mgmt-xmpp)#switch-group <name>@conference.<domain_name>

Once switches belong to a switch group, you can join that group, see the switches in it, and send commands that are executed by each switch with the output returned, as in the example below.

Figure 3: A group chat session with switches that belong to a switch group. A single command is entered and each switch returns any output generated.

RAIL

Rapid Automated Indication of Link-loss (RAIL) is a feature Arista created to rapidly indicate server failure. In many HPC and Big Data applications, such as Hadoop, a job tracker or resource manager manages the tasks processed on the nodes doing the actual work. The job tracker takes in requests and assigns the work to the relevant nodes that have, or are near, the data being processed. This process is used to massively parallelize the work across very large data sets using many compute nodes. In a perfect scenario, this process is very efficient and fast in returning results, be it a search, a pattern match, or a count. However, if one or more

nodes, of the possibly thousands being used in a task, go down, this can add delay in getting the results to the client application. Arista developed the RAIL feature to signal back to clients, or to the job tracker in the case of Hadoop, when directly attached nodes go down. Arista switches with RAIL enabled accomplish this by tracking servers, or nodes, that use the switch as their layer-3 default gateway and thus have entries in the MAC and ARP tables of the switch. A job tracker waiting for nodes to complete a task and deliver results will receive TCP resets if the link to a node goes down; the switch sends ICMP unreachables for non-TCP traffic destined to the failed server. The job tracker, or any host with an active TCP session or IP conversation to servers connected to the Arista switch, will quickly drop the open connection and rapidly fail over. This is much faster than waiting for protocol timers to expire. In the case of a job tracker, it can quickly reassign the work to other nodes and complete the task faster. Similar behavior is seen with HBase failures, where recovery drops from seconds to milliseconds.

Figure 4: RAIL Operation

To use the RAIL feature, the network must be designed with Layer 3 at the Top of Rack (ToR) switches. The ToR switches need to be the default gateway of the servers in order to proxy for the attached servers and send out the TCP reset and ICMP unreachable messages. Assuming the network is designed this way, the following steps are used to enable RAIL.
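As context for the steps that follow, the ToR must already be routing and serving as the attached servers' default gateway, so that the servers have entries in its MAC and ARP tables. A minimal sketch of that prerequisite configuration is shown below; the VLAN number and addressing are assumptions for illustration only.

```
switch(config)# ip routing
switch(config)# interface Vlan10
switch(config-if-Vl10)# ip address 10.1.0.1/16
switch(config-if-Vl10)# no shutdown
```

With the ToR acting as gateway, RAIL has the MAC and ARP state it needs to track each server and to proxy for it on failure.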

switch(config)# monitor server-failure
switch(config-server-failure)# no shutdown
switch(config-server-failure)# proxy lifetime <1-10080>
switch(config-server-failure)# network <Local_IP_Subnet>
switch(config-server-failure)# exit
switch(config)# interface Ethernet1
switch(config-if-et1)# monitor server-failure link
switch(config)# interface Port-Channel 34
switch(config-if-po34)# monitor server-failure link

The proxy lifetime timer stops the switch from acting as a proxy for servers that have failed and either come back online or have been down for a long period of time. The network command tells the switch which directly connected subnets contain servers to monitor. Next, the switch is configured to monitor link state on the individual Ethernet and Port-Channel interfaces. To verify the operation of the RAIL feature, use the following command, which shows the status and settings of the feature along with the networks configured for monitoring.

switch(config)#show monitor server-failure
Server-failure monitor is enabled.
Proxy service: enabled
Proxy lifetime: 10 minutes
Networks being monitored: 3
10.1.0.0/16 : 3 servers
44.11.11.0/24 : 1 servers
132.23.23.0/24 : 1 servers

Next, the servers that the switch is monitoring can be seen using the following command. The IP and MAC addresses are derived from the MAC and ARP tables.

switch(config)#show monitor server-failure servers all
Total servers monitored: 5
Server IP    Server MAC         Interface      State     Last Failed
----------   ------------       -----------    -------   -------------
10.1.67.92   01:22:ab:cd:ee:ff  Ethernet17     inactive  7 days, 12:47:48 ago
44.11.11.7   ad:3e:5f:dd:64:cf  Ethernet23     down      0:06:14 ago
10.1.1.1     01:22:df:42:78:cd  Port-Channel6  up        4:38:01 ago
10.1.8.13    01:33:df:ee:39:91  Port-Channel5  proxying  0:10:31 ago
132.23.23.1  00:11:aa:bb:32:ad  Ethernet1      up        never

More details can be seen by zeroing in on a specific server using the following command.
switch(config)#show monitor server-failure servers 44.11.11.7
Server information:
Server Ip Address : 44.11.11.7
MAC Address : ad:3e:5f:dd:64:cf
Current state : down
Interface : Ethernet23
Last Discovered : 2013-01-06 06:47:39
Last Failed : 2013-02-10 00:07:56
Last Proxied : 2013-02-10 00:08:33
Last Inactive : 2013-02-09 23:52:21

Number of times failed : 3
Number of times proxied : 1
Number of times inactive : 18

MRTRACER

Arista created the MapReduce Tracer feature to provide Big Data administrators visibility into the operational status, job status, and performance of large Hadoop clusters. Hadoop consists of two main elements: MapReduce and the Hadoop Distributed File System (HDFS). MapReduce performs a data-mapping function followed by a reduce function to yield a consolidated result to the client application. Both processes are controlled by the JobTracker and performed by the TaskTrackers. HDFS is a distributed file system with built-in replication for data redundancy. The NameNode manages the placement and location of data stored across the cluster's DataNodes.

Figure 5: This example demonstrates how the JobTracker interacts with the TaskTracker functions on each compute node.

MapReduce Tracer tracks and interacts with Hadoop nodes directly connected to Arista switches. It provides valuable insight into the operation of Hadoop clusters and their workloads for both network and big data administrators. The feature focuses on MapReduce and does not monitor HDFS components directly (although it is aware of HDFS traffic). Information provided includes:

- Hadoop cluster status, showing JobTracker name and number of active TaskTrackers
- Network location (e.g., switch/interface) and state of TaskTrackers
- Breakdown of running jobs, with detailed insight into server activity and network workload:
  o Job name, job type (i.e., Map or Reduce) and percentage complete
  o Bytes in, bytes out, etc.
  o Shuffle activity
  o HDFS-related workload

  o MapReduce job history, including job type, run times to completion and network traffic load
  o Traffic burst information, with burst sizes and times of bursts

MapReduce Tracer communicates with the JobTracker using Hadoop RPC, without the need for server plug-ins or agents to be installed on the nodes. MapReduce Tracer then communicates with the local TaskTrackers on the nodes using HTTP to retrieve their status.

Figure 6: This example shows how MapReduce Tracer interacts with Hadoop.

To enable MapReduce Tracer, follow the steps below. This assumes the network is configured with Layer 3 routing at all ToR switches and the default gateway on the servers pointing to their respective ToR switch. First the switch is configured to monitor Hadoop; then the cluster name is configured and the switch is pointed at the JobTracker.

Switch#configure
Switch(config)#monitor hadoop
Switch(config-monitor-hadoop)#cluster VLAB
Switch(config-monitor-hadoop-VLAB)#jobtracker host hadoop101
Switch(config-monitor-hadoop-VLAB)#jobtracker username hduser
Switch(config-monitor-hadoop-VLAB)#no shutdown
Switch(config-monitor-hadoop-VLAB)#exit
Switch(config-monitor-hadoop)#no shutdown
Switch(config-monitor-hadoop)

To verify MapReduce Tracer, the following show commands confirm that the feature is getting information from the JobTracker and TaskTrackers, and display the information that MapReduce Tracer provides.

Switch#show monitor hadoop counters
Last updated: 2014-02-22 14:22:38
Counters for running jobs:
JobId Job Name Cluster Bytes In Bytes Out Start Time

----- ----------- -------- --------- --------- -------------
11 word coun VLAB 12.75MB 939.99KB 2014-02-23 10:31:54
4 TeraGen VLAB 167MB 305.93MB 2014-02-22 11:50:03
Note: these counters are derived from Hadoop counters and represent approximate network bandwidth utilization

Switch#show monitor hadoop traffic burst
Last updated: 2014-02-22 14:48:42
Bursts on Interface: 'Ethernet6' in cluster: VLAB
Input bursts:
No input bursts available
Top 5 output bursts:
JobId Job Name Burst Time
----------- -------------- ------------ -------------------
4 TeraGen 2.25MB 2014-02-22 13:21:38
4 TeraGen 2.09MB 2014-02-22 13:19:39
4 TeraGen 1.97MB 2014-02-22 13:35:30
4 TeraGen 1.97MB 2014-02-22 14:06:01
4 TeraGen 1.71MB 2014-02-22 13:24:06
Bursts on Interface: 'Ethernet4' in cluster: VLAB
Top 1 input bursts:
JobId Job Name Burst Time
----------- -------------- ----------- -------------------
7 TeraGen 83 2014-02-22 13:40:05

Switch#show monitor hadoop tasktracker interface ethernet 4 counters
Last updated: 2014-02-23 10:45:51
Running job for TaskTracker: hadoop101
JobId Job Name Cluster Bytes In Bytes Out Start Time
-------- ------------- ---------- ----------- ------------- ------------ -------
1 word count VLAB 12.75MB 939.99KB 2014-02-23 10:31:54
2 TeraGen VLAB 164 1.39MB 2014-02-23 10:40:44

CLUSTER MONITORING INTEGRATION

An important aspect of Big Data clusters that is often overlooked is the need for integrated monitoring and telemetry. To completely understand the performance and health of a cluster, each of its components must be monitored and the data correlated. This means compute and its subcomponents (CPU, RAM, disk I/O, etc.), the network, and the other services requisite to the operation of the cluster.
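As a concrete example of the kind of host-side telemetry worth correlating, the sketch below derives a TCP retransmission rate, similar to the data plotted in Figure 7, from two samples of the Tcp: counters that the Linux kernel exposes in /proc/net/snmp. This is a minimal sketch, not part of any Arista tooling: the OutSegs and RetransSegs field names are standard Linux kernel counters, and the sampling loop and export to a tool such as OpenTSDB or Ganglia are left to the monitoring stack of choice.

```python
def parse_tcp_counters(snmp_text):
    """Return the Tcp: rows of /proc/net/snmp as a {field: value} dict.

    /proc/net/snmp holds pairs of lines per protocol: one with field
    names and one with values. We keep only the Tcp: pair.
    """
    tcp_lines = [l for l in snmp_text.splitlines() if l.startswith("Tcp:")]
    headers = tcp_lines[0].split()[1:]            # field names
    values = [int(v) for v in tcp_lines[1].split()[1:]]  # counter values
    return dict(zip(headers, values))

def retransmission_rate(sample_a, sample_b):
    """Percentage of TCP segments retransmitted between two samples."""
    a, b = parse_tcp_counters(sample_a), parse_tcp_counters(sample_b)
    sent = b["OutSegs"] - a["OutSegs"]
    retrans = b["RetransSegs"] - a["RetransSegs"]
    return 100.0 * retrans / sent if sent else 0.0
```

On a live node, each sample would come from `open("/proc/net/snmp").read()` taken at a fixed interval; plotting the resulting rate per data node is one way to reproduce a chart like Figure 7.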

It is important to monitor the network at both the switch and the compute nodes, and to correlate the data to determine if, and where, you may be having issues. TCP stack performance is often overlooked or simply not considered when monitoring compute elements in the cluster. However, TCP performance can provide important indicators of problems. For instance, a compute node may be encountering TCP stack performance issues while at the same time experiencing high user-space to kernel CPU utilization. This may indicate that there is too much load at the user-space level and that a reduction in thread processing is required. However, if loss within the network is also being logged, the problem may exist within the network itself.

Figure 7: Example TCP retransmission rate data collected from a Hadoop data node

There are many open-source and commercial tools and resources available to leverage. Examples include Splunk, OpenTSDB, Ganglia, Nagios, etc. Which tool is chosen routinely comes down to user preference or operational factors. Regardless of the application selected, it is important that a data correlation tool be used to quickly and easily identify performance problems within a cluster.

SUMMARY

Big data presents a very interesting application problem that crosses the lines between network, storage, and application. Big data can provide tremendous insight and tremendous business value. A properly architected big data cluster can deliver disruptive and important knowledge to traditional enterprise customers. However, applying legacy architectural

principles with massive oversubscription and minimal buffering will degrade the big data cluster and in turn reduce its effectiveness. Arista is committed to supporting big data clusters the way they were designed to operate: with a non-blocking, deep-buffered, high-speed data center network. This, coupled with Arista's EOS, the world's most advanced network operating system, allows best-in-class native integration with popular big data distributions such as Hadoop.
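To make the oversubscription concern above concrete, here is a quick worked calculation for a hypothetical leaf switch; the port counts are illustrative, not a specific Arista SKU.

```python
# Oversubscription ratio of a hypothetical leaf switch:
# 48 x 10GbE server-facing ports versus 4 x 40GbE uplinks.
server_gbps = 48 * 10   # 480 Gbps of downlink capacity
uplink_gbps = 4 * 40    # 160 Gbps of uplink capacity
ratio = server_gbps / uplink_gbps
print(f"{ratio:g}:1 oversubscribed")  # prints "3:1 oversubscribed"
```

Under a sustained incast, such a leaf can drain toward the spine only a third of what its servers can offer, which is exactly the situation that deep buffers and non-blocking designs are meant to avoid.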

APPENDIX A: PRODUCT OVERVIEWS

ARISTA SWITCHES - DESIGNED FOR BIG DATA APPLICATIONS

In designs where deep buffers are a key requirement to handle the asymmetric incast workloads and bursty traffic profiles associated with Big Data applications, the Arista 7500E is recommended alongside the 7280 series switches in a spine/leaf configuration.

Arista 7500E Chassis

Designed for large virtualized data centers and cloud networks, the Arista 7500 Series modular switches are the industry's highest-performance data center switches. Available in a compact 7RU (4-slot) or 11RU (8-slot) form factor, they combine scalable L2 and L3 resources with advanced features for network monitoring, precision timing and network virtualization to deliver scalable and deterministic network performance for mission-critical data centers.

Figure 8: Arista 7500E Spine

The Arista 7500E is the second generation of the 7500 Series and delivers seamless upgrades, ensuring investment protection for first-generation fabrics, linecards and common equipment while setting a new standard for performance, density, reliability, and power efficiency. The Arista 7500E Series offers over 30Tbps of total capacity for 1,152 ports of 10GbE or 288 ports of 40GbE, and supports 96 ports of wire-speed 100GbE using integrated optics that allow flexible combinations of 10G, 40G and 100G modes on a single interface. With front-to-rear airflow and redundant, hot-swappable supervisor, power, fabric and cooling modules, the system is purpose-built for data centers. The 7500E Series is energy efficient, with typical power consumption of less than 4 watts per port for a fully loaded chassis. All of these attributes make the Arista 7500E an ideal platform on which to build a reliable, low-latency, resilient and highly scalable data center network.

Arista 7500E Line Cards

Wire-speed line cards deliver up to 14.4 billion packets per second of forwarding with a distributed virtual output queue architecture and a lossless fabric that eliminates head-of-line blocking and provides fairness across all ports. Linecards contain up to 18GB of packet memory, approximately 100 msec of traffic buffering per ingress port, virtually eliminating packet drops in congestion scenarios. Linecards connect to all fabric modules in a nonblocking full mesh. The Arista 7508 and 7504 chassis can be populated with any combination of line cards. For environments requiring the highest performance combined with scalability, a range of speed and density options is available. This addresses the requirement for dense 1/10G, 40G and 100G, with full support for industry-standard connections and comprehensive layer 2 and 3 features for flexible deployment choices. Embedded optics are combined with MPO interfaces to provide a multi-speed port (MXP) capability that increases system density with a choice of 10G/40G/100G interfaces. MXP ports support a mix-and-match option of 12 x 10G, 3 x 40G or 1 x 100G per port. With support for up to 450m over multi-mode fiber, the MXP ports provide a high-density solution and seamless migration from 10G to 100G without replacing transceivers or lowering system density.
- 12-port 100GbE SR10 MXP linecard with embedded optics
  o Maximum 96 100GbE, 288 40GbE and 1,152 10GbE ports
  o Up to 300/450m on OM3/OM4 for standards-compliant 10G, 40G and 100G
- 36-port QSFP+ 40G linecard for 10G/40G
  o 288 40GbE or 1,152 10GbE ports with QSFP+ optics and breakout cables
  o Choice of copper, multimode and single-mode with 40G and 10G options
- 48-port SFP+ for 1/10GbE and 2-port 100GbE SR10 MXP linecard
  o Up to 72 10G ports per linecard, or 48 1/10GbE ports and flexible 40G/100G
  o Two MXP ports allow a choice of 2 x 100GbE, 6 x 40GbE or 24 x 10GbE
- 48-port SFP+ linecard for wire-speed 1/10GbE and consistent features
  o Dense 10G with deep buffers
  o Broadest range of 1GbE and 10GbE transceivers and copper cables

Arista 7280E Series

The Arista 7280E Series extends the Arista 1RU product portfolio, providing a combination of deep buffers and the industry's first 100GbE top-of-rack switch, combined with extensive

features such as VXLAN and LANZ. The 7280E Series is built for storage networks, content delivery networks, and lossless spline/leaf data center designs. The 7280E Series is available in three models, each with 48 SFP+ ports for 1/10GbE and a choice of 40GbE and 100GbE uplinks. The 7280SE-64 has four QSFP+ uplink ports that allow a choice of four 40GbE or up to 16 additional 10GbE ports with the use of transceivers or cables. The 7280SE-72 delivers two 100GbE uplinks through the use of Arista MXP interfaces and embedded optics; each MXP port enables twelve 10GbE, three 40GbE or one 100GbE for a wide choice of cost-effective connections. The 7280SE-68 has two 100GbE QSFP uplinks that allow the use of both 100GbE and 40GbE optics for the widest range of short- and long-reach connection options, including active and passive cables. All models in the 7280E Series deliver rich layer 2 and layer 3 features with wire-speed performance of up to 1.44 terabits per second. The Arista 7280E Series offers a virtual output queue architecture combined with an ultra-deep 9GB of packet buffers, eliminating head-of-line blocking and allowing lossless forwarding under sustained congestion and the most demanding application loads. Combined with Arista EOS, the 7280E Series delivers advanced features for HPC, big data, content delivery, cloud and virtualized environments.

Figure 9: 7280E Series Switches

Santa Clara Corporate Headquarters
5453 Great America Parkway
Santa Clara, CA 95054
Tel: 408-547-5500
www.arista.com

Ireland International Headquarters
4130 Atlantic Avenue
Westpark Business Campus
Shannon, Co. Clare, Ireland

Singapore APAC Administrative Office
9 Temasek Boulevard
#29-01, Suntec Tower Two
Singapore 038989

Copyright 2014 Arista Networks, Inc. All rights reserved. CloudVision and EOS are registered trademarks and Arista Networks is a trademark of Arista Networks, Inc. All other company names are trademarks of their respective holders. Information in this document is subject to change without notice. Certain features may not yet be available. Arista Networks, Inc. assumes no responsibility for any errors that may appear in this document. 10/14