HP Big Data Reference Architecture for Apache Spark
HP Reference Architectures

HP Big Data Reference Architecture for Apache Spark
Using the HP Apollo 4200 server, Apollo 2000 system, and Hortonworks HDP 2.x

Table of contents
Executive summary
Introduction
Solution overview
Benefits of HP BDRA for Apache Spark
Solution components
Minimum configuration
Networking
Capacity and sizing
Expanding the base configuration
Guidelines for calculating storage needs
Configuration guidance
Setting up HDP
Configuring compression
Bill of materials
Summary
Implementing a proof-of-concept
Appendix A: Alternate compute node components
Appendix B: Alternate storage node components
Appendix C: HP value added services and support
For more information
Executive summary

This white paper describes a new solution deploying the capabilities of Apache Spark (Spark) on the Hortonworks Data Platform (HDP). It introduces the HP Apollo 2000 compute system and the HP Apollo 4200 storage nodes as valuable components of the solution, highlights the benefits of using Spark in the HP Big Data Reference Architecture (HP BDRA), and provides guidance on selecting the appropriate configuration for a Spark-focused Hadoop cluster on HDP to meet particular business needs. In addition to simplifying the procurement process, the paper also provides guidelines for configuring HDP once the system has been deployed.

There is an ever-growing need for a scalable, modern architecture for the consolidation, storage, access, and processing of big data analytics. Big data solutions using Hadoop are evolving away from a simple model where each application was deployed on a dedicated cluster of identical nodes. The analytics and processing engines themselves have grown from MapReduce alone to a very broad set that now includes engines such as Spark. By integrating the significant advances that have occurred in fabrics, storage, container-based resource management, and workload-optimized servers since the inception of the Hadoop architecture in 2005, HP BDRA provides a cost-effective and flexible solution to optimize computing infrastructure in response to these ever-changing requirements in the Hadoop ecosystem; in this implementation, the focus is Apache Spark. These technologies and how they are configured are described here in detail.

During performance testing with Hortonworks HDP 2.3 running Spark in YARN client mode on an HP BDRA cluster, Spark performed more than 10% better than MapReduce for TeraSort-type workloads. Workloads that exercise the full spectrum of compute resources, such as cores and memory, including PageRank-type jobs, also performed exceptionally well. Spark SQL is particularly well suited to a procedural context in which SQL code is combined with Scala, Java, or Python programs. The combination of HDP 2.3 running Spark on YARN on HP BDRA provides the blend of speed, flexibility, scalability, and operational optimization needed by today's businesses. See the section Benefits of HP BDRA for Apache Spark in this paper for more on the advantages of this solution.

Target audience: This paper is intended for decision makers, system and solution architects, Hadoop administrators, system administrators, and experienced users who are interested in reducing design time for, or simplifying the purchase of, a big data system deploying Spark. An intermediate knowledge of Apache Hadoop and scale-out infrastructure is recommended.

Document purpose: This document describes the optimal way to configure Hortonworks HDP 2.3 running Spark in YARN mode on the HP Big Data Reference Architecture. It documents solution testing performed from July through September 2015 jointly by Hortonworks and HP engineers.

Introduction

As companies grow their big data implementations, they often find themselves deploying multiple clusters to support their needs. This could be to support different big data environments (MapReduce, Spark, NoSQL DBs, MPP DBMSs, etc.), to support rigid workload partitioning for departmental requirements, or simply as a byproduct of multi-generational hardware. This often leads to data duplication and movement of large amounts of data between systems to address an organization's business requirements.
Many enterprise customers are searching for a way to recapture some of the traditional benefits of a converged infrastructure, such as the ability to easily share data between different applications running on different platforms, the ability to scale compute and storage separately, and the ability to rapidly provision new servers without repartitioning data to achieve optimal performance. To address these needs, HP engineers challenged the traditional Hadoop architecture, which always co-locates compute elements in a server with data. While this approach works, the real power of Hadoop is that tasks run against specific slices of data without the need for coordination; consequently, the distributed lock management and distributed cache management found in older parallel database designs are no longer required. Moreover, in a modern server there is often more network bandwidth available to ship data off the server than there is bandwidth to disk. The HP BDRA for Spark is deployed as an asymmetric cluster, with some nodes dedicated to computation and others dedicated to the Hadoop Distributed File System (HDFS). Hadoop still works the same way: tasks still have complete ownership of their portion of the data (functions are still being shipped to the data), but the computation is performed on a node optimized for this work, and file system operations are executed on a node that is similarly optimized for that work. Of particular interest is that this approach can actually perform better than a traditional Hadoop cluster. For more information on this architecture and the benefits it provides, please see the HP BDRA overview document.
Solution overview

HP Big Data Reference Architecture (BDRA)

The HP Big Data Reference Architecture is a highly optimized configuration built using unique servers offered by HP: the HP Apollo 4200 Gen9 for the high-density storage layer and the HP Apollo 2000 system with ProLiant XL170r nodes for the high-density computational layer. This configuration is the result of a great deal of testing and optimization done by HP engineers, resulting in the right set of software, drivers, firmware, and hardware to yield extremely high density and performance. Simply deploying Hadoop onto a collection of traditional servers in an asymmetric fashion will not yield the kind of benefits that are seen with HP BDRA. To simplify the build for customers, HP provides the exact bill of materials to allow a customer to purchase this complete solution. HP Technical Services Consulting can then, at the customer's option (recommended), install the prebuilt operating system images, verify that all firmware and software versions are correctly installed, and run a suite of tests to verify that the configuration is performing optimally. Once this has been done, the customer can perform a standard Hortonworks installation using the recommended guidelines in this document. Special tuning is required, and performed, for Spark to work optimally on the BDRA. To facilitate installation, HP has developed a broad range of Intellectual Property (IP) that allows solutions to be implemented by HP or, jointly, with partners and customers. Additional assistance is available from HP Technical Services. For more information, refer to Appendix C: HP value added services and support.

Figure 1. HP BDRA, changing the economics of work distribution in big data

The HP BDRA design is anchored by the following HP technologies:
Storage nodes: HP Apollo 4200 Gen9 servers with 28 LFF HDDs make up the HDFS storage layer, providing a single repository for big data.
Compute nodes: HP Apollo 2000 system nodes deliver a scalable, high-density layer for compute tasks and provide a framework for workload optimization, with four ProLiant XL170r nodes in a single chassis.
High-speed networking separates compute nodes and storage nodes, creating an asymmetric architecture that allows each tier to be scaled individually; there is no commitment to a particular CPU/storage ratio.

Since compute is no longer collocated with storage, Hadoop does not achieve node locality. However, rack locality works in exactly the same way as in a traditional converged infrastructure; that is, as long as one scales within a rack, overall scalability is not affected. With compute and storage de-coupled, many of the advantages of a traditional converged system can be enjoyed again. For example, compute and storage can be scaled independently simply by adding compute nodes or storage nodes. Testing carried out by HP indicates that most workloads respond almost linearly to additional compute resources.
Hadoop YARN

YARN is a key feature of the latest generation of core Hadoop and of the HP BDRA. It decouples computational resource management and scheduling for processing engines such as MapReduce and Spark from the data components, such as HDFS, allowing Hadoop to support more varied workloads and a broader array of applications. New to YARN is the concept of labels for groupings of compute nodes. Jobs submitted through YARN can now be flagged to perform their work on a particular set of nodes when the appropriate label name is included with the job. Thus, an organization can create groups of compute resources that are designed, built, or optimized for particular types of work, allowing jobs to be passed out to subsets of the hardware that are optimized for that type of computation. In addition, the use of labels allows for isolating compute resources that may be required to perform a high-priority job, for example, ensuring that sufficient resources are available at a given time.

Apache Spark

Apache Spark is a fast, general engine for large-scale data processing. It consists of Spark Core and a set of libraries. The core is the distributed execution engine, and the Java, Scala, and Python APIs offer a platform for scalable distributed application development. Additional libraries, built atop the core, support diverse workloads for streaming, SQL, and machine learning. Spark's key advantage is speed: with an advanced DAG execution engine that supports cyclic data flow and in-memory computing, it runs programs much faster than Hadoop MapReduce. It offers ease of use for writing applications quickly in Java, Scala, Python, and R, in addition to providing interactive modes from the Scala, Python, and R shells. Spark powers a stack of libraries including SQL and DataFrames, MLlib for machine learning, GraphX, and Spark Streaming. These libraries can be combined seamlessly in the same application. Spark executed on YARN in client or cluster mode is ideally suited for generic Spark workloads, although Spark can also run in standalone cluster mode or on Apache Mesos, with the ability to access data in HDFS, HBase, Hive, Tachyon, and any Hadoop data source. Spark can be considered an alternative to Hadoop MapReduce, and it complements Hadoop by providing a unified solution to manage various big data use cases and requirements. Spark processing can be combined with Spark SQL, machine learning, and Spark Streaming, all in the same program.

Spark on YARN

Spark on YARN is an optimal way to schedule and run Spark jobs on a Hadoop cluster alongside a variety of other data-processing frameworks, leveraging existing clusters using queue placement policies, and enabling security by running on Kerberos-enabled clusters.

Hortonworks Data Platform

Hortonworks Data Platform (HDP) is a platform for multi-workload data processing. It can utilize a range of processing methods, from batch to interactive and real-time, all supported by solutions for governance, integration, security, and operations. HDP integrates with and augments solutions like HP BDRA for Spark, allowing you to maximize the value of big data. HDP enables Open Enterprise Hadoop, a full suite of essential Hadoop capabilities in the following functional areas: data management, data access, data governance and integration, security, and operations. For more information on HDP, refer to hortonworks.com/hdp. Spark is part of HDP and is certified as YARN Ready.
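As a concrete illustration of the two YARN submission modes discussed above, the following sketch submits the SparkPi example that ships with the HDP Spark package; the jar path follows the usual HDP layout and may differ in your installation.

# Sketch: submitting the bundled SparkPi example in both YARN modes.

# YARN client mode: the driver runs in this shell; suited to interactive work.
spark-submit --master yarn-client \
    --class org.apache.spark.examples.SparkPi \
    /usr/hdp/current/spark-client/lib/spark-examples-*.jar 100

# YARN cluster mode: the driver runs inside the YARN ApplicationMaster and the
# client can disconnect once the application starts; suited to production jobs.
spark-submit --master yarn-cluster \
    --class org.apache.spark.examples.SparkPi \
    /usr/hdp/current/spark-client/lib/spark-examples-*.jar 100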
Memory- and CPU-intensive Spark-based applications can coexist with other workloads deployed in a YARN-enabled cluster, as shown in Figure 2. This approach avoids the need to create and manage dedicated Spark clusters and allows for more efficient resource use within a single cluster.

Figure 2. HDP on Spark integration for Enterprise Hadoop
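The YARN node labels described earlier are one way to steer Spark work onto a designated tier of compute nodes. A minimal sketch follows; the "highmem" label, queue name, and hostnames are hypothetical, and node labels must first be enabled in YARN and the Capacity Scheduler.

# Define a cluster node label and attach it to selected NodeManagers.
yarn rmadmin -addToClusterNodeLabels "highmem"
yarn rmadmin -replaceLabelsOnNode "compute01.example.com=highmem compute02.example.com=highmem"

# Submit to a queue that the Capacity Scheduler maps to the "highmem" label;
# the job's containers are then placed only on the labeled nodes.
spark-submit --master yarn-cluster --queue highmem \
    --class org.apache.spark.examples.SparkPi \
    /usr/hdp/current/spark-client/lib/spark-examples-*.jar 100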
Hortonworks approached Spark the same way as other data access engines like Storm, Hive, and HBase: outline a strategy, rally the community, and contribute key features within the Apache Software Foundation's process. Below is a summary of the various integration points that make Spark enterprise-ready.

Support for the ORC file format: As part of the Stinger Initiative, the Hive community introduced the Optimized Row Columnar (ORC) file format. ORC is a columnar storage format that is tightly integrated with HDFS and provides optimizations for both read performance and data compression. It is rapidly becoming the de facto storage format for Hive. HDP provides basic support for ORC files in Spark.

Security: A common customer scenario is to run Spark-based applications alongside other applications in a secure Hadoop cluster and leverage the authorization offered by HDFS. By running Spark on Kerberos-enabled clusters, only authenticated users can submit Spark jobs.

Operations: Hortonworks streamlines operations for Spark through the 100% open source Apache Ambari. Customers use Ambari to provision, manage, and monitor their HDP clusters. With Ambari Stacks, Spark components and services can be managed by Ambari, so that you can install, start, stop, and configure to fine-tune a Spark deployment, all via the single interface that is used for all engines in your Hadoop cluster.

Improved reliability and scale of Spark on YARN: The Spark API is used by developers to create both iterative and in-memory applications on Hadoop YARN for efficient cluster resource usage. With dynamic executor allocation on YARN, Spark uses only as many executors as needed, within configured bounds, improving the efficiency of cluster resources.

YARN ATS integration: Spark has been integrated with the YARN Application Timeline Server (ATS), which provides generic storage and retrieval of an application's current and historic information. This permits a common integration point for certain classes of operational information and metrics. With this integration, the cluster operator can take advantage of information already available from YARN to gain additional visibility into the health and execution status of Spark jobs.

Benefits of HP BDRA for Apache Spark

While the most obvious benefits center on density and price/performance, there are other benefits of running Spark on HP BDRA. The Apollo 2000 system with ProLiant XL170r nodes, with a maximum capacity of 512GB of memory per node, provides the flexibility to scale memory based on the Spark workload. The local SSDs on the ProLiant XL170r servers provide ideal speed and flexibility for shuffle operations during a Spark job, and local storage can scale up to six SFF disks on each node.

Elasticity: HP BDRA is designed for flexibility. Compute nodes can be allocated very flexibly without redistributing data; for example, nodes can be allocated by time-of-day or even for a single job. The organization is no longer committed to yesterday's CPU/storage ratios, leading to much more flexibility in design and cost. Moreover, with HP BDRA, the system is only grown where needed. Since Spark runs on YARN, specific Spark jobs can be run on tiered compute nodes by making use of YARN labels.

Consolidation: HP BDRA is based on the Hadoop Distributed File System (HDFS), which performs well and scales to large capacities, accommodating the multiple sources of big data within any organization. Various pools of data currently being used in big data projects can be consolidated into a single, central repository.
YARN-based Spark workloads access the big data directly via HDFS; other workloads can access the same data via appropriate connectors.

Workload optimization: There is no one go-to software for big data; instead, there is a federation of data management tools. After selecting the appropriate tool to meet organizational requirements, run the job using the compute nodes that are best suited for the workload, such as compute-intensive ProLiant XL170r nodes in the Apollo r2000 chassis.

Enhanced capacity management: Compute nodes can be provisioned on the fly, while storage nodes now constitute a smaller subset of the cluster and, as such, are less costly to overprovision. In addition, managing a single data repository rather than multiple different clusters reduces overall management costs. Spark compute operations are isolated to the compute nodes.

Faster time-to-solution: Processing big data typically requires the use of multiple data management tools. When these tools are deployed on conventional Hadoop clusters with dedicated, often fragmented copies of the data, time-to-solution can be lengthy. With HP BDRA, data is unfragmented and consolidated in a single data lake, allowing different tools to access the same data. Spark running on YARN provides mechanisms to share this data using YARN resource management. Thus, more time can be spent on analysis and less on shipping data; time-to-solution is typically faster.

The architecture of Spark provides the ability to complete tasks quickly on efficient computing clusters like the HP BDRA. It takes advantage of the raw compute power of the CPUs and memory in compute nodes and isolates disk-based read/write operations to storage nodes. In addition, the local SSD disks on compute nodes provide fast I/O for intermediate data stores. Of the various ways to run Spark applications, Spark on YARN mode is best suited to run Spark jobs, as it makes the best use of cluster resources on an HP BDRA cluster with large memory capacity and many CPU cores on compute nodes.
Spark on HP BDRA is configured to run in Spark on YARN mode, using both client and cluster modes. In YARN cluster mode, the Spark driver runs inside an ApplicationMaster process that is managed by YARN on the cluster, and the client can exit once the application is initiated. In YARN client mode, the driver runs in the client process, and the ApplicationMaster is used only for requesting resources from YARN. Typically, most development is done in YARN client mode, using Spark interactively; YARN cluster mode is ideal for production workloads. Running Spark on YARN requires a binary distribution of Spark built with YARN support; the Spark version shipped with HDP comes with YARN support prebuilt.

Spark is able to take advantage of the separate compute and storage tiers in HP BDRA: Spark on YARN mode utilizes the NodeManagers running on the ProLiant XL170r compute nodes for distributed task execution, and the DataNodes running on Apollo 4200 storage nodes for distributed data storage on HDFS. Memory for each Spark job is application-specific and is scaled by configuring executor memory in proportion to the number of partitions and cores per executor. It is recommended that 75% of memory be allocated for Spark, leaving the rest for the OS and buffer cache. Memory usage is greatly affected by storage level and serialization format. With the large memory capacity of Apollo 2000 systems with ProLiant XL170r nodes (128GB/256GB/512GB), multiple JVMs are run to facilitate more tasks per node, thus speeding up Spark jobs. The local SSDs on ProLiant XL170r compute nodes help Spark store data that does not fit in memory and preserve intermediate output between stages of a Spark job. Spark scales well to tens of CPU cores per machine because it performs minimal sharing between threads. When data is in memory, many Spark applications are network bound; with 40Gb network bandwidth on storage nodes, HP BDRA provides fast application performance for distributed reduce applications such as group-bys, reduce-bys, and SQL joins. For TeraSort-type workloads, the Spark intermediate files and memory spill are written to the fast SSD disks locally on each of the compute nodes. An optimized configuration of Spark on HP BDRA, as provided in the Configuration guidance section of this white paper, utilizes CPU and memory effectively to run parallel Spark tasks quickly, delivering excellent performance. During performance testing with Hortonworks HDP 2.3 running Spark on YARN on HP BDRA, Spark performed more than 10% better than MapReduce jobs for TeraSort-type workloads; workloads exercising the full spectrum of compute resources, such as cores and memory, including PageRank-type jobs, also performed exceptionally well; and Spark SQL proved particularly well suited to a procedural context in which SQL code is combined with Scala, Java, or Python programs.
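Because shuffle spill lands on the compute nodes' local SSDs, it can be worth confirming where that scratch space lives. A small sketch follows, assuming the /data mount and the YARN local-dir layout used later in this paper; the device name is hypothetical.

# On a compute node, confirm the SSD-backed scratch area used for shuffle
# spill. /data matches the spark.local.dir and yarn.nodemanager.local-dirs
# settings in the Configuration guidance section.
df -h /data
cat /sys/block/sda/queue/rotational   # 0 = SSD, 1 = rotational disk

# Watch intermediate shuffle files grow while a job runs:
watch -n 5 'du -sh /data/hadoop/yarn/local/usercache/*/appcache/* 2>/dev/null | tail'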
Solution components

Figure 3 provides a basic conceptual diagram of HP BDRA for Spark.

Figure 3. HP BDRA for Spark solution overview

For full BOM listings of products selected for the proof of concept, refer to the Bill of materials section of this white paper.

Note
When running Spark, the following functions are added to the existing two head nodes: the Spark history server and the Spark timeline server. It is also recommended to run the Hive Metastore, HiveServer2, MySQL Server, and WebHCat server on the head nodes.

Minimum configuration

Figure 4 shows the minimum recommended HP BDRA for Spark configuration: 12 compute nodes housed in three 2U chassis and four 2U storage nodes with 28 LFF disks each.

Figure 4. Base HP BDRA for Spark configuration, with detailed lists of components

Best practice
HP recommends starting with three Apollo 2000r chassis consisting of 12 XL170r nodes with 256GB of memory per compute node, paired with four Apollo 4200 Gen9 servers as storage nodes.

The following nodes are used in the base HP BDRA for Spark configuration.
Compute nodes

The base HP BDRA configuration features three HP Apollo 2000r chassis containing a total of 12 HP ProLiant XL170r Gen9 hot-pluggable server nodes, as shown in Figures 5 and 6.

Figure 5. HP ProLiant XL170r Gen9

Figure 6. HP Apollo 2000r chassis (back and front)

The HP Apollo 2000 system is a dense solution with four independent HP ProLiant XL170r Gen9 hot-pluggable server nodes in a standard 2U chassis. Each ProLiant XL170r Gen9 server node can be serviced individually without impacting the operation of other nodes sharing the same chassis, providing increased server uptime. Each server node harnesses the performance of 2133 MHz memory (16 DIMM slots per node) and dual Intel Xeon E5-2600 v3 processors in a very efficient solution that shares both power and cooling infrastructure. Additional features include:

Support for high-performance DDR4 memory and Intel Xeon E5-2600 v3 processors up to 18 cores, 145W
Additional PCIe riser options for flexible and balanced I/O configurations
FlexibleLOM feature for additional network expansion options
Support for dual M.2 drives

For more information on the Apollo r2000 chassis and the HP ProLiant XL170r Gen9 server, refer to the HP QuickSpecs. Each of these compute nodes typically runs a YARN NodeManager.

Note
Alternate compute nodes: The HP ProLiant m710p server cartridge (Intel Xeon E3-1284L v4, 2.9GHz-3.8GHz), housed in the HP Moonshot 1500 chassis, provides users with an alternate choice of compute servers for the HP BDRA for Spark configuration when denser compute nodes are desired. Spark workloads make use of the large total core count and the fast local SSD disks for intermediate data during shuffle operations. The m710p cartridges come with Intel's Broadwell-generation processors. For BOM information, refer to Appendix A: Table A-4.
Management/head nodes

Three HP ProLiant DL360 Gen9 servers are configured as management/head nodes with the following functionality:

Management node, with Ambari, HP Insight Cluster Management Utility (Insight CMU), ZooKeeper, and JournalNode
Head node 1, with active ResourceManager/standby NameNode, ZooKeeper, JobHistoryServer, and JournalNode. Additional functions for Spark: SparkJobServer, HiveServer2, and MySQL Server
Head node 2, with active NameNode/standby ResourceManager, ZooKeeper, and JournalNode. Additional functions for Spark: Spark ATS, Hive Metastore, and WebHCat Server

Typically, the Spark client is installed on the management and head nodes, from where Spark client-mode jobs can be started.

Storage nodes

There are four Apollo 4200 Gen9 servers. Figure 7 shows the HP Apollo 4200 Gen9 server. Each node is configured with 28 LFF disks, and each runs an HDFS DataNode.

Figure 7. HP Apollo 4200 Gen9 Server

The HP Apollo 4200 Gen9 server offers revolutionary storage density in a 2U form factor for data-intensive workloads such as Apache Spark on Hadoop HDFS. It allows you to save valuable data center space through its unique density-optimized 2U form factor, which holds up to 28 LFF disks, with capacity for up to 224 TB per server. It has the ability to grow your big data solutions with a density-optimized infrastructure that is ready to scale. And no special racks are required: the Apollo 4200 fits easily into standard racks with a depth of 31.5 inches per server. For more detailed information, refer to the HP QuickSpecs.

Power and cooling

When planning large clusters, it is important to properly manage power redundancy and distribution. To ensure servers and racks have adequate power redundancy, HP recommends that each Apollo 2000r chassis, each HP Apollo 4200 Gen9, and each HP ProLiant DL360 Gen9 server have a backup power supply, and that each rack have at least two Power Distribution Units (PDUs). There is additional cost associated with procuring redundant power supplies; however, the need for redundancy is less critical in larger clusters, where the inherent redundancy within HDP ensures there would be less impact in the event of a failure.

Best practice
For each chassis, HP Apollo 4200 Gen9, and HP ProLiant DL360 Gen9 server, HP recommends connecting each of the device's power supplies to a different PDU. Furthermore, the PDUs should each be connected to a separate data center power line to protect the infrastructure from a line failure. Distributing server power supply connections evenly to the PDUs, while also distributing PDU connections evenly to data center power lines, ensures an even power distribution in the data center and avoids overloading any single power line. When designing a cluster, check the maximum power and cooling that the data center can supply to each rack, and ensure that the rack does not require more power and cooling than is available.
Networking

Two IRF-bonded HP 5930 QSFP+ switches are specified in each rack for high performance and redundancy. Each provides six 40GbE uplinks that can be used to connect to the desired network or, in a multi-rack configuration, to another pair of HP QSFP+ switches used for aggregation. The other 22 ports, along with QSFP+ 4x10G SFP+ DAC cables, are available for connecting the compute nodes that run Spark jobs.

Figure 8. HP QSFP+ switch allocation for various node types

Note
IRF bonding on HP 5930 switches requires four 40GbE ports per switch, leaving six 40GbE ports on each switch for uplinks. QSFP+ 4x10G SFP+ DAC cables are used to connect compute and management nodes to the QSFP+ ports on the HP 5930 switches.

iLO network

A single HP 5900 switch is used exclusively to provide connectivity to the HP Integrated Lights-Out (HP iLO) management ports, which run at or below 1GbE. The HP iLO network is used for system provisioning and maintenance.

Cluster isolation and access configuration

It is important to isolate the Spark-based Hadoop cluster so that external network traffic does not affect the performance of the cluster. In addition, isolation allows the Hadoop cluster to be managed independently from its users, ensuring that the cluster administrator is the only person able to make changes to the cluster configuration. Thus, HP recommends deploying the ResourceManager, NameNode, and worker nodes on their own private Hadoop cluster subnet.

Key point
Once a Spark-based Hadoop cluster is isolated, users still need a way to access the cluster and submit jobs to it. To achieve this, HP recommends multi-homing the management node so that it can participate in both the Hadoop cluster subnet and a subnet belonging to users. Ambari is a web application that runs on the management node, allowing users to manage and configure the Hadoop cluster and view the status of jobs without being on the same subnet, provided that the management node is multi-homed. Furthermore, this approach allows users to shell into the management node, run the Spark, Apache Pig, or Apache Hive command line interfaces, and submit jobs to the cluster that way.
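A short sketch of that access path, with hypothetical hostnames, user, and realm; the kinit step applies only to Kerberos-enabled clusters.

# A user on the corporate subnet reaches the cluster through the multi-homed
# management node.
ssh user@mgmt01.example.com

# On Kerberos-enabled clusters, authenticate before submitting work.
kinit user@EXAMPLE.COM

# The private cluster subnet is now reachable; submit jobs from here.
spark-submit --master yarn-client --class com.example.MyApp /home/user/my-app.jar
hive -e "SHOW TABLES;"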
Staging data

After the Spark-based Hadoop cluster has been isolated on its own private network, you must determine how to access HDFS in order to ingest data. The HDFS client must be able to reach every Hadoop DataNode in the cluster in order to stream blocks of data onto the HDFS. HP BDRA for Spark provides the following options for staging data (a brief ingest sketch follows below):

Using the management node: The already multi-homed management server is used to stage data. HP recommends configuring this server with eight disks to provide sufficient disk capacity for a staging area when ingesting data into the Hadoop cluster from another subnet.

Using the ToR switches: This reference architecture makes use of open ports in the ToR switches. HP BDRA is designed such that, if both NICs on each storage node are used, eight ports are used on each Apollo 2000r chassis (two from each ProLiant XL170r node), and two NICs are used on each management/head node, the remaining 40GbE ports on the ToR switches can be used by multi-homed systems outside the Hadoop cluster to move data into the cluster.

Note
The benefit of using dual-homed management node(s) to isolate in-cluster Hadoop traffic from the ETL traffic flowing to the cluster may be debated. This enhances security; however, it may result in ETL performance/connectivity issues, since relatively few nodes are capable of ingesting data. For example, you may wish to initiate Sqoop tasks on the compute nodes to ingest data from an external RDBMS in order to maximize the ingest rate. However, this approach requires worker nodes to be exposed to the external network to parallelize data ingestion, which is less secure. Consider the options before committing to an optimal network design for your particular environment. HP recommends dual-homing the entire cluster, allowing every node to ingest data.

Capacity and sizing

Capacity planning for Spark workloads is use-case specific; however, HP BDRA for Spark with 12 compute nodes and four storage nodes is a good starting point for a modern Spark cluster, offering efficient utilization of asymmetric, distributed compute and storage with large memory and network bandwidth. Spark on HP BDRA scales well to higher CPU core counts, as there is minimal sharing between threads and more tasks can be run in parallel. For compute node memory, more is generally better, although Spark jobs with massive heaps (>200GB) tend to trigger garbage collection (GC) in the Java Virtual Machine, which reduces performance. For Apollo 2000r with ProLiant XL170r compute nodes, 256GB of memory is the sweet spot for general Spark jobs. For Spark jobs that have large shuffle operations, the local disks on compute nodes need to be sufficient to handle the intermediate data that is saved to the local SSDs. Accordingly, it is recommended to increase the local storage capacity on XL170r compute nodes as workload needs dictate.

Expanding the base configuration

As needed, compute and/or storage nodes can be added to a base HP BDRA for Spark configuration. The minimum HP BDRA configuration can expand to a total of 18 2U mixed chassis/nodes in a single rack. There are a number of options for configuring a single-rack HP BDRA for Spark solution, ranging from hot (with a large number of compute nodes and minimal storage) to cold (with a large number of storage nodes and minimal compute). Care should be taken to allocate ports on the HP 5930 switches to balance compute and storage traffic, as per Figure 8.
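Returning to the staging-data options above, a minimal ingest sketch through the management node; file names, directories, and hostname are hypothetical.

# Copy a source file to the staging area on the multi-homed management node.
scp /srv/exports/clickstream-2015-09.csv user@mgmt01.example.com:/staging/

# On the management node, push the file into the cluster's HDFS.
hdfs dfs -mkdir -p /data/raw/clickstream
hdfs dfs -put /staging/clickstream-2015-09.csv /data/raw/clickstream/
hdfs dfs -ls /data/raw/clickstream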
Multi-rack configuration

The single-rack HP BDRA for Spark configuration is designed to perform well as a standalone solution, but it can also form the basis of a much larger multi-rack solution, as shown in Figure 9. When moving from a single rack to a multi-rack solution, you simply add racks without having to change any components within the base rack. A multi-rack solution assumes the base rack is already in place and extends its scalability; for example, the base rack already provides sufficient management services for a large scale-out system.
Figure 9. Multi-rack HP BDRA for Spark, extending the capabilities of a single rack

Note
While much of the architecture for the multi-rack solution was borrowed from the single-rack design, the architecture suggested here for HP BDRA for Spark multi-rack solutions is based on previous iterations of Hadoop testing on the HP ProLiant DL380e platform rather than on Apollo servers. It is provided here as a general guideline for designing multi-rack Spark-based Hadoop clusters.

Extending networking in a multi-rack configuration

For performance and redundancy, two HP QSFP+ ToR switches are specified per expansion rack. The HP QSFP+ switch includes up to six 40GbE uplinks that can be used to connect these switches to the desired network via a pair of HP QSFP+ aggregation switches.

Guidelines for calculating storage needs

HP BDRA for Spark cluster storage sizing requires careful planning, based on the identification of current and future storage and compute needs. The following are general considerations for data inventory:

Sources of data
Frequency of data
Raw storage
Processed HDFS storage
Replication factor
Default compression turned on
Space for Spark intermediate files on SSD disks on compute nodes
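As a rough illustration of how these factors combine, the following back-of-envelope sketch uses entirely hypothetical figures: 2TB/day of raw ingest, one year of retention, roughly 50% savings from compression, 3x HDFS replication, and about 25% headroom for temporary and processed data.

RAW_TB_PER_DAY=2
DAYS=365
raw=$(( RAW_TB_PER_DAY * DAYS ))     # 730 TB of raw data per year
compressed=$(( raw / 2 ))            # ~365 TB after compression
replicated=$(( compressed * 3 ))     # ~1095 TB on disk with 3x replication
total=$(( replicated * 125 / 100 ))  # ~1368 TB including headroom
echo "Plan for roughly ${total} TB of HDFS capacity"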
To calculate your storage needs, identify the number of TB of data needed per day, week, month, and year, and then add the ingestion rates of all data sources. It makes sense to identify storage requirements for the short, medium, and long term. Another important consideration is data retention, both size and duration: which data must you keep, and for how long? In addition, consider the maximum fill rate and file system format space requirements on hard drives when estimating storage size. The storage size on each compute node is equally important, as it provides temporary storage for data that does not fit in memory while running Spark jobs, as well as preserving intermediate output between Spark stages. An Apollo 2000 system with XL170r nodes has capacity for six SFF disks per node; fast, enterprise-grade SSDs are the best choice for compute nodes to deliver optimal performance for Spark jobs.

Configuration guidance

The instructions provided below assume that the Spark-based Hadoop cluster has already been created on an HP BDRA solution. They are intended to assist in optimizing the setup of the various HDP services on this reference architecture. The Spark configuration settings are recommended as a starting point for actual deployment scenarios.

Setting up HDP

Management node components
Management nodes contain the software shown in Table 1.

Table 1. Management node base software components (software: description)
Red Hat Enterprise Linux (RHEL) 6.6: Recommended operating system
Oracle Java Development Kit (JDK) 1.7.0_80: JDK
Ambari Server: Hortonworks HDP management
Spark Client: Used to initiate Spark YARN-client jobs

Head node components
Head nodes contain the software shown in Table 2.

Table 2. Head node base software components (software: description)
Red Hat Enterprise Linux (RHEL) 6.6: Recommended operating system
Oracle Java Development Kit (JDK) 1.7.0_80: JDK
Ambari Agent: Hortonworks HDP management
Spark Client: Used to initiate Spark YARN-client jobs
Spark Job Server: (optional) Job Server for Spark applications
Compute node components
Compute nodes contain the software shown in Table 3.

Table 3. Compute node base software components (software: description)
Red Hat Enterprise Linux (RHEL) 6.6: Recommended operating system
Oracle Java Development Kit (JDK) 1.7.0_80: JDK
NodeManager: NodeManager process for MR2/YARN

Storage node components
Storage nodes contain the software shown in Table 4.

Table 4. Storage node base software components (software: description)
RHEL 6.6: Recommended operating system
Oracle JDK 1.7.0_80: JDK
DataNode: DataNode process for HDFS

The configuration tuning provided below is for the minimum HP BDRA for Spark configuration shown in Figure 4.

YARN
Make the following changes in the YARN service.

Specify the amount of memory available for the YARN ResourceManager:
ResourceManager Java heap size 1024 MB

Specify the amount of physical memory that can be allocated for containers (about 112GB of the 128GB baseline):
yarn.nodemanager.resource.memory-mb 114688

Specify the location of YARN local log files on the compute nodes:
yarn.nodemanager.log-dirs /data/hadoop/yarn/log

Specify the location of the YARN local containers on the compute nodes:
yarn.nodemanager.local-dirs /data/hadoop/yarn/local

Specify the maximum heap size of the NodeManager:
NodeManager Java heap 4096 MB

Capacity Scheduler
Specify the minimum amount of memory allocated by YARN for a container:
yarn.scheduler.minimum-allocation-mb 2048

Specify the amount of memory for all YARN containers on a node. Since compute nodes on BDRA each have a minimum configuration of 128GB of memory, HP recommends allocating about 112GB for YARN:
yarn.scheduler.maximum-allocation-mb 114688

Note
The HP baseline configuration is 128GB. Appropriate changes will need to be configured based on the number of GB per node deployed.
Specify the number of cores available on compute nodes (total vcores on each node = 56):
yarn.nodemanager.resource.cpu-vcores 48

The node-locality-delay specifies how many scheduling intervals to let pass while attempting to find a node-local slot to run on, prior to searching for a rack-local slot. This setting is very important for small jobs that do not have a large number of maps or reduces, as it will better utilize the compute nodes. We highly recommend this value be set to 1:
yarn.scheduler.capacity.node-locality-delay=1

Spark
Hortonworks HDP 2.3 comes with Spark 1.3.1 by default. While this is production-ready and supported, it may be worthwhile to use the Hortonworks Tech Preview release of Spark 1.4.1. If using Spark 1.4.1, make sure to set the HDP version in the spark-env.sh file through Ambari; if you update the file directly, your changes will be overwritten every time you restart Ambari.
spark.driver.extraJavaOptions -Dhdp.version=<HDP version>
spark.yarn.am.extraJavaOptions -Dhdp.version=<HDP version>

It is recommended to keep a separate spark-defaults.conf and source it during spark-submit or the spark-sql CLI; this way, your configuration changes are not overwritten. Example:
spark-sql --properties-file /opt/hp/cs300/utils/spark-defaults-hp.conf

Configuring core Spark
The following tuning for Spark is based on internal testing done at HP. Most optimizations were made to obtain optimal TeraSort benchmarks in Spark.

Section 1: Core Spark tuning

Note
The following values are based on the minimum HP BDRA for Spark configuration shown in Figure 4. Appropriate changes will need to be made based upon the number of cores and the memory configuration selected for compute nodes.

The default parallelism should be between 1x and 2x the total number of cores on all the compute nodes. In this base HP BDRA for Spark configuration of 12 compute nodes, there are 672 cores, based on 56 vcores per node; 1.5x the total cores is a good starting point for optimal performance:
spark.default.parallelism 1008

It is optimal to assign 4-6 cores per executor:
spark.executor.cores 4

The number of executors is derived by dividing the total assigned YARN vcores (48 x 12 = 576) by the specified number of executor cores (4): 576/4 = 144:
spark.executor.instances 144

Executor memory is derived as 128GB (available memory) / 4 (executors per node) = 32GB, less roughly 7% for overhead: 32 x (1 - 0.07) is approximately 28GB:
spark.executor.memory 28G

For long-running processes, it is recommended to provide 1-2GB of memory for the Spark driver:
spark.driver.memory 2G

Your spark-submit command line arguments would appear as:
--driver-memory 2G --executor-memory 28G --num-executors 144 --executor-cores 4 --properties-file /opt/hp/bdra/utils/spark-defaults-hp.conf
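Gathered into the spark-defaults-hp.conf file referenced above, the Section 1 values would look like the following sketch; adjust the figures for your own core and memory counts.

# Write the tuned defaults for the base 12-node configuration.
cat > /opt/hp/bdra/utils/spark-defaults-hp.conf <<'EOF'
spark.default.parallelism    1008
spark.executor.cores         4
spark.executor.instances     144
spark.executor.memory        28G
spark.driver.memory          2G
EOF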
Section 2: Incremental tuning parameters (require further optimization based on the particular workload)

In yarn-cluster mode, the local directories used by the Spark executors and the Spark driver are the local directories configured for YARN (the Hadoop YARN setting yarn.nodemanager.local-dirs); if the user specifies spark.local.dir, it is ignored. In yarn-client mode, the Spark executors use the local directories configured for YARN, while the Spark driver uses those defined in spark.local.dir. This is because the Spark driver does not run on the YARN cluster in yarn-client mode; only the Spark executors do.

1. Use the local disks on each HP ProLiant XL170r Gen9 node to take advantage of the SSD drives for caching:
spark.local.dir /data

2. Set the number of default partitions in the range of 1x to 2x the number of cores. These are the split sizes/parallelism for processing data; 524 or 1056 are other good options:
spark.default.partitions 1008

3. Shuffle performance improvement:
spark.shuffle.compress false
spark.broadcast.compress true
spark.io.compression.codec org.apache.spark.io.SnappyCompressionCodec
spark.shuffle.memoryFraction 0.6
spark.shuffle.spill true
spark.shuffle.spill.compress true

4. Improving serialization performance: Serialization plays an important role in the performance of any distributed application, and is often the first thing to tune when optimizing a Spark application:
spark.serializer org.apache.spark.serializer.KryoSerializer (for the Thrift server)
spark.kryo.referenceTracking false
spark.kryoserializer.buffer

5. Improving shuffle writes per node:
spark.shuffle.io.numConnectionsPerPeer 4
spark.shuffle.file.buffer 64

Section 3: Other tuning

YARN settings:
yarn.nodemanager.resource.memory-mb 114688
yarn.nodemanager.resource.cpu-vcores 48 (number of available virtual cores = 56)
Physical CPU allocation for all containers on a node: 98%
Maximum container size = 256
CPU scheduling = enabled

HDFS config:
dfs.namenode.handler.count 600

Spark SQL

Spark SQL is ideally suited for mixed procedural jobs where SQL code is combined with Scala, Java, or Python programs. In general, the Spark SQL command line interface is used for single-user operations and ad hoc queries. For a multi-user Spark SQL environment, it is recommended to use the Thrift server connected via JDBC. Expect Hortonworks to release Spark SQL GA in the near future.

Spark SQL specific tuning

To compile each query to Java bytecode on the fly, turn on codegen. This can improve performance for large queries but can slow down very short queries:
spark.sql.codegen true
spark.sql.unsafe.enabled true

Compress the in-memory columnar storage automatically:
spark.sql.inMemoryColumnarStorage.compressed true

Below is the recommended batch size for columnar caching; larger values may cause out-of-memory problems:
spark.sql.inMemoryColumnarStorage.batchSize 1000
spark.sql.orc.filterPushdown true

To determine the number of reducers for joins and group-bys, you currently need to control the degree of post-shuffle parallelism in Spark SQL using:
spark.sql.shuffle.partitions = 1008

Method used to run TPC-DS queries in spark-sql:
spark-sql --master yarn-client --name sparksqltpcds --driver-memory 2G --executor-memory 48G --num-executors <count> --executor-cores 4 --database tpcds_bin_partitioned_orc_<scale> --properties-file /opt/hp/cs300/utils/spark-defaults-hp.conf --driver-java-options "-Dlog4j.debug -Dlog4j.configuration=file:///usr/hdp/<version>/spark/conf/log4j.properties"

Set the log level to WARN in log4j.properties to reduce logging. Running the Thrift server and connecting to spark-sql through beeline is the recommended option for multi-session testing.
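A sketch of that Thrift-server/beeline path follows; the hostname and table are hypothetical, and port 10015 is commonly used for the Spark Thrift server on HDP to avoid clashing with HiveServer2 on port 10000.

# Start the Spark Thrift server with the tuned properties file.
/usr/hdp/current/spark-client/sbin/start-thriftserver.sh \
    --master yarn-client \
    --properties-file /opt/hp/bdra/utils/spark-defaults-hp.conf

# Connect one or more beeline sessions over JDBC for multi-session testing.
beeline -u "jdbc:hive2://headnode1.example.com:10015" -n user \
    -e "SELECT COUNT(*) FROM store_sales;"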
Configuring compression

HP recommends using compression for map outputs, since it reduces file sizes on disk and speeds up disk and network I/O. Using Snappy compression, set the following MapReduce parameters to compress job output files:
mapreduce.output.fileoutputformat.compress=true
mapreduce.map.output.compress.codec=org.apache.hadoop.io.compress.SnappyCodec
mapreduce.map.output.compress=true

Compression for Hive
Set the following Hive parameters to compress the Hive output files using Snappy compression:
hive.exec.compress.output=true
hive.exec.orc.default.compress=SNAPPY

Bill of materials

The BOMs outlined in Tables 5-13 below are based on the recommended configuration for a single-rack reference architecture with the following key components:

One management node
Two head nodes
12 compute nodes (refer to Appendix A, Table A-4, for alternate compute nodes)
Four storage nodes
Two ToR switches
One HP iLO switch
Hortonworks Data Platform (HDP 2.3 with Tech Preview Spark 1.4.1)

The following BOMs contain electronic license to use (E-LTU) parts. Electronic software license delivery is now available in most countries. HP recommends purchasing electronic products over physical products (when available) for faster delivery and for the convenience of not tracking and managing confidential paper licenses. For more information, please contact your reseller or an HP representative.

Note
Part numbers are correct at the time of publication and are subject to change. The bill of materials does not include complete support options or other rack and power requirements. If you have questions regarding ordering, please consult your HP Reseller or HP Sales Representative for more details.
Management node and head nodes

Important
Table 5 provides a BOM for one HP ProLiant DL360 Gen9 server. The minimum recommended solution and the tested solution both feature one management server and two head nodes, requiring a total of three HP ProLiant DL360 Gen9 servers.

Table 5. BOM for a single HP ProLiant DL360 Gen9 server (quantity, part number, description)
B21 HP DL360 Gen9 8SFF CTO Server
L21 HP DL360 Gen9 E5-2640v3 FIO Kit
B21 HP DL360 Gen9 E5-2640v3 Kit
B21 HP 16GB 2Rx4 PC4-2133P-R Kit
B21 HP 900GB 6G SAS 10K 2.5in SC ENT HDD
B21 HP Smart Array P440ar/2G FIO Controller
B21 HP Ethernet 10Gb 2-port 546FLR-SFP+ Adapter
B21 HP 1U SFF Easy Install Rail Kit
B21 HP 500W FS Plat Ht Plg Pwr Supply Kit
B21 HP DL360 Gen9 Serial Cable
B21 HP DL360 Gen9 SFF Sys Insight Dsply Kit
B21 HP QSFP/SFP+ Adaptor Kit
2 AF595A HP 3.0M, Blue, CAT6 STP, Cable Data

Compute nodes

Important
Table 6 provides a BOM for one HP Apollo 2000r chassis with four HP ProLiant XL170r nodes. A minimum of 12 HP ProLiant XL170r nodes (three HP Apollo 2000r chassis) is recommended for the compute module.
Table 6. BOM for a single HP Apollo 2000r chassis with 4 HP ProLiant XL170r nodes (quantity, part number, description)
B21 HP Apollo r SFF CTO Chassis
B21 HP Apollo 2000 FAN-module Kit
B21 HP ProLiant XL170r Gen9 CTO Svr
L21 HP XL1x0r Gen9 E5-2697v3 FIO Kit (14C, 2.6GHz)
B21 HP XL1x0r Gen9 E5-2697v3 Kit
B21 HP 32GB 2Rx4 PC4-2133P-R Kit
B21 HP Ethernet 10Gb 2-port 546FLR-SFP+ Adapter
B21 HP 480GB 6Gb SATA RI-3 SFF SC SSD
B21 HP XL1x0r Gen9 LP PCIex16 L Riser Kit
B21 HP XL170r FLOM x8 R Riser Kit
B21 HP XL170r/190r Dedicated NIC IM Board Kit
B21 HP XL170r Mini-SAS B140 Cbl Kit
B21 HP 1400W FS Plat Pl Ht Plg Pwr Spply Kit
B21 HP t2500 Strap Shipping Bracket
B21 HP DL2000 Hardware Rail Kit
8 AF595A HP 3.0M, Blue, CAT6 STP, Cable Data
8 JG327A HP X240 40G QSFP+ QSFP+ 3m DAC Cable (2 per B21)

Alternate compute nodes
Appendix A: Table A-4 provides a BOM for one HP Moonshot 1500 chassis with 45 m710p server cartridges. The minimum recommended solution features one Moonshot chassis with a total of 45 m710p server cartridges.
Storage nodes

Important
Table 7 provides a BOM for one HP Apollo 4200 Gen9 server. The minimum recommended solution features four HP Apollo 4200 Gen9 servers.

Table 7. BOM for a single HP Apollo 4200 Gen9 server (quantity, part number, description)
B21 HP Apollo 4200 Gen9 24LFF CTO Svr
L21 HP Apollo 4200 Gen9 E5-2660v3 FIO Kit
B21 HP Apollo 4200 Gen9 E5-2660v3 Kit
B21 HP 16GB 2Rx4 PC4-2133P-R Kit
B21 HP Apollo 4200 Gen9 LFF Rear HDD Cage Kit
B21 HP 4TB 6G SATA 7.2k 3.5in MDL LP HDD
B21 HP SAS Controller Mode for Rear Storage
B21 HP IB FDR/EN 40Gb 2P 544_FLR-QSFP Adapter
B21 HP 800W FS Plat Hot Plug Power Supply Kit
B21 HP Apollo 4200 Gen9 IM Card Kit
B21 HP 120GB RI Solid State M.2 Kit
B21 HP Apollo 4200 Gen9 Redundant Fan Kit
B21 HP 2U Shelf-Mount Adjustable Rail Kit
2 JG327A HP X240 40G QSFP+ QSFP+ 3m DAC Cable
2 AF595A HP 3.0M, Blue, CAT6 STP, Cable Data

Networking
Table 8 provides a BOM for two ToR switches, as featured in the tested configuration.

Table 8. BOM for two HP 5930 switches (ToR)
2 JG726A HP FF QSFP+ Switch
4 JG553A HP X712 Bck(pwr)-Frt(prt) HV Fan Tray
4 JC680A HP A58x0AF 650W AC Power Supply
2 JG329A HP X240 QSFP+ 4x10G SFP+ 1m DAC Cable
6 JG330A HP X240 QSFP+ 4x10G SFP+ 3m DAC Cable
6 JG326A HP X240 40G QSFP+ QSFP+ 1m DAC Cable
Table 9 provides a BOM for one HP iLO switch, as featured in the tested configuration.

Table 9. BOM for one HP 5900 switch (HP iLO and 1GbE link)
1 JG510A HP 5900AF-48G-4XG-2QSFP+ Switch
2 JC680A HP A58x0AF 650W AC Power Supply
2 JC682A HP 58x0AF Bck(pwr)-Frt(ports) Fan Tray

Other hardware

Important
Quantities listed in Table 10 are based on a rack with three switches, 12 compute nodes, and four storage nodes.

Table 10. BOM for a single rack with four PDUs
1 BW908A HP mm Shock Intelligent Rack
1 BW946A HP 42U Location Discovery Kit
1 BW930A HP Air Flow Optimization Kit
1 TK821A HP CS Side Panel 1200mm Kit
1 TK816A HP CS Rack Light Kit
1 TK815A HP CS Rack Door Branding Kit
1 BW891A HP Rack Grounding Kit
4 AF520A HP Intelligent Mod PDU 24a Na/Jpn Core
6 AF547A HP 5xC13 Intlgnt PDU Ext Bars G2 Kit
2 C7536A HP Ethernet 14ft CAT5e RJ45 M/M Cable

Recommended service components
Table 11 provides a BOM for three recommended service components: Factory Express Build, TS Consulting, and HP Big Data Reference Architecture.

Table 11. Recommended service components
-- HA453A1 HP Factory Express Level 3 Service (recommended)
1 H8E04A1 HP Hadoop Custom Consulting Service (recommended)
1 P6L57A HP Big Data Reference Architecture (required)
Software options

Important
Quantities listed in Table 12 may vary. The quantities below are based on a rack with three management/head nodes, 12 ProLiant XL170r compute nodes, and four storage nodes.

Table 12. BOM for software options
19 E6U59ABE HP iLO Adv incl 1yr TS U E-LTU
19 QL803BAE HP Insight CMU 1yr 24x7 Flex E-LTU
19 G3J29AAE RHEL Svr 2 Sckt/2 Gst 1yr 9x5 E-LTU

Hortonworks software

Important
Table 13 provides the BOM for the Hortonworks license and support. For the base three-chassis, 12-node XL170r configuration, one license covers one Apollo 2000r chassis of compute nodes. For the Moonshot compute option, one Hortonworks license covers up to 16 Moonshot m710p cartridges; therefore, three of these licenses will cover a fully populated Moonshot 1500 chassis. While HP is a certified reseller of Hortonworks software subscriptions, all application support (level one through level three) is provided by Hortonworks.

Table 13. BOM for Hortonworks software
-- F5Z52A Hortonworks Data Platform Enterprise 4 Nodes or 50TB Raw Storage 1 year 24x7 Support LTU

Summary

This white paper described a new implementation of the HP Big Data Reference Architecture deploying Apache Spark (Spark) on Hortonworks HDP 2.x using purpose-selected HP hardware components: the Apollo 2000 compute system and the HP Apollo 4200 storage servers. Demonstrable benefits of using Spark in the HP Big Data Reference Architecture (HP BDRA) were given, along with guidance on selecting the appropriate configuration for a Spark-focused Hadoop cluster to meet particular business needs. In addition to simplifying the procurement process, the paper also provided guidelines for configuring HDP and Spark once the system has been deployed, in order to optimize performance.

The HP BDRA for Spark is a unique, asymmetric Hadoop cluster architecture with some nodes dedicated to computation and others dedicated to the Hadoop Distributed File System (HDFS). Computation is performed on nodes optimized for that work, and file system operations are executed on nodes that are similarly optimized for their work. For additional information on this architecture and the benefits it provides, please consult the resources listed at the end of this paper.

Implementing a proof-of-concept

As a matter of best practice for all deployments, HP recommends implementing a proof-of-concept using a test environment that matches as closely as possible the planned production environment. In this way, appropriate performance and scalability characterizations can be obtained. For help with a proof-of-concept, contact an HP Services representative or your HP partner.
Appendix A: Alternate compute node components

This appendix provides BOMs for alternate processors, memory, and disk drives for the Apollo 2000 servers used as compute nodes.

Table A-1. Alternate processors, Apollo 2000 ProLiant XL170r (quantity per node, part number, description)
L21 HP XL1x0r Gen9 E5-2680v3 FIO Kit (12C, 2.5 GHz)
B21 HP XL1x0r Gen9 E5-2680v3 Kit
L21 HP XL1x0r Gen9 E5-2660v3 FIO Kit (10C, 2.6 GHz)
B21 HP XL1x0r Gen9 E5-2660v3 Kit
L21 HP XL1x0r Gen9 E5-2640v3 FIO Kit (8C, 2.6 GHz)
B21 HP XL1x0r Gen9 E5-2640v3 Kit

Table A-2. Alternate memory, Apollo 2000 ProLiant XL170r (quantity per node, part number, description)
4/8/ B21 HP 32GB 2Rx4 PC4-2133P-R Kit
8/ B21 HP 16GB 2Rx4 PC4-2133P-R Kit
B21 HP 8GB 2Rx8 PC4-2133P-R Kit

Table A-3. Alternate disk drives, Apollo 2000 ProLiant XL170r (quantity per node, part number, description)
2/4/ B21 HP 800GB 6G SATA MU-2 SFF SC SSD
2/4/ B21 HP 1.92TB 6G SATA MU-3 SFF SC SSD
2/4/ B21 HP 1.6TB 6G SATA VE 2.5in SC EV SSD
Table A-4. BOM for a single HP Moonshot chassis with 45 m710p server cartridges (quantity, part number, description)
B21 HP Moonshot 1500 Chassis Opt OS
B21 HP 1500W Ht Plg Pwr Supply Kit
B21 HP Moonshot-4QSFP+ Uplink Module Kit
B21 HP Moonshot-45XGc Switch Kit
B21 HP 4.3U Rail Kit
B21 HP 0.66U Spacer Blank Kit
B21 HP ProLiant m710p Server Cartridge
B21 HP Moonshot 480G SATA VE M FIO Kit
6 JG327A HP X240 40G QSFP+ QSFP+ 3m DAC Cable
1 JG326A HP X240 40G QSFP+ QSFP+ 1m DAC Cable
3 AF595A HP 3.0M, Blue, CAT6 STP, Cable Data

Appendix B: Alternate storage node components

This appendix provides BOMs for alternate processors, memory, and disk drives for the Apollo 4200 servers used as storage nodes.

Table B-1. Alternate processors, Apollo 4200 (quantity per node, part number, description)
L21 HP Apollo 4200 Gen9 E5-2697v3 FIO Kit (14C, 2.6GHz)
B21 HP Apollo 4200 Gen9 E5-2697v3 Kit
L21 HP Apollo 4200 Gen9 E5-2680v3 FIO Kit (12C, 2.5 GHz)
B21 HP Apollo 4200 Gen9 E5-2680v3 Kit
L21 HP Apollo 4200 Gen9 E5-2640v3 FIO Kit (8C, 2.6 GHz)
B21 HP Apollo 4200 Gen9 E5-2640v3 Kit

Table B-2. Alternate memory, Apollo 4200 (quantity per node, part number, description)
B21 HP 32GB 2Rx4 PC4-2133P-R Kit
B21 HP 4GB 1Rx8 PC4-2133P-R Kit
B21 HP 8GB 2Rx8 PC4-2133P-R Kit
Table B-3. Alternate disk drives, Apollo 4200 (quantity per node, part number, description)
B21 HP 6TB 6G SATA 7.2K 3.5in LP MDL HDD
B21 HP 3TB 6G SATA 7.2k 3.5in MDL LP HDD
B21 HP 8TB 6G SATA 7.2k 3.5in MDL LP HDD

Appendix C: HP value added services and support

To help you jump-start your Hadoop solution development, HP offers a range of big data services, which are outlined in this appendix.

Factory Express Services
Factory-integration services are available for customers seeking a streamlined deployment experience. With the purchase of Factory Express services, your Hadoop cluster will arrive racked and cabled, with software installed and configured per an agreed-upon custom statement of work, for the easiest deployment possible. Contact HP Technical Services for more information and for assistance with a quote.

Technical Services Consulting: Reference Architecture Implementation Service for Hadoop (Hortonworks)
With the HP Reference Architecture Implementation Service for Hadoop, HP can install, configure, deploy, and test a Hadoop cluster that is based on HP BDRA. Experienced consultants implement all the details of the original Hadoop design: naming, hardware, networking, software, administration, backup, disaster recovery, and operating procedures. Where options exist, or the best choice is not clear, HP works with you to configure the environment to meet your goals and needs. HP also conducts an acceptance test to validate that the system is operating to your satisfaction.

Technical Services Consulting: Big Data Services
HP Big Data Services can help you reshape your IT infrastructure to corral increasing volumes of data from emails, social media, and website downloads and convert them into beneficial information. These services encompass strategy, design, implementation, protection, and compliance. Delivery is in the following three steps:

1. Architecture strategy: HP defines the functionalities and capabilities needed to align your IT with your big data initiatives. Through transformation workshops and roadmap services, you'll learn to capture, consolidate, manage, and protect business-aligned information, including structured, semi-structured, and unstructured data.

2. System infrastructure: HP designs and implements a high-performance, integrated platform to support a strategic architecture for big data. Choose from design and implementation services, reference architecture implementations, and integration services. Your flexible, scalable infrastructure will support big data variety, consolidation, analysis, sharing, and search on HP platforms.

3. Data protection: Ensure the availability, security, and compliance of your big data systems. HP can help you safeguard your data and achieve regulatory compliance and lifecycle protection across your big data landscape, while also enhancing your approach to backup and business continuity.

For additional information, visit hp.com/services/bigdata.

HP Support options
HP offers a variety of support levels to meet your needs. More information is provided below.

HP Support Plus 24
HP can provide integrated onsite hardware/software support services, available 24x7x365, including access to HP technical resources, four-hour response onsite hardware support, and software updates.

HP Proactive Care
HP Proactive Care provides all of the benefits of proactive monitoring and reporting, along with rapid reactive support through HP's expert reactive support specialists.
You can customize your reactive support level by selecting either six-hour call-to-repair or 24x7 coverage with four-hour onsite response.
HP Proactive Care helps prevent problems, resolve problems faster, and improve productivity. Through analysis, reports, and update recommendations, you can identify and address IT problems before they cause performance issues or outages.

HP Proactive Care with the HP Personalized Support Option
Adding the Personalized Support Option to HP Proactive Care is highly recommended. This option builds on the benefits of HP Proactive Care Service, providing an assigned Account Support Manager who knows your environment and can deliver support planning, regular reviews, and technical and operational advice specific to your environment.

HP Proactive Select
To address your ongoing and changing needs, HP recommends adding Proactive Select credits, which provide tailored support options from a wide menu of services that can help you optimize the capacity, performance, and management of your environment. These credits may also be used for assistance in implementing solution updates. As your needs change over time, you have the flexibility to choose the services best suited to your current challenges.

HP Datacenter Care
HP Datacenter Care provides a more personalized, customized approach for large, complex environments, offering a single solution for reactive, proactive, and multi-vendor support needs. You may also choose the Defective Media Retention (DMR) option.

Other offerings
HP highly recommends HP Education Services (customer training and education) and additional Technical Services, as well as in-depth installation or implementation services when needed.

More information
For additional information, visit:
HP Education Services:
HP Technology Consulting Services: hp.com/services/bigdata
HP Deployment Services: hp.com/services/deployment
For more information

Hortonworks: hortonworks.com
Hortonworks partner site: hortonworks.com/partner/hp/
Hortonworks HDP 2.3:
HP Solutions for Apache Hadoop: hp.com/go/hadoop
HP Insight Cluster Management Utility (Insight CMU): hp.com/go/cmu
HP 5930 Switch Series: hp.com/networking/5930
HP ProLiant servers: hp.com/go/proliant
HP Apollo 2000 System: hp.com/go/apollo2000
HP Apollo Systems: hp.com/go/apollo
QuickSpecs for the Apollo 2000 System
QuickSpecs for the HP ProLiant XL170r Gen9 Server
HP Moonshot system: hp.com/go/moonshot
HP Enterprise Software: hp.com/go/software
HP Networking: hp.com/go/networking
HP Integrated Lights-Out (HP iLO): hp.com/servers/ilo
HP Product Bulletin (QuickSpecs): hp.com/go/quickspecs
HP Services: hp.com/go/services
HP Support and Drivers: hp.com/go/support
HP Systems Insight Manager (HP SIM): hp.com/go/hpsim

To help us improve our documents, please provide feedback at hp.com/solutions/feedback.

About Hortonworks
Hortonworks develops, distributes, and supports the only 100% open source Apache Hadoop data platform. Our team comprises the largest contingent of builders and architects within the Hadoop ecosystem who represent and lead the broader enterprise requirements within these communities. The Hortonworks Data Platform provides an open platform that deeply integrates with existing IT investments and upon which enterprises can build and deploy Hadoop-based applications. Hortonworks has deep relationships with the key strategic data center partners that enable our customers to unlock the broadest opportunities from Hadoop. For more information, visit hortonworks.com.

Sign up for updates: hp.com/go/getupdated

© Copyright 2015 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice. The only warranties for HP products and services are set forth in the express warranty statements accompanying such products and services. Nothing herein should be construed as constituting an additional warranty. HP shall not be liable for technical or editorial errors or omissions contained herein.

Oracle and Java are registered trademarks of Oracle and/or its affiliates. Red Hat is a registered trademark of Red Hat, Inc. in the United States and other countries. Linux is the registered trademark of Linus Torvalds in the U.S. and other countries. Intel and Xeon are trademarks of Intel Corporation in the U.S. and other countries.

4AA6-2682ENW, October 2015