Lenovo Big Data Reference Architecture for MapR Distribution including Apache Hadoop
Last update: 17 September 2015
Configuration reference number: BDAMAPRXX53

MapR delivers on the promise of Hadoop with a proven, enterprise-grade platform
MapR brings unprecedented dependability, ease of use, and world-record speed to Hadoop
Uses the powerful, versatile new Lenovo System x3650 M5 server
Leading security and reliability
Innovative energy efficiency and performance

He Da Shuang (Lenovo)
Ajay Dholakia (Lenovo)
James Sun (MapR)
Table of Contents

1 Introduction
2 Business problem and business value
   2.1 Business problem
   2.2 Business value
3 Requirements
   3.1 Functional requirements
   3.2 Non-functional requirements
4 Architectural overview
5 Component model
   5.1 MapR Services
      5.1.1 MapR data services
      5.1.2 MapR management services
      5.1.3 Other optional MapR services
   5.2 MapR Components
      5.2.1 MapR Management Server
      5.2.2 MapR File System
      5.2.3 MapR Database
   5.3 Open Source Projects
6 Operational model
   6.1 Hardware description
      6.1.1 Lenovo System x3650 M5 Server
      6.1.2 Lenovo RackSwitch G8052
      6.1.3 Lenovo RackSwitch G8272
   6.2 Cluster nodes
   6.3 Systems management
   6.4 Networking
      Data network
      Hardware management network
      Multi-rack network
   6.5 Predefined cluster configurations
7 Deployment considerations
   Increasing cluster performance
   Designing for lower cost
   Designing for high ingest rates
   Estimating disk space
   Scaling considerations
   High availability considerations
      Networking considerations
      Hardware availability considerations
      Storage availability
      Software availability considerations
   Migration considerations
Appendix: Bill of materials
   Node
   Administration / Management network switch
   Data network switch
   Rack
   Cables
Acknowledgements
Resources
Trademarks and special notices
1 Introduction
This document describes the reference architecture for the big data solution based on MapR Distribution including Apache Hadoop. It provides a predefined and optimized hardware infrastructure for MapR Enterprise Database Edition, a Hadoop distribution with value-added capabilities that is produced by MapR Technologies. This reference architecture provides the planning, design considerations, and best practices for implementing MapR Enterprise Database Edition with Lenovo products. The Lenovo and MapR teams worked together on this document, and the reference architecture that is described herein was validated by both Lenovo and MapR.
MapR brings the power of Hadoop to the enterprise. MapR Enterprise Database Edition is a complete Hadoop distribution supporting YARN, NoSQL, and Hadoop ecosystem workloads. MapR is the only distribution that is built from the ground up for business-critical production applications. MapR packages over 20 projects from the Hadoop open source ecosystem to provide a broad set of big data capabilities. The MapR platform not only provides enterprise-grade features such as high availability, disaster recovery, security, and full data protection, but also allows Hadoop to be easily accessed as traditional network-attached storage (NAS) with full read-write capabilities.
The predefined configuration provides a baseline for a big data solution that can be modified based on specific customer requirements, such as lower cost, improved performance, and increased reliability. The intended audience of this document is IT professionals, technical architects, sales engineers, and consultants, to assist in planning, designing, and implementing the big data solution with Lenovo System x hardware. It is assumed that you are familiar with Apache Hadoop components and capabilities.
For more information about Hadoop, see the Resources section.
2 Business problem and business value
This section describes the business problem that is associated with big data environments and the value that is offered by the MapR solution that uses Lenovo hardware.
2.1 Business problem
By 2009, the world had generated 800 billion GB of data, a level that is expected to increase to 40 trillion GB. In all, 90% of the data in the world today was created in the last two years alone. This data comes from everywhere, including sensors that are used to gather climate information, posts to social media sites, digital pictures and videos, purchase transaction records, and cell phone global positioning system (GPS) signals. This data is big data.
Big data spans the following dimensions:
Volume: Big data comes in one size: large. Enterprises are awash with data, easily amassing terabytes and even petabytes of information.
Velocity: Often time-sensitive, big data must be used as it streams into the enterprise to maximize its value to the business.
Variety: Big data extends beyond structured data to include unstructured data of all varieties, such as text, audio, video, click streams, and log files.
Big data is more than a challenge; it is an opportunity to find insight into new and emerging types of data to make your business more agile. Big data is also an opportunity to answer questions that, in the past, were beyond reach. Until now, there was no effective way to harvest this opportunity.
2.2 Business value
MapR provides a Hadoop-based big data solution that is easy to manage, dependable, and fast. MapR reduces the complexity of setting up and managing Hadoop, and provides alerts, alarms, and insights through an advanced graphical interface. The MapR Heatmap provides a clear view of cluster health and performance, and MapR volumes simplify data security through features such as wire-level encryption, data accounting and auditing, retention, placement, and quota management. MapR provides Direct Access NFS.
This allows users to mount the entire Hadoop cluster as an NFS volume, which simplifies how an application can write to and read from a Hadoop cluster. MapR provides a high level of availability, including support for rolling upgrades, self-healing, and automated stateful failover. MapR also provides dependable data storage with full data protection and business continuity features, including snapshots and mirroring. Enhanced security features in MapR provide strong user authentication and authorization. Data confidentiality and integrity are also enforced, whether data is in motion or at rest in the Hadoop cluster. Unique MapR functions improve MapReduce throughput. MapR deployed on Lenovo System x servers with Lenovo networking components provides superior performance, reliability, and scalability. The reference architecture supports entry-level through high-end configurations and the ability to easily scale as the use of big data grows. A choice of infrastructure components provides flexibility in meeting varying big data analytics requirements.
3 Requirements
The functional and non-functional requirements for the MapR reference architecture are described in this section.
3.1 Functional requirements
A big data solution supports the following key functional requirements:
Various application types, including batch and real-time analytics
Industry-standard interfaces, so that existing applications can work
Real-time streaming and processing of data
Various data types and databases
Various client interfaces
Large volumes of data
3.2 Non-functional requirements
Customers require their big data solution to be easy, dependable, and fast. The following non-functional requirements are key:
Easy:
o Ease of development
o Easy management at scale
o Advanced job management
o Multi-tenancy
Dependable:
o Data protection with snapshots and mirroring
o Automated self-healing
o Insight into software and hardware health and issues
o High availability (HA) and business continuity (99.999% uptime)
Fast:
o Superior performance
o Scalability
Secure:
o Strong authentication and authorization
o Kerberos support
o Data confidentiality and integrity
o Comprehensive data security auditing
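The 99.999% uptime figure above translates into a concrete downtime budget. As a quick illustration (a back-of-the-envelope Python sketch, not part of any MapR or Lenovo tooling), an availability target can be converted into the maximum downtime allowed per year:

```python
# Convert an availability target (percent uptime) into the maximum
# downtime allowed per year, in minutes.

MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes

def max_downtime_minutes(availability_pct: float) -> float:
    """Maximum downtime per year, in minutes, for a given uptime target."""
    return MINUTES_PER_YEAR * (1 - availability_pct / 100)

for target in (99.9, 99.99, 99.999):
    print(f"{target}% uptime allows {max_downtime_minutes(target):.2f} min/year")
```

At five nines, the budget is roughly 5.3 minutes of downtime per year, which is why features such as automated self-healing and stateful failover matter for this requirement.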
4 Architectural overview
Figure 1 shows the main features of the MapR reference architecture, which uses Lenovo hardware.
Figure 1: MapR architecture overview
Users can log in to the MapR client from outside the firewall by using Secure Shell (SSH) on port 22 to access the MapR solution in the corporate network. MapR provides several interfaces that allow administrators and users to perform administration and data functions depending on their roles and access levels. Hadoop application programming interfaces (APIs) can be used to access data. MapR APIs can be used for cluster management and monitoring. MapR data services, management services, and other services run on the nodes in the cluster. Storage is a component of each node in the cluster. Data can be ingested into the MapR file system (MapR-FS) through either the Hadoop APIs or NFS, depending on the needs of the customer.
5 Component model
The MapR Distribution provides features and capabilities that meet the functional and non-functional requirements of customers. MapR is a complete distribution that includes Hadoop components such as Drill, Spark, Impala, Pig, Hive, Mahout, Cascading, Sqoop, Flume, Storm, and more. The MapR Distribution is 100% API-compatible with Hadoop (MapReduce, HDFS, and HBase). The MapR Distribution has high availability and data protection features, such as ResourceManager HA, No-NameNode high availability, snapshots, and mirroring. MapR supports many Linux operating systems; Red Hat Linux and SUSE Linux are supported with the Lenovo System x reference architecture. Details about the supported versions can be found at: doc.mapr.com/display/mapr/os+support+matrix
MapR supports mission-critical and real-time big data analytics across different industries. MapR is used across financial services, retail, media, healthcare, manufacturing, telecommunications, and government organizations, and by leading Fortune 100 and Web 2.0 companies. The MapR Distribution can be used for a variety of use cases, from batch applications that use MapReduce or Spark with data sources such as click streams, to real-time applications that use sensor data. The MapR Distribution integrates a growing set of functions, including Spark, MapReduce, file-based applications, interactive SQL, NoSQL databases, search and discovery, and real-time stream processing. With MapR, data does not need to be moved to specialized silos for processing; data can be processed in place. This full range of applications and data sources benefits from the MapR enterprise-grade platform and unified architecture for files and tables. The MapR Distribution provides high availability, data protection, and disaster recovery to support mission-critical applications. The MapR Distribution also makes it easier to use existing applications and solutions by supporting industry-standard interfaces, such as NFS.
To support a diverse set of applications and users, MapR also provides multi-tenancy features and volume support. These features include support for heterogeneous hardware within a cluster, and data and job placement control so that applications can be selectively run in a cluster to take advantage of faster processors or solid-state drives (SSDs). Figure 2 shows the key components of MapR.
Figure 2: MapR key components
The MapR product consists of the following components:
MapReduce: MapR provides high performance for MapReduce operations on Hadoop and publishes performance benchmarking results. The MapR architecture is built in C/C++ and harnesses distributed metadata with an optimized shuffle process, enabling MapR to deliver consistently high performance. Both the classic MapReduce and YARN frameworks are supported by MapR.
File-based applications: MapR is a 100% Portable Operating System Interface (POSIX)-compliant system that fully supports random read-write operations. By supporting the industry-standard Network File System (NFS), users can mount a MapR cluster and run any file-based application, written in any language, directly on the data residing in the cluster. All standard tools in the enterprise, including browsers, UNIX tools, spreadsheets, and scripts, can access the cluster directly without any modifications.
SQL-on-Hadoop: A number of applications support SQL access against data contained in MapR, including Hive, Hadapt, and others. MapR is also leading the development of Apache Drill, which brings American National Standards Institute (ANSI) SQL capabilities to Hadoop. Drill enables data analysts to perform self-exploratory tasks on any type of data stored in Hadoop, including JSON and Parquet files.
NoSQL Database: MapR has removed the trade-offs that organizations face when looking to deploy a NoSQL solution. Specifically, MapR-DB delivers ease of use, dependability, and performance advantages for HBase applications. MapR provides scalability, strong consistency, reliability, and continuous low latency with an architecture that eliminates compaction delays and background consistency corrections (anti-entropy).
Search: MapR integrates enterprise-grade search with Apache Solr and Elasticsearch. On a single platform, customers can perform predictive analytics, full search and discovery, and advanced database operations. The MapR enterprise-grade search capability works directly on Hadoop data, but can also index and search standard files without any conversion or transformation.
Stream processing: MapR provides a simplified architecture for real-time stream computational engines such as Spark Streaming and Storm. Streaming data feeds can be written directly to the MapR platform for Hadoop for long-term storage and MapReduce processing.
Security: MapR provides several platform-level security features that can be leveraged by the ecosystem projects running on MapR. For authentication, Kerberos integration is supported, as well as a native username/password mechanism. Also, Linux Pluggable Authentication Module (PAM) support lets you integrate with the widest range of user registries. For authorization, MapR Access Control Expressions (ACEs) give you fine-grained permissions at the column, column family, and table levels in MapR-DB. Access to files in MapR-FS can be controlled via industry-standard UNIX permissions, though ACEs will soon also be available to provide more flexible, fine-grained access control. Comprehensive auditing is available to log all administrative actions and data accesses, to analyze user behavior in a MapR cluster, and to achieve regulatory compliance. All data connections over the network in a MapR cluster can be encrypted (encryption in motion), and complementary technologies, such as LUKS and self-encrypting drives, can protect data at rest, particularly in the event of the theft of physical storage devices. Other data-at-rest techniques, such as masking, tokenization, and format-preserving encryption, are available from MapR partners.
5.1 MapR Services
The MapR Enterprise Database Edition provides data services, management services, and other services.
5.1.1 MapR data services
All cluster nodes can run data (worker) services, which run the MapReduce tasks over a subset of the data. Cluster nodes also store the MapR File System (MapR-FS) data. The MapR data services include the following:
MapR-FS: Provides distributed file services.
NodeManager: Manages resources on a single node and provides resources for applications.
5.1.2 MapR management services
MapR management services can run on any node. In multi-rack configurations, these services should be on nodes spread across racks. The MapR management services include the following:
Container Location Database (CLDB): Manages and maintains container location information and replication.
ResourceManager: Manages and arbitrates resources among all the applications in the system.
ZooKeeper: Coordinates activity and keeps track of management service locations on the cluster.
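The guidance above, that management services such as CLDB and ZooKeeper should be spread across racks in multi-rack configurations, can be sketched as a simple round-robin placement. This is an illustrative Python sketch only; the function, node names, and rack layout are hypothetical, and actual service placement is done through the MapR installer and role configuration:

```python
from itertools import cycle

def place_across_racks(service, replicas, racks):
    """Round-robin one service's replicas across racks so that the loss
    of a single rack cannot take out all instances of the service.
    `racks` maps a rack id to a list of candidate node hostnames."""
    placements = []
    rack_cycle = cycle(list(racks.items()))
    used = {rack: 0 for rack in racks}  # how many nodes taken per rack
    for _ in range(replicas):
        rack, hosts = next(rack_cycle)
        placements.append((service, rack, hosts[used[rack] % len(hosts)]))
        used[rack] += 1
    return placements

racks = {"rack1": ["node01", "node02"],
         "rack2": ["node11", "node12"],
         "rack3": ["node21", "node22"]}

# Three ZooKeeper instances land on three different racks:
print(place_across_racks("zookeeper", 3, racks))
```

With three racks and three ZooKeeper instances, each rack hosts exactly one instance, so a single rack failure still leaves a two-node ZooKeeper majority.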
5.1.3 Other optional MapR services
MapR offers two other optional services that can be run on multiple nodes in the cluster:
NFS server (Gateway): Provides NFS access to the distributed file system. The NFS server is often run on all nodes in the cluster to allow local mounting of the cluster file system from any node.
WebServer: Provides the MapR Control System (MCS) graphical user interface (GUI) and REST API.
MapR provides support for many client interfaces, several of which were described in the architecture overview or feature list. Open Database Connectivity (ODBC) and Java Database Connectivity (JDBC) can be used to access data in the MapR Hadoop cluster. The MapR CLI provides an additional way to manage the cluster and services.
5.2 MapR Components
MapR Enterprise Database Edition includes the following value-added components: Management Server, File System, and Database.
5.2.1 MapR Management Server
The MapR Management Server, also known as the MapR Control System (MCS), provides full visibility into cluster resources and activity.
MapR Control System Heatmap: The MCS dashboard includes the MapR Heatmap, which provides visual insight into node health, service status, and resource utilization, organized by the cluster topology (for example, data centers and racks). Designed to manage large clusters with thousands of nodes, the MapR Heatmap shows the health of the entire cluster at a glance. Because the number of nodes, files, and volumes can be very high, filters and group actions are provided to select specific components and perform administrative actions directly. The Heatmap interfaces are designed for managing the smallest to the largest clusters, and command-line interface (CLI) and Representational State Transfer (REST) access are also included.
5.2.2 MapR File System
MapR File System (MapR-FS) is a fully read-write distributed file system that eliminates the NameNode that is associated with cluster failure in other Hadoop distributions.
MapR re-engineered the Hadoop Distributed File System (HDFS) architecture to provide flexibility, increase performance, and enable special features for data management and high availability.
MapR No-NameNode high availability: The MapR Distribution is unique because it was designed for high availability. MapR is the only Hadoop distribution designed with no single point of failure. Other Hadoop distributions have a single primary NameNode, which results in entire-cluster unavailability should that NameNode fail. When other distributions are configured with standby NameNodes, the entire cluster becomes unavailable until the failover to a standby NameNode is completed. With MapR, file metadata is replicated and distributed as part of the data nodes, so there is no data loss or downtime even in the face of multiple drive or node failures.
MapR storage services: MapR stores data in a distributed shared system that eliminates contention and the expense of data transport and retrieval. Automatic, transparent client-side compression reduces network usage and footprint on disk, while direct block device I/O provides
throughput at hardware speed without additional resources. Unlike other Hadoop distributions, with MapR you can read files while they are still being written. The MapR No-NameNode architecture scales linearly with the number of nodes, providing unlimited file support. You simply add nodes to increase the number of files supported to more than a trillion files containing over 1,000 PB of data.
MapR Direct Access NFS: MapR Direct Access NFS makes Hadoop radically easier and less expensive to use by letting the user mount the Hadoop file system from a standard NFS client. Unlike the append-only systems found in other Hadoop distributions, MapR allows files to be modified and overwritten, and enables multiple concurrent reads and writes on any file. Users can browse files, automatically open associated applications with a mouse click, or drag files and directories into and out of the cluster. Additionally, standard command-line tools and UNIX applications and utilities (such as grep, tar, sort, and tail) can be used directly on data in the cluster. With other Hadoop distributions, the user must copy the data out of the cluster in order to use standard tools.
MapR snapshots: MapR provides snapshots that are atomic and transactionally consistent. MapR snapshots provide protection from user and application errors, with flexible schedules to accommodate a range of recovery point objectives. MapR snapshots can be scheduled or performed on demand. Recovering from a snapshot is as easy as dragging the directory or files to the current directory. MapR snapshots offer high performance and space efficiency. No data is copied in order to create a snapshot; as a result, a snapshot of a petabyte volume can be performed in seconds. A snapshot operation does not have any impact on write performance, because MapR uses redirect-on-write to implement snapshots. All writes in MapR go to new blocks on disk.
This means that a snapshot retains references to the old blocks and does not require copying data blocks.
MapR volumes: MapR volumes make cluster data both easy to access and easy to manage by creating logical partitions. This lets you organize related files and directories into a single tree structure so that they can be easily managed and secured. MapR volumes provide the ability to apply policies, including the following: replication factor, scheduled mirroring, scheduled snapshots, data placement and topology control, quotas and usage tracking, and administrative permissions. Volumes are a key component for achieving true multi-tenancy in a Hadoop cluster, in which distinct data sets, user groups, and applications can be kept isolated from each other.
MapR mirroring: MapR makes remote data replication easy and built-in. Going far beyond the file-copy techniques of other Hadoop distributions, MapR mirroring means that you can set policies around your recovery point objectives (RPO) and recovery time objectives (RTO). You can mirror your data automatically within your cluster, between clusters (such as between a production cluster and a research cluster), or between sites. MapR mirroring provides a solid foundation for disaster recovery.
5.2.3 MapR Database
MapR-DB is implemented directly within MapR-FS, yielding a familiar and open-standards API that provides a high-performance data store for tables. MapR-FS is written in C and optimized for performance. As a result, MapR-FS runs significantly faster than JVM-based Apache HBase. MapR-FS enables you to create and manipulate tables in many of the same ways that you create and manipulate files in a standard UNIX file system. MapR's implementation of the HBase API provides enterprise-grade high availability (HA), data protection, and disaster recovery features for tables on a distributed Hadoop cluster.
MapR tables can be used as the underlying key-value store for Hive, or any other application requiring a high-performance, high-availability key-value data store.
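The redirect-on-write snapshot mechanism described in the MapR File System section above can be illustrated with a toy in-memory block store. This is purely a conceptual Python sketch (MapR-FS is implemented in C/C++, and the class and method names here are invented for illustration): a snapshot copies only the file-to-block references, and later writes allocate new blocks instead of overwriting the blocks the snapshot still points to.

```python
class ToyVolume:
    """Toy redirect-on-write store: files map to immutable block ids."""

    def __init__(self):
        self.blocks = {}      # block id -> data (never mutated in place)
        self.files = {}       # file name -> current block id
        self.snapshots = {}   # snapshot name -> {file name: block id}
        self._next_block = 0

    def write(self, name, data):
        # Redirect-on-write: every write goes to a NEW block.
        self.blocks[self._next_block] = data
        self.files[name] = self._next_block
        self._next_block += 1

    def snapshot(self, snap_name):
        # Metadata-only operation: record references, copy no data blocks.
        self.snapshots[snap_name] = dict(self.files)

    def read(self, name, snap_name=None):
        table = self.snapshots[snap_name] if snap_name else self.files
        return self.blocks[table[name]]

vol = ToyVolume()
vol.write("/data/log", "v1")
vol.snapshot("daily")
vol.write("/data/log", "v2")            # allocates a new block
print(vol.read("/data/log"))            # v2 (live view)
print(vol.read("/data/log", "daily"))   # v1 (snapshot is unchanged)
```

Because the snapshot step copies references rather than data, its cost is independent of volume size, which is why a snapshot of a petabyte-scale volume can complete in seconds.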
Real-Time Operational Analytics: Operational analytics entails analyzing live, operational data immediately to deliver outputs in real time. With MapR-DB and Hadoop on the same cluster, you add immediacy to the analysis of your live, operational data without needing to copy data across separate clusters. This converges the real-time data access of NoSQL databases with the large-scale parallel processing of Hadoop. A few examples of operational analytics use cases include customer service optimization, real-time ad targeting, real-time personalization, and logistics route optimization. MapR-DB also supports a real-time, reliable data transport subsystem that leverages the underlying foundation of the Table Replication feature. The MapR-DB and Elasticsearch integration uses this subsystem to deliver database updates in real time to your search indexes, so that your full-text searches reflect the latest information. This subsystem is the basis for future integrations with additional external compute engines, including Spark Streaming and Apache Storm.
Unified Data Access: By integrating Hadoop and NoSQL data in one platform, you get a single source of truth that can be accessed by a variety of methods. You can update a more comprehensive set of data and still have your analysis tools, including business intelligence, visualizations, SQL-based querying, and search, available in the same interface. With that unified view, you get the complete picture of data that you need for gaining better business-critical insights.
Architectural/Administrative Simplicity: A simplified architecture means less risk when running a production system. It reduces the potential for error and the time-consuming troubleshooting that complexity brings. The consolidated platform also reduces the costs of maintaining two separate clusters.
You no longer have duplicated effort around system administration, high availability and disaster recovery setup, data governance, security configuration, auditing, and system upgrades. Duplicate software licenses for business intelligence, data governance, data integration, and so on across clusters are also eliminated, reducing unnecessary technology expenditures.
The Power of MapR (Reliability, Security, Performance, Scalability): All the powerful features that make the MapR Distribution including Apache Hadoop ready for business-critical production deployments are also available for MapR-DB. With the optimized, integrated architecture that treats all data similarly, MapR requires no additional administration for the NoSQL data. High performance, extreme scalability, high availability, disaster recovery, snapshots, integrated security, multi-tenancy, and low administrative overhead all make MapR an ideal platform for your production-ready operational analytics. For your highly secure data architectures, MapR-DB gives you the features you need to protect your sensitive data. Role-based access control expressions (ACEs) provide fine-grained controls on your sensitive data to allow only authorized viewers. Comprehensive auditing at the table, column family, and column levels helps you meet your regulatory compliance requirements. More information about MapR security is available from MapR.
5.3 Open Source Projects
MapR includes several open source projects, many of which are shown in the Hadoop ecosystem box:
Apache Hadoop: A framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models. YARN and MapReduce components are included.
Apache Drill: A low-latency, schema-free, distributed, ANSI SQL query engine for large-scale datasets, including structured and semi-structured/nested data.
Apache Pig: A language and runtime for analyzing large data sets, consisting of a high-level language for expressing data analysis programs and an infrastructure for evaluating those programs.
Apache Spark: A fast and general engine for large-scale data processing.
Apache Tez: An application framework that allows for a complex directed acyclic graph of tasks for processing data (available as a developer preview).
Apache Mahout: A scalable machine learning library.
Apache Hive: A data warehouse system for Hadoop that facilitates easy data summarization, ad hoc queries, and the analysis of large data sets stored in Hadoop-compatible file systems.
Impala: An analytic database architected specifically to leverage the flexibility and scalability strengths of Hadoop, combining the familiar SQL support and multi-user performance of a traditional analytic database.
Apache HBase: The Hadoop database; a distributed, scalable big data store.
Solr: A popular, blazing-fast, open source enterprise search platform built on Apache Lucene.
Apache Sqoop: A tool designed for efficiently transferring bulk data between Apache Hadoop and structured data stores such as relational databases.
Apache Flume: A distributed, reliable, and highly available service for efficiently moving large amounts of data around a cluster.
Hue: A web interface for analyzing data with Apache Hadoop.
Sentry: A system for enforcing fine-grained, role-based authorization to data and metadata stored on a Hadoop cluster.
Apache Oozie: A workflow coordination manager.
Apache ZooKeeper: A distributed service for maintaining configuration information, providing distributed synchronization, and providing group services.
Sahara: A simple means to provision a data-intensive application cluster (Hadoop or Spark) on top of OpenStack.
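Several of the projects listed above (Hadoop MapReduce, Pig, Hive, Spark) build on the same map/shuffle/reduce processing model. The canonical word-count example can be sketched in plain Python, without a cluster, to show what each phase does (an illustrative sketch only, not how MapR executes jobs internally):

```python
from collections import defaultdict

def map_phase(records):
    """Map: emit a (word, 1) pair for every word in every input record."""
    for line in records:
        for word in line.split():
            yield word.lower(), 1

def shuffle(pairs):
    """Shuffle: group intermediate values by key, as the framework
    does between the map and reduce phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reduce: combine each word's counts into a total."""
    return {word: sum(counts) for word, counts in groups.items()}

lines = ["big data", "big clusters store big data"]
print(reduce_phase(shuffle(map_phase(lines))))
# {'big': 3, 'data': 2, 'clusters': 1, 'store': 1}
```

In a real Hadoop or MapR cluster, the map and reduce functions run in parallel across nodes and the shuffle moves data over the network; MapR's optimized shuffle, described in the component model, targets exactly that middle phase.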
6 Operational model
This section describes the operational model for the MapR reference architecture. To show the operational model for different-sized customer environments, four different models are provided for supporting different amounts of data. Throughout the document, these models are referred to as the starter rack, half rack, full rack, and multi-rack configuration sizes. The multi-rack configuration is three times larger than the full rack. A MapR deployment consists of cluster nodes, networking, power, and racks. The predefined configurations can be implemented as is or modified based on specific customer requirements, such as lower cost, improved performance, and increased reliability. Key workload requirements, such as the data growth rate, sizes of datasets, and data ingest patterns, help in determining the proper configuration for a specific deployment. A best practice when a MapR cluster infrastructure is designed is to conduct proof-of-concept testing by using representative data and workloads to ensure that the proposed design works.
6.1 Hardware description
This reference architecture uses Lenovo System x3650 M5 servers and Lenovo RackSwitch G8052 and G8272 top-of-rack switches.
6.1.1 Lenovo System x3650 M5 Server
The Lenovo System x3650 M5 server (as shown in Figure 3) is an enterprise-class, 2U, two-socket versatile server that incorporates outstanding reliability, availability, and serviceability (RAS), security, and high efficiency for business-critical applications and cloud deployments. It offers a flexible, scalable design and a simple upgrade path to 2.5-inch hard disk drives (HDDs) or solid-state drives (SSDs), or 3.5-inch HDDs, with a doubled data transfer rate through 12 Gbps serial-attached SCSI (SAS) internal storage connectivity, and up to 1.5 TB of TruDDR4 memory. On board, it provides four standard embedded Gigabit Ethernet ports and two optional embedded 10 Gigabit Ethernet ports without occupying PCIe slots.
Figure 3: Lenovo x3650 M5
Combined with the Intel Xeon processor E5-2600 v3 product family, the Lenovo x3650 M5 server offers an even higher density of workloads and performance that lowers the total cost of ownership (TCO) per virtual machine. Its pay-as-you-grow flexible design and great expansion capabilities solidify dependability for any kind of virtualized workload, with minimal downtime. The Lenovo x3650 M5 server provides an internal storage density of up to 87.6 TB in a 2U form factor with its impressive array of workload-optimized storage configurations. It also offers easy management, and saves floor space and power consumption for the most demanding storage virtualization use cases, by consolidating storage and server into one system.
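The 87.6 TB per-node figure can be checked with simple arithmetic, and the same arithmetic is the starting point for cluster sizing. The sketch below assumes the drive mix implied by the text (3.5-inch drives of 6 TB each plus two 1.8 TB 2.5-inch drives, which requires fourteen 3.5-inch drives to reach 87.6 TB) and an illustrative 12 data nodes per rack; it applies the common rule of thumb of dividing raw capacity by the Hadoop replication factor (typically 3) to estimate usable space. Real sizing should also account for compression, temporary space, and growth:

```python
def node_raw_tb(large_drives=14, large_tb=6.0, small_drives=2, small_tb=1.8):
    """Raw capacity of one storage-rich node, in TB:
    14 x 6 TB (3.5-inch) + 2 x 1.8 TB (2.5-inch) = 87.6 TB."""
    return large_drives * large_tb + small_drives * small_tb

def usable_tb(raw_tb, replication=3):
    """Rough usable capacity after n-way replication."""
    return raw_tb / replication

per_node = node_raw_tb()          # 87.6 TB raw per node
per_rack = per_node * 12          # assuming 12 data nodes per rack
print(per_node, per_rack, usable_tb(per_rack))
```

With 12 storage-rich nodes, a rack holds just over 1 PB of raw capacity (1,051.2 TB), or roughly 350 TB usable at 3-way replication.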
The reference architecture recommends the storage-rich System x3650 M5 model for the following reasons:

Storage capacity: The nodes are storage-rich. Each of the fourteen 3.5-inch drives has raw capacity up to 6 TB, and each of the two 2.5-inch drives has raw capacity of 1.8 TB, for a total of 87.6 TB per node and over 1 petabyte per rack.
Performance: This hardware supports the latest Intel Xeon processors and TruDDR4 memory.
Flexibility: Server hardware uses embedded storage, which results in simple scalability (by adding nodes).
More PCIe slots: Up to eight PCIe slots are available if rear disks are not used, and up to two PCIe slots if both the Rear 3.5-inch HDD Kit and the Rear 2.5-inch HDD Kit are used. They can be used for network adapter redundancy and increased network throughput.
Better power efficiency: Innovative power and thermal management provides energy savings.
Reliability: Lenovo is first in the industry in reliability and has exceptional uptime with reduced costs.

For more information, see the Lenovo System x3650 M5 Product Guide: lenovopress.com/tips

6.1.2 Lenovo RackSwitch G8052

The Lenovo System Networking RackSwitch G8052 (as shown in Figure 4) is an Ethernet switch that is designed for the data center and provides a virtualized, cooler, and simpler network solution. The Lenovo RackSwitch G8052 offers up to 48 1 GbE ports and up to four 10 GbE ports in a 1U footprint. The G8052 switch is always available for business-sensitive traffic by using redundant power supplies, fans, and numerous high-availability features.

Figure 4: Lenovo RackSwitch G8052

The Lenovo RackSwitch G8052 has the following characteristics:

Forty-eight 1 GbE RJ45 ports
Four standard 10 GbE SFP+ ports
Low 130 W power rating and variable-speed fans to reduce power consumption

For more information, see the RackSwitch G8052 Product Guide: lenovopress.com/tips

6.1.3 Lenovo RackSwitch G8272

Designed with top performance in mind, the Lenovo RackSwitch G8272 is ideal for today's big data, cloud, and optimized workloads.
The G8272 switch offers up to 48 10 Gb SFP+ ports in a 1U form factor and is expandable with six 40 Gb QSFP+ ports. It is an enterprise-class, full-featured data center switch that delivers line-rate, high-bandwidth switching, filtering, and traffic queuing without delaying data. Large data-center-grade buffers keep traffic moving. Redundant power and fans and numerous HA features equip the switch for business-sensitive traffic. The G8272 switch (as shown in Figure 5) is ideal for latency-sensitive applications, such as client virtualization. It supports Lenovo Virtual Fabric to help clients reduce the number of I/O adapters to a single dual-port 10 Gb
adapter, which helps reduce cost and complexity. The G8272 switch supports the newest protocols, including Data Center Bridging/Converged Enhanced Ethernet (DCB/CEE) for support of FCoE, iSCSI, and NAS.

Figure 5: Lenovo RackSwitch G8272

The enterprise-level Lenovo RackSwitch G8272 has the following characteristics:

48 x SFP+ 10 GbE ports plus 6 x QSFP+ 40 GbE ports
Support for up to 72 x 10 Gb connections using break-out cables
1.44 Tbps non-blocking throughput with very low latency (~600 ns)
Up to 72 1 Gb/10 Gb SFP+ ports
OpenFlow enabled, which allows for easily created user-controlled virtual networks

For more information, see the RackSwitch G8272 Product Guide: lenovopress.com/tips

6.2 Cluster nodes

The MapR reference architecture is implemented on a set of nodes that make up a cluster. Nodes use System x3650 M5 servers with locally attached storage. MapR runs well in a homogeneous server environment with no need for different hardware configurations for management and data services. Server nodes can run three different types of services: MapR data services, MapR management services, and other optional MapR services. Unlike other Hadoop distributions that require different server configurations for management nodes and data nodes, the MapR reference architecture requires only a single MapR node hardware configuration. Each node is then configured to run one or more of these services. Table 1 lists the recommended system components for cluster nodes.
Table 1: Node predefined configuration

System: System x3650 M5
Processor: 2 x Intel Xeon processor E5-2650 v3, 2.3 GHz, 10-core
Memory (base): 128 GB, 8 x 16 GB 2133 MHz RDIMM (minimum)
Disk (OS): 1.2 TB drives: 2 x 1.2 TB SAS 2.5-inch, or 1.8 TB drives: 2 x 1.8 TB SAS 2.5-inch
Disk (data): 4 TB drives: 14 x 4 TB NL SATA 3.5-inch (56 TB total), or 6 TB drives: 14 x 6 TB NL SATA 3.5-inch (84 TB total)
HDD controller: 12 Gb JBOD controller
Hardware storage protection: None (JBOD). By default, MapR maintains a total of three copies of data stored within the cluster. The copies are distributed across data servers and racks for fault recovery.
Hardware management network adapter: Integrated 1GBaseT adapter
Data network adapter: Mellanox ConnectX-3 10 GbE adapter
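As a quick cross-check of the capacities in Table 1 and the per-node figures quoted elsewhere in this document, here is a short illustrative calculation (a sketch only, not part of any MapR or Lenovo tooling):

```python
# Illustrative cross-check of per-node raw storage (14 data drives + 2 OS drives).
DATA_DRIVES = 14
OS_DRIVES = 2

def node_raw_tb(data_drive_tb, os_drive_tb=1.8):
    """Total raw capacity per node: data drives plus OS drives, in TB."""
    return DATA_DRIVES * data_drive_tb + OS_DRIVES * os_drive_tb

print(DATA_DRIVES * 4)            # 56 (TB of data disk with 4 TB drives)
print(DATA_DRIVES * 6)            # 84 (TB of data disk with 6 TB drives)
print(round(node_raw_tb(6), 1))   # 87.6 (TB total per node, matching the product description)
```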
The Intel Xeon processor E5-2650 v3 is recommended to provide sufficient performance. A minimum of 128 GB of memory is recommended for most MapReduce workloads, with 256 GB or more recommended for Drill, Spark, and memory-intensive MapReduce workloads. Two sets of disks are used: one set for the operating system and the other set for data. For the operating system disks, RAID 1 mirroring should be used. Each node in the reference architecture has internal storage; external storage is not used in this reference architecture. Available data space assumes the use of Hadoop replication with three copies of the data, and 25% capacity reserved for efficient file system operation and to allow time to increase capacity if needed. In situations where higher storage capacity is required, the main design approach is to increase the amount of data disk space per node. Using 6 TB drives instead of 4 TB drives increases the total per-node data disk capacity from 56 TB to 84 TB, a 50% increase. When increasing data disk capacity, there is a balance between performance and throughput. For some workloads, increasing the amount of user data that is stored per node can decrease disk parallelism and negatively affect performance. Increasing drive sizes also affects rebuilding and repopulating the replicas if there is a disk or node failure. Higher-density disks or nodes result in higher rebuild times. Drives that are larger than 4 TB are not recommended based on the balance of capacity and performance. In this case, higher capacity can be achieved by increasing the number of nodes in the cluster. For higher I/O throughput, the data node can be configured with 2.5-inch SAS drives, which have less storage capacity but much higher I/O throughput. For the HDD controller, just a bunch of disks (JBOD) is the best choice for a MapR cluster.
It provides excellent performance and, when combined with the Hadoop default of 3x data replication, also provides significant protection against data loss. The use of RAID with data disks is discouraged because it reduces performance and the amount of data that can be stored. Data nodes can be customized according to client needs. A minimum of three data nodes is required because Hadoop maintains three copies of data by default. Three data nodes should be used for test or proof-of-concept environments only. A minimum of five data nodes is required for production environments to tolerate data node failures. The number of nodes recommended for running management services and data services varies based on the size of the configuration. MapR is very flexible in its ability to use any node for any management service. Depending on workloads and HA requirements, multiple nodes could be dedicated to a single management service, multiple management services, or both management and data services. The number of nodes running management services can be customized based on specific workloads and HA requirements. Table 2 shows the number of nodes that should run management services depending on the size of the cluster and the best practices for allocating the management services to the nodes.
Table 2: Number of nodes running MapR management services

Fewer than 40 total nodes (3 to 5 management nodes): Dedicated management nodes are not required. Management services run on nodes that also run data and optional services. CLDB on two or three nodes. ResourceManager on two or three nodes. ZooKeeper on three nodes (run ZooKeeper on an odd number of nodes). Reduce the number of task slots on servers running both data and management services to ensure that processor and memory resources are available to the management services. WebServer and NFS server can also be run on nodes running management services. For faster failover recovery times, avoid running ZooKeeper and CLDB services on the same servers.

40 to 100 total nodes (5 to 7 management nodes): Dedicated management nodes are not required. Management services run on nodes that also run data and optional services. CLDB on two to four nodes. ResourceManager on two or three nodes. ZooKeeper on three or five nodes (run ZooKeeper on an odd number of nodes). Reduce the number of task slots on servers running both data and management services to ensure that processor and memory resources are available to the management services. WebServer and NFS server can also be run on nodes running management services. Spread management services across racks and across nodes to avoid running an instance of each management service on a single node. For faster failover recovery times, avoid running ZooKeeper and CLDB services on the same servers.

More than 100 total nodes (7 or more management nodes): Dedicate nodes to management services. Do not run data or optional services on the nodes running management services. CLDB on three or more nodes. ResourceManager on two or three nodes. ZooKeeper on five nodes (run ZooKeeper on an odd number of nodes). On very large clusters, dedicate nodes to running only CLDB, and dedicate other nodes to running only ZooKeeper.
Spread management services across racks and across nodes to avoid running an instance of each management service on a single node.
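The node-count guidance in Table 2 can be sketched as a small helper (an illustrative sketch only; the thresholds and ranges are taken from the table, not from any MapR tool):

```python
def management_node_range(total_nodes):
    """Recommended number of nodes running MapR management services,
    per the sizing guidance in Table 2. Returns (minimum, maximum);
    maximum is None when the guidance is open-ended ("7 or more")."""
    if total_nodes < 40:
        return (3, 5)     # management services share nodes with data services
    elif total_nodes <= 100:
        return (5, 7)     # still shared; spread services across racks
    else:
        return (7, None)  # 7 or more, on dedicated management nodes

print(management_node_range(20))   # (3, 5)
print(management_node_range(60))   # (5, 7)
print(management_node_range(300))  # (7, None)
```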
In clusters up to 100 nodes, management services typically reside on nodes that also provide data services. For a very small cluster that does not require failover or high availability, all the management services and data services can run on one node. However, HA is recommended and requires at least three nodes running management services. Even if high availability is not required, MapR Enterprise Database Edition provides snapshots, mirrors, and multiple NFS servers. Also, the HA features of Enterprise Database Edition provide a mechanism for administrators to perform rolling upgrades of management processes without any downtime on the cluster. A small cluster with fewer than 40 nodes should be set up to run the management services on three to five nodes for high availability. A medium-sized cluster should be set up to run the management services on at least five nodes, with ZooKeeper, CLDB, and ResourceManager nodes distributed across racks. This environment provides failover and high availability for all critical services. For clusters over 100 nodes, some nodes can be dedicated to management services. A large cluster can be set up to run the management services on a minimum of seven nodes, with these nodes distributed across racks. In a large cluster, isolate the CLDB service from other services by placing it on dedicated nodes. In addition, in large clusters, ZooKeeper services should be isolated from other services on dedicated nodes. To reduce recovery time upon node failure, avoid running CLDB and ZooKeeper on the same node.

6.3 Systems management

Systems management includes cluster system management and hardware management. Cluster systems management uses the MapR Control System, which is adapted from the standard Hadoop distribution and places the management services on servers across the cluster.
Hardware management uses the Lenovo XClarity Administrator, which is a centralized resource management solution that reduces complexity, speeds up response, and enhances the availability of Lenovo server systems and solutions. The Lenovo XClarity Administrator provides agent-free hardware management for Lenovo's System x rack servers and Flex System compute nodes and components, including the Chassis Management Module (CMM) and Flex System I/O modules. Figure 6 shows the Lenovo XClarity Administrator interface, in which Flex System components and rack servers are managed and can be seen on the dashboard. Lenovo XClarity Administrator is a virtual appliance that is quickly imported into a virtualized environment.
Figure 6: XClarity Administrator interface

Also, xCAT is a scalable distributed computing management and provisioning tool that provides a unified interface for hardware control, discovery, and operating system deployment. It can be used to facilitate or automate the management of cluster nodes. For more information about xCAT, see the Resources section.

6.4 Networking

The reference architecture specifies two networks: a data network and an administrative or management network. Figure 7 shows the networking configuration.
Figure 7: MapR network configuration

6.4.1 Data network

The data network is a private cluster data interconnect among nodes that is used for data access, moving data across nodes within a cluster, and importing data into the MapR file system. The MapR cluster typically connects to the customer's corporate data network. Two top of rack switches are required for the data network that is used by MapR. Either 1 GbE or 10 GbE switches can be used (a 1 GbE switch is sufficient for some workloads). The recommended 1 GbE switch is the Lenovo RackSwitch G8052. A 10 GbE switch can provide extra I/O bandwidth for better performance. The recommended 10 GbE switch is the Lenovo System Networking RackSwitch G8272. The two Mellanox 10 GbE ports of each node are link aggregated to the recommended G8272 rack switch for better performance and improved HA. The data network is configured to use a virtual local area network (VLAN).

6.4.2 Hardware management network

The hardware management network is a 1 GbE network that is used for in-band operating system administration and out-of-band hardware management. In-band administrative services, such as SSH or Virtual Network Computing (VNC) running on the host operating system, enable cluster nodes to be administered. Through the Integrated Management Module II (IMM2) within the System x3650 M5 server, out-of-band management enables the hardware-level management of cluster nodes, such as node deployment or basic input/output system (BIOS) configuration.
Hadoop has no dependency on the IMM2. Based on customer requirements, the administration links and management links can be segregated onto separate VLANs or subnets. The administrative or management network is typically connected directly to the customer's administrative network. When the in-band administrative services on the host operating system are used, MapR is configured to use the data network only. By default, MapR uses all the available network interfaces. The reference architecture requires one 1 GbE top of rack switch for the hardware management network. Administrators also can access all of the nodes in the cluster through the customer admin network, as shown in Figure 7. This rack switch for the hardware management network is connected to each of the nodes in the cluster by using two physical links (one for in-band operating system administration and one for out-of-band IMM2 hardware management). On the nodes, the administration link connects to port 1 on the integrated 1GBaseT adapter, and the management link connects to the dedicated IMM2 port.

6.4.3 Multi-rack network

A multi-rack cluster requires one or more Lenovo RackSwitch G8316 core switches per cluster. In this case, the second Mellanox 10 GbE port can be connected to a second Lenovo RackSwitch G8272. Figure 8 shows how the network is configured when the MapR cluster is installed across more than one rack. The data network is connected across racks by two aggregated 40 GbE uplinks from each rack's G8272 switch to a core G8316 switch.
Figure 8: MapR cross rack network configuration (to simplify the diagram, only one G8272 is drawn; in the recommended configuration, two G8272 switches are configured for HA)

A 40 GbE switch is recommended for interconnecting the data network across multiple racks. The Lenovo System Networking RackSwitch G8316 is the recommended switch. A best practice is to have redundant core switches for each rack to avoid a single point of failure. Within each rack, the G8052 switch can optionally be configured to have two uplinks to the G8272 switch to allow propagation of the administrative/management VLAN across cluster racks through the G8316 core switch. For large clusters, the Lenovo System Networking RackSwitch G8332 is recommended because it provides a better cost value per 40 Gb port than the G8316. Core switches are used for cross-rack data transfer. Many other cross rack network configurations are possible and may be required to meet the needs of specific deployments or to address clusters larger than three racks. If the solution is initially implemented as a multi-rack solution, or if the system grows by adding racks, the nodes that provide management services should be distributed across racks to maximize fault tolerance.

6.5 Predefined cluster configurations

The intent of the four predefined configurations is to ease initial sizing for customers and to show example starting points for different sized workloads. The reference architecture is not limited to these four cluster sizes. The starter rack configuration consists of three nodes and a pair of rack switches. The half rack
configuration consists of ten nodes and a pair of rack switches. The full rack configuration (a rack fully populated) consists of 19 nodes and a pair of rack switches. The multi-rack contains a total of 57 nodes: 19 nodes and a pair of switches in each rack. Table 3 lists the four predefined configurations for the MapR reference architecture. The table also lists the amount of space for data and the number of nodes that each predefined configuration provides. Storage space is described in two ways: the total amount of raw storage space when 4 TB or 6 TB drives are used (raw storage) and the amount of space for the data that the customer has (available data space). Available data space assumes the use of Hadoop replication with three copies of the data and 25% capacity that is reserved for efficient file system operation and to allow time to increase capacity if needed. Available data space might increase significantly with MapR automatic compression. The estimates that are listed in Table 3 do not include extra space that is freed up by using compression because compression rates can vary widely based on file contents.

Table 3: Predefined configurations

                              Starter rack   Half rack   Full rack   Multi-rack
Raw storage (4 TB)            168 TB         560 TB      1120 TB     3360 TB
Available data space (4 TB)   42 TB          140 TB      280 TB      840 TB
Raw storage (6 TB)            252 TB         840 TB      1680 TB     5040 TB
Available data space (6 TB)   63 TB          210 TB      420 TB      1260 TB
Number of nodes               3              10          19          57
Number of racks               1              1           1           3
Number of 10 GbE cables
Number of 1 GbE cables

The number of nodes that is required in the cluster to support these four predefined configurations is shown in Table 3. These are the estimates for highly available clusters. Three nodes are required to support a customer deployment that has 63 TB of data, ten nodes are needed to support a customer deployment that has 210 TB of data, and so on.
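The available data space rows in Table 3 follow directly from the raw storage rows, three-way replication, and the 25% reserve; a quick illustrative check:

```python
def available_data_tb(raw_tb, replicas=3, reserve=0.25):
    """Usable space for user data: keep 75% of raw capacity
    (25% reserved for headroom), divided by the replication factor."""
    return raw_tb * (1 - reserve) / replicas

# Raw storage figures from Table 3 and the expected available data space
for raw, expected in [(168, 42), (560, 140), (1120, 280), (3360, 840),    # 4 TB drives
                      (252, 63), (840, 210), (1680, 420), (5040, 1260)]:  # 6 TB drives
    assert available_data_tb(raw) == expected
print("Table 3 available data space figures check out")
```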
Figure 9 shows an overview of the architecture in three different one-rack sized clusters without network redundancy: a starter rack, a half rack and a full rack. Figure 10 shows a multi-rack-sized cluster without network redundancy.
Figure 9: Starter rack, half rack, and full rack MapR predefined configurations
Figure 10: Multi-rack MapR predefined configuration
7 Deployment considerations

This section describes other considerations for deploying the MapR solution.

7.1 Increasing cluster performance

There are two approaches that can be used to increase cluster performance: increasing node memory and using a high-performance job scheduler and MapReduce framework. Improving performance often comes at increased cost, and you must consider the cost-to-benefit trade-offs of designing for higher performance. In the MapR predefined configuration, node memory can be increased to 256 GB by using 16 x 16 GB RDIMMs. An even larger memory configuration can provide greater performance, depending on the workload.

7.2 Designing for lower cost

There are two key modifications that can be made to lower the cost of a MapR reference architecture solution. When lower-cost options are considered, it is important to ensure that customers understand the potential performance implications of a lower cost design. A lower cost version of the MapR reference architecture can be achieved by using lower cost node processors and lower cost cluster data network infrastructure. The node processors can be substituted with the Intel Xeon E5-2630 v3 2.4 GHz 8-core processor. This processor requires 1866 MHz RDIMMs, which can also lower the per-node cost of the solution. The use of a lower cost network infrastructure can significantly lower the cost of the solution, but it can also have a substantial negative effect on intra-cluster data throughput and cluster ingest rates. To use a lower cost network infrastructure, make the following substitutions to the predefined configuration:

Within each node, substitute the Mellanox 10 GbE dual SFP+ network adapter with the extra ports on the integrated 1GBaseT adapter within the System x3650 M5 server.
Within each rack, substitute the Lenovo RackSwitch G8272 top of rack switch with the Lenovo RackSwitch G8052.
Within each cluster, substitute the Lenovo RackSwitch G8316 core switch with the Lenovo RackSwitch G8272.
Although the network wiring schema is the same as the schema that is described in "Networking", the media types and link speeds within the data network change. The data network within a rack, which connects the cluster nodes to the lower cost G8052 top of rack switch, is now based on two aggregated 1GBaseT links per node. The physical interconnect between the administrative or management network and the data network within each rack is now based on two aggregated 1 GbE links between the admin/management G8052 switch and the lower cost data network G8052 switch. Within a cluster, the racks are interconnected through two aggregated 10 GbE links between the substitute G8052 data network switch in each rack and a lower cost G8272 core switch.

7.3 Designing for high ingest rates

Designing for high ingest rates is difficult. It is important to have a full characterization of the ingest patterns and volumes. The following questions provide guidance on the key factors that affect ingest rates:

On what days and at what times are the source systems available or not available for ingest?
When a source system is available for ingest, what is the duration for which the system remains available?
Do other factors affect the day, time, and duration ingest constraints?
When ingests occur, what is the average and maximum size of ingest that must be completed?
What factors affect ingest size?
What is the format of the source data (structured, semi-structured, unstructured)?
Are there any data transformation or cleansing requirements that must be met during ingest?

To increase data ingest rates, consider the following points:

Ingest data with MapReduce jobs, which helps distribute the I/O load across the nodes of the cluster.
Ingest when cluster load is not high, if possible.
Compress data where appropriate; compression reduces the I/O load on disk and network.
Filter and reduce data at an early stage to save cost downstream.

7.4 Estimating disk space

When you are estimating disk space within a MapR Hadoop cluster, consider the following points:

For improved fault tolerance and performance, the MapR file system replicates data blocks across multiple cluster nodes. By default, the file system maintains three replicas.
Compression ratio is an important consideration in estimating disk space and can vary greatly based on file contents. MapR provides automatic compression, and available data space might increase significantly because of it. If the customer's data compression ratio is unavailable, assume a compression ratio of 2.5:1.
To ensure efficient file system operation and to allow time to add more storage capacity to the cluster if necessary, reserve 25% of the total capacity of the cluster.
Assuming the default three replicas maintained by the MapR file system, the raw data disk space and the required number of nodes can be estimated by using the following equations:

Total raw data disk space = (User data, uncompressed) * (4 / compression ratio)
Total required nodes = (Total raw data disk space) / (Raw data disk per node)

You should also consider future growth requirements when estimating disk space. Based on these sizing principles, Table 4 shows an example for a cluster that must store 500 TB of uncompressed user data. The example shows that the MapR cluster needs 800 TB of raw disk to support 500 TB of uncompressed data. The 800 TB is for data storage and does not include operating system disk space. A total of ten nodes, about a half rack, is required to support a deployment of this size.
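The two equations above can be expressed as a short sizing helper (illustrative only; node counts are rounded up to whole nodes):

```python
import math

def raw_disk_tb(user_data_tb, compression_ratio=2.5, factor=4):
    """Total raw data disk space:
    (user data, uncompressed) * (factor / compression ratio).
    The factor of 4 covers 3x replication plus 25% headroom."""
    return user_data_tb * factor / compression_ratio

def nodes_required(user_data_tb, per_node_tb, compression_ratio=2.5):
    """Total required nodes, rounded up to a whole node."""
    return math.ceil(raw_disk_tb(user_data_tb, compression_ratio) / per_node_tb)

# The Table 4 example: 500 TB uncompressed, 2.5:1 compression, 6 TB drives (84 TB/node)
print(raw_disk_tb(500))         # 800.0 TB of raw data disk
print(nodes_required(500, 84))  # 10 nodes (800 / 84 = 9.52, rounded up)
```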
Table 4: Example of storage sizing with 6 TB drives

Description                                    Value
Size of uncompressed user data                 500 TB
Compression ratio                              2.5:1
Size of compressed data                        200 TB
Storage multiplication factor                  4
Raw data disk space needed for MapR cluster    800 TB
  Storage needed for MapR-FS 3x replication    600 TB
  Reserved storage for headroom                200 TB
Raw data disk per node (with 6 TB drives)      84 TB
Minimum number of nodes required               10

7.5 Scaling considerations

The Hadoop architecture is linearly scalable, but it is important to note that some workloads might not scale completely linearly. When the capacity of the infrastructure is reached, the cluster can be scaled out by adding nodes. Typically, identically configured nodes are best to maintain the same ratio of storage and compute capabilities. A MapR cluster is scalable by adding System x3650 M5 nodes and network switches and optionally adding management services and optional services on those nodes. The MapR No NameNode architecture allows linear scalability to trillions of files and thousands of petabytes. As the capacity of racks is reached, new racks can be added to the cluster. When a MapR reference architecture implementation is designed, future scale-out should be a key consideration in the initial design. There are two key aspects to consider: networking and management. These aspects are critical to cluster operation and become more complex as the cluster infrastructure grows. The cross-rack networking configuration that is shown in Figure 8 provides robust network interconnection of racks within the cluster. As racks are added, the predefined networking topology remains balanced and symmetrical. If there are plans to scale the cluster beyond one rack, a best practice is to initially design the cluster with multiple racks (even if the initial number of nodes fits within one rack). Starting with multiple racks can enforce proper network topology and prevent future reconfiguration and hardware changes.
As racks are added over time, multiple G8316 switches might be required for greater scalability and balanced performance. Also, as the number of nodes within the cluster increases, so do many of the tasks of managing the cluster, such as updating node firmware or operating systems. Building a cluster management framework as part of the initial design and proactively considering the challenges of managing a large cluster pays off significantly in the long run.
Proactive planning for future scale-out and the development of a cluster management framework as part of the initial cluster design provide a foundation for future growth that can minimize hardware reconfigurations and cluster management issues as the cluster grows.

7.6 High availability considerations

When a MapR cluster on System x is implemented, consider availability requirements as part of the final hardware and software configuration. Typically, Hadoop is considered a highly reliable solution, but MapR enhancements make it highly available. Hadoop and MapR best practices provide significant protection against data loss. MapR ensures that failures are managed without causing an outage. Redundancy can be added to make a cluster even more reliable. Some consideration must be given to hardware and software redundancy.

7.6.1 Networking considerations

Optionally, a second redundant switch can be added to ensure HA of the hardware management network. The hardware management network does not affect the availability of MapR-FS or Hadoop functionality, but it might affect the management of the cluster; therefore, availability requirements must be considered. MapR provides application-level Network Interface Card (NIC) bonding for higher throughput and high availability. When using multiple NICs, customers can choose either MapR application-level bonding, or OS-level bonding with switch-based aggregation of some form matching the OS bonding configuration. Virtual Link Aggregation Groups (vLAG) can be used between redundant switches. If 1 Gbps data network links are used, it is recommended that more than one is used per node to increase throughput.

7.6.2 Hardware availability considerations

With no single point of failure, redundancy in server hardware components is not required for MapR. MapR automatically and transparently handles hardware failure resulting in the loss of any node in the cluster running any data or management service.
MapR's default three-way replication of data ensures that no data is lost, because two additional replicas of the data are maintained on other nodes in the cluster. MapReduce tasks from failed nodes are automatically started on other nodes in the cluster. Failure of a node running any management service is automatically and transparently recovered, as described for each service below. All ZooKeeper services are available for read operations, with one acting as the leader for all writes. If the node running the leader fails, the remaining nodes elect a new leader. Most commonly, three ZooKeeper instances are used to allow HA operations. In some large clusters, five ZooKeeper instances are used to allow fully HA operations even during maintenance windows that affect ZooKeeper instances. The number of ZooKeeper instances that must be run in a cluster depends on the cluster's high availability requirement, but it should always be an odd number. ZooKeeper requires a quorum of (N/2)+1 to elect a leader, where N is the total number of ZooKeeper nodes. Running more than five ZooKeeper instances is not necessary. All CLDB services are available for read operations, with one acting as the write master. If the node running the master CLDB service goes down, another running CLDB automatically becomes the master. A minimum of two instances is needed for high availability. One ResourceManager service is active. Other ResourceManager instances are configured but not running. If the active ResourceManager goes down, one of the configured instances automatically
takes over without requiring any job to restart. A minimum of two instances is needed for high availability.

All NFS servers are active simultaneously and can present an HA NFS service to nodes external to the cluster. To do this, specify virtual IP addresses for two or more NFS servers. Additionally, round-robin Domain Name System (DNS) can be used across the virtual IP addresses for load balancing in addition to high availability. For NFS access from within the cluster, NFS servers should be run on all nodes in the cluster, and each node should mount its local NFS server.

The MapR WebServer can run on any node in the cluster to provide the MapR Control System. The web server also provides a REST interface to all MapR management and monitoring functions. For HA, multiple active web servers can be run, with users connecting to any web server for cluster management and monitoring. Note that even with no web server running, all monitoring and management capabilities are available through the MapR command line interface.

Within racks, switches and nodes have redundant power feeds, with each power feed connected to a separate PDU.

7.6.3 Storage availability

RAID disk configuration is not necessary and should be avoided in MapR clusters because RAID negatively impacts performance. MapR provides automated setup and management of storage pools. The three-way replication provided by MapR-FS provides higher durability than RAID configurations because multiple node failures might not compromise data integrity. If the default 3x replication is not sufficient for availability requirements, the replication factor can be increased on a file, volume, or cluster basis. Replication levels higher than 5 are not normally used. Mirroring of MapR volumes within a single cluster can be used to achieve very high replication levels for higher durability or for higher read bandwidth. Mirrors can be used between clusters as well.
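The durability benefit of replication described above can be illustrated with a back-of-the-envelope probability sketch. It assumes independent node failures and one replica per node; these are illustrative assumptions for this document, not a description of MapR's actual placement or repair behavior:

```python
def p_all_replicas_down(p_node_down, replicas):
    """Chance that every node holding a given block's replicas is down at
    the same time, assuming independent node failures (illustrative only)."""
    return p_node_down ** replicas

# Hypothetical example: a 1% chance that any given node is down.
for r in (1, 3, 5):
    print(f"replication {r}: ~{p_all_replicas_down(0.01, r):.0e}")
```

Even under these rough assumptions, each additional replica reduces the exposure by orders of magnitude, which is why replication levels above 5 are rarely worthwhile.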
MapR mirrors efficiently by copying only the changes to the mirror. Mirrors are useful for load balancing or disaster recovery. MapR also provides manual or scheduled snapshots of volumes to protect against human error and programming defects. Snapshots are useful for rollback to a known data set.

7.6.4 Software availability considerations

Operating system availability is provided by using mirrored drives for the operating system. The MapR distribution including Apache Hadoop is unique because it was designed with a No NameNode architecture for high availability; MapR is the only Hadoop distribution designed with no single point of failure. Other Hadoop distributions have a single primary NameNode, and when the NameNode goes down, the entire cluster becomes unavailable until the NameNode is restarted. Even where other distributions are configured with multiple NameNodes, the entire cluster becomes unavailable during the failover to a secondary NameNode. With MapR, the file metadata is replicated, distributed, and persistent, so there is no data loss or downtime even in the face of multiple disk or node failures.

MapR ResourceManager HA improves recovery time objectives and provides for a self-healing cluster. Upon failure, the MapR ResourceManager automatically restarts on another node in the cluster. NodeManagers automatically pause and then reconnect to the new ResourceManager, and any currently running jobs or tasks continue without losing progress or failing.
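The ZooKeeper quorum rule from the hardware availability discussion above, a majority of (N/2)+1 out of N instances, explains both the odd-count recommendation and the failure tolerance. A short sketch (the helper functions are written for this document, not part of MapR):

```python
def zookeeper_quorum(n_instances):
    """Quorum size needed to elect a leader among N ZooKeeper instances: (N/2)+1."""
    return n_instances // 2 + 1

def tolerable_failures(n_instances):
    """How many instances can fail while a quorum can still be formed."""
    return n_instances - zookeeper_quorum(n_instances)

# An even instance count tolerates no more failures than the next-lower
# odd count, which is why an odd number of instances is recommended.
for n in (3, 4, 5):
    print(f"{n} instances: quorum={zookeeper_quorum(n)}, "
          f"tolerates {tolerable_failures(n)} failure(s)")
```

Three instances tolerate one failure; five tolerate two, which is why five are used in clusters that must stay fully HA through ZooKeeper maintenance.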
You can easily set up a pool of NFS nodes with HA and failover using virtual IP addresses. If one node fails, its virtual IP addresses are automatically reassigned to the next NFS node in the pool. It is also common to place an NFS server on every node where NFS access to the cluster is needed.

7.7 Migration considerations

If migration of data or applications to MapR is required, you must consider the type and amount of data to be migrated and the source of the data. Most data types can be migrated, but you must understand the migration requirements to verify viability. Standard Hadoop tools such as distcp can be used to migrate data from other Hadoop distributions. For data in a POSIX file system, you can NFS-mount the MapR cluster and use standard Linux commands to copy the files into the MapR Hadoop cluster. Either Sqoop or database import/export tools combined with MapR NFS can be used to move data between databases and MapR Hadoop.

You also need to consider whether applications must be modified to use Hadoop functionality. Because the MapR read/write file system can be mounted by a standard NFS client, the significant effort often required to migrate applications to other Hadoop distributions can frequently be avoided.
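As a sketch of the distcp path above, the following builds the standard hadoop distcp command line for a bulk copy between Hadoop file systems; the cluster names and paths are hypothetical placeholders:

```python
def distcp_cmd(src, dst):
    """Build the standard 'hadoop distcp' command line for a bulk copy
    between Hadoop file systems (source and target given as URIs)."""
    return ["hadoop", "distcp", src, dst]

# Hypothetical cluster names and paths -- substitute your own.
cmd = distcp_cmd("hdfs://old-cluster:8020/data/events", "maprfs:///data/events")
print(" ".join(cmd))
```

For POSIX data, the same copy is usually just an NFS mount plus standard Linux tools (for example, cp or rsync into the mounted cluster path), and Sqoop covers the database paths.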
8 Appendix: Bill of materials

This appendix includes the Bills of Materials (BOMs) for the different hardware configurations of the Big Data Solution for MapR deployments. There are sections for nodes, networking, rack, and cables. Each BOM includes the part numbers, component descriptions, and quantities. Table 3 on page 22 shows how many core components are required for each of the predefined configuration sizes.

The BOM lists in this appendix are not meant to be exhaustive and must always be verified with the configuration tools. Any discussion of pricing, support, and maintenance options is outside the scope of this document. This BOM information is for the United States; part numbers and descriptions can vary in other countries. Other sample configurations are available from your Lenovo sales team. Components are subject to change without notice. The configuration reference number for 4 TB drives is BDAMAPRV453 and for 6 TB drives is BDAMAPRC.

8.1 Node

Table 5 lists the BOM for the node.

Table 5. Node
Code     Description                                                   Quantity
5462AC1  IBM System x3650 M5                                           1
A5EA     System x3650 M5 Planar                                        1
A5EU     System x 750W High Efficiency Platinum AC Power Supply
         m, 10A/ V, C13 to IEC 320-C14 Rack Power Cable                2
A3YY     N2215 SAS/SATA HBA for Lenovo System x                        1
ASBM     Lenovo 1.8TB 10K 12Gbps SAS 2.5" G3HS 512e HDD                2
A3PM     Mellanox ConnectX-3 10 GbE Adapter for Lenovo System x        1
A5FX     System x Enterprise 2U Cable Management Arm (CMA)             1
A483     Populate and Boot From Rear Drives                            1
A5FV     System x Enterprise Slides Kit                                1
A1ML     Lenovo Integrated Management Module Advanced Upgrade          1
A5GU     Intel Xeon Processor E v3 10C 2.3GHz 25MB Cache 2133MHz 105W  1
A5EM     Intel Xeon Processor E v3 10C 2.3GHz 25MB 2133MHz 105W        1
A5FF     System x3650 M5 12x 3.5" Base without Power Supply            1
A5B7     16GB TruDDR4 Memory (2Rx4, 1.2V) PC CL MHz LP RDIMM           8
A5GE     x3650 M5 12x 3.5" HS HDD Assembly Kit                         1
A5GL     System x3650 M5 Rear 2x 3.5" HDD Kit (Cascaded)               1
A5GH     System x3650 M5 Rear 2x 2.5" HDD Kit (Independent RAID)       1
A5EU     System x 750W High Efficiency Platinum AC Power Supply        1
A45W     ServeRAID M1215 SAS/SATA Controller for Lenovo System x       1
         Rack Installation >1U Component                               1

Select 4 TB or 6 TB drives for data storage:
A5VQ     Lenovo 4TB 7.2K 12Gbps NL SAS 3.5" G2HS 512e HDD              14
A5VR     Lenovo 6TB 7.2K 12Gbps NL SAS 3.5" G2HS 512e HDD              14

8.2 Administration / Management network switch

Table 6 lists the BOM for the Administration / Management network switch.

Table 6. Administration / Management network switch
Code     Description                                                   Quantity
7159G52  Lenovo RackSwitch G8052 (Rear to Front)                       1
         m, 10A/ V, C13 to IEC 320-C14 Rack Power Cable                2
A3KP     IBM System Networking Adjustable 19" 4 Post Rail Kit          1
         Rack Installation of 1U Component                             1

8.3 Data network switch

Table 7 lists the BOM for the data network switch.

Table 7. Data network switch
Code     Description                                                   Quantity
7159CRW  Lenovo RackSwitch G8272 (Rear to Front)                       1
         m, 10A/ V, C13 to IEC 320-C14 Rack Power Cable                2
A3KP     IBM System Networking Adjustable 19" 4 Post Rail Kit          1
         Rack Installation of 1U Component                             1

8.4 Rack

Table 8 lists the BOM for the rack.

Table 8. Rack
Code     Description                                                   Quantity
1410HPB  Intelligent Cluster 42U 1100mm Enterprise V2 Dynamic Rack     1
         DPI Single-phase 30A/208V C13 Enterprise PDU (US)
         Cluster 1350 Ship Group
         Rack Assembly - 42U Rack
         Cluster Hardware & Fabric Verification - 1st Rack             1
Different cluster sizes leave different amounts of unused rack space; therefore, consider the use of blank filler panels to better direct cool air flow. Lenovo provides 1U, 3U, and 5U blank filler panels. The number of rack PDUs depends on the number of servers in the rack: four PDUs are needed for a half-rack configuration and six PDUs for a full-rack configuration.

8.5 Cables

Table 9 lists the BOM for the cables.

Table 9. Cables
Code     Description                                                   Quantity
A1PJ     3m IBM Passive DAC SFP+ Cable                                 2
A4RA     CAT5E IntraRack Cable Service                                 2
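The drive counts in Table 5 allow a rough usable-capacity estimate for a cluster built from these nodes. The sketch below uses illustrative assumptions (3-way replication and 25% reserved operational headroom) and ignores file-system overhead, compression, and temporary space:

```python
def usable_capacity_tb(nodes, drives_per_node=14, drive_tb=4.0,
                       replication=3, headroom=0.75):
    """Rough usable capacity in TB: raw capacity divided by the
    replication factor, with a fraction kept free as headroom.
    Illustrative only -- ignores overhead, compression, and temp space."""
    raw_tb = nodes * drives_per_node * drive_tb
    return raw_tb / replication * headroom

# Example: 9 data nodes, each with 14 x 4 TB drives, 3-way replication.
print(f"{usable_capacity_tb(9):.0f} TB usable")
# Same node count with the 6 TB drive option:
print(f"{usable_capacity_tb(9, drive_tb=6.0):.0f} TB usable")
```

Always confirm actual sizing with the configuration tools and the disk-space estimation guidance earlier in this document.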
9 Acknowledgements

This reference architecture document has benefited greatly from the detailed and careful review comments provided by Mike Perks of Lenovo.
10 Resources

For more information, see the following resources:

Lenovo System x3650 M5 (data node):
o Product page: shop.lenovo.com/us/en/systems/servers/racks/systemx/x3650-m5/
o Lenovo Press product guide: lenovopress.com/tips1193

Lenovo RackSwitch G8052 (1GbE switch):
o Product page: shop.lenovo.com/us/en/systems/browsebuy/%20rackswitch-g8052.html
o Lenovo Press product guide: lenovopress.com/tips0813

Lenovo RackSwitch G8272 (10GbE switch):
o Product page: shop.lenovo.com/us/en/systems/browsebuy/lenovo-rackswitch-g8272.html
o Lenovo Press product guide: lenovopress.com/tips1267

Lenovo XClarity Administrator:
o Product page: shop.lenovo.com/us/en/servers/thinkserver/system-management/xclarity
o Lenovo Press product guide: lenovopress.com/tips1200

MapR:
o MapR main website:
o MapR products:
o MapR editions overview:
o MapR architecture overview:
o MapR blogs:
o MapR resources:
o MapR products and differentiation:
o MapR documentation: doc.mapr.com
o MapR free Hadoop training:
o MapR getting started:

Open source software:
o Hadoop: hadoop.apache.org
o Pig: pig.apache.org
o Cascading:
o Spark: spark.apache.org
o Apache Tez: tez.apache.org
o Mahout: mahout.apache.org
o Hive: hive.apache.org
o Impala:
o Drill: drill.apache.org
o Solr: lucene.apache.org/solr
o Sqoop: sqoop.apache.org
o Flume: flume.apache.org
o Hue: gethue.com
o Sentry: sentry.incubator.apache.org
o Oozie: oozie.apache.org
o ZooKeeper: zookeeper.apache.org
o Sahara: wiki.openstack.org/wiki/Sahara

Other resources:
o xCAT: xcat.sourceforge.net
Trademarks and special notices

Copyright Lenovo

References in this document to Lenovo products or services do not imply that Lenovo intends to make them available in every country.

Lenovo, the Lenovo logo, and XClarity are trademarks of Lenovo. Intel, the Intel logos, and Xeon are trademarks of Intel Corporation in the United States, other countries, or both. Other company, product, or service names may be trademarks or service marks of others.

Information is provided "AS IS" without warranty of any kind. All customer examples described are presented as illustrations of how those customers have used Lenovo products and the results they may have achieved. Actual environmental costs and performance characteristics may vary by customer.

Information concerning non-Lenovo products was obtained from a supplier of these products, published announcement material, or other publicly available sources and does not constitute an endorsement of such products by Lenovo. Sources for non-Lenovo list prices and performance numbers are taken from publicly available information, including vendor announcements and vendor worldwide homepages. Lenovo has not tested these products and cannot confirm the accuracy of performance, capability, or any other claims related to non-Lenovo products. Questions on the capability of non-Lenovo products should be addressed to the supplier of those products.

All statements regarding Lenovo future direction and intent are subject to change or withdrawal without notice, and represent goals and objectives only. Contact your local Lenovo office or Lenovo authorized reseller for the full text of the specific Statement of Direction.

Some information addresses anticipated future capabilities. Such information is not intended as a definitive statement of a commitment to specific levels of performance, function, or delivery schedules with respect to any future products. Such commitments are only made in Lenovo product announcements.
The information is presented here to communicate Lenovo's current investment and development activities as a good faith effort to help with our customers' future planning.

Performance is based on measurements and projections using standard Lenovo benchmarks in a controlled environment. The actual throughput or performance that any user will experience will vary depending upon considerations such as the amount of multiprogramming in the user's job stream, the I/O configuration, the storage configuration, and the workload processed. Therefore, no assurance can be given that an individual user will achieve throughput or performance improvements equivalent to the ratios stated here.

Photographs shown are of engineering prototypes. Changes may be incorporated in production models.

Any references in this information to non-Lenovo websites are provided for convenience only and do not in any manner serve as an endorsement of those websites. The materials at those websites are not part of the materials for this Lenovo product, and use of those websites is at your own risk.
Information Builders Mission & Value Proposition
Value 10/06/2015 2015 MapR Technologies 2015 MapR Technologies 1 Information Builders Mission & Value Proposition Economies of Scale & Increasing Returns (Note: Not to be confused with diminishing returns
HP reference configuration for entry-level SAS Grid Manager solutions
HP reference configuration for entry-level SAS Grid Manager solutions Up to 864 simultaneous SAS jobs and more than 3 GB/s I/O throughput Technical white paper Table of contents Executive summary... 2
Open source software framework designed for storage and processing of large scale data on clusters of commodity hardware
Open source software framework designed for storage and processing of large scale data on clusters of commodity hardware Created by Doug Cutting and Mike Carafella in 2005. Cutting named the program after
Unified Computing Systems
Unified Computing Systems Cisco Unified Computing Systems simplify your data center architecture; reduce the number of devices to purchase, deploy, and maintain; and improve speed and agility. Cisco Unified
Online Transaction Processing in SQL Server 2008
Online Transaction Processing in SQL Server 2008 White Paper Published: August 2007 Updated: July 2008 Summary: Microsoft SQL Server 2008 provides a database platform that is optimized for today s applications,
Saving Millions through Data Warehouse Offloading to Hadoop. Jack Norris, CMO MapR Technologies. MapR Technologies. All rights reserved.
Saving Millions through Data Warehouse Offloading to Hadoop Jack Norris, CMO MapR Technologies MapR Technologies. All rights reserved. MapR Technologies Overview Open, enterprise-grade distribution for
Dell Reference Configuration for DataStax Enterprise powered by Apache Cassandra
Dell Reference Configuration for DataStax Enterprise powered by Apache Cassandra A Quick Reference Configuration Guide Kris Applegate [email protected] Solution Architect Dell Solution Centers Dave
White Paper. Managing MapR Clusters on Google Compute Engine
White Paper Managing MapR Clusters on Google Compute Engine MapR Technologies, Inc. www.mapr.com Introduction Google Compute Engine is a proven platform for running MapR. Consistent, high performance virtual
The Future of Data Management with Hadoop and the Enterprise Data Hub
The Future of Data Management with Hadoop and the Enterprise Data Hub Amr Awadallah Cofounder & CTO, Cloudera, Inc. Twitter: @awadallah 1 2 Cloudera Snapshot Founded 2008, by former employees of Employees
ORACLE DATABASE 10G ENTERPRISE EDITION
ORACLE DATABASE 10G ENTERPRISE EDITION OVERVIEW Oracle Database 10g Enterprise Edition is ideal for enterprises that ENTERPRISE EDITION For enterprises of any size For databases up to 8 Exabytes in size.
Enabling High performance Big Data platform with RDMA
Enabling High performance Big Data platform with RDMA Tong Liu HPC Advisory Council Oct 7 th, 2014 Shortcomings of Hadoop Administration tooling Performance Reliability SQL support Backup and recovery
Introduction to Hadoop. New York Oracle User Group Vikas Sawhney
Introduction to Hadoop New York Oracle User Group Vikas Sawhney GENERAL AGENDA Driving Factors behind BIG-DATA NOSQL Database 2014 Database Landscape Hadoop Architecture Map/Reduce Hadoop Eco-system Hadoop
IBM Spectrum Scale vs EMC Isilon for IBM Spectrum Protect Workloads
89 Fifth Avenue, 7th Floor New York, NY 10003 www.theedison.com @EdisonGroupInc 212.367.7400 IBM Spectrum Scale vs EMC Isilon for IBM Spectrum Protect Workloads A Competitive Test and Evaluation Report
Copyright 2012, Oracle and/or its affiliates. All rights reserved.
1 Oracle Big Data Appliance Releases 2.5 and 3.0 Ralf Lange Global ISV & OEM Sales Agenda Quick Overview on BDA and its Positioning Product Details and Updates Security and Encryption New Hadoop Versions
Deploying Hadoop with Manager
Deploying Hadoop with Manager SUSE Big Data Made Easier Peter Linnell / Sales Engineer [email protected] Alejandro Bonilla / Sales Engineer [email protected] 2 Hadoop Core Components 3 Typical Hadoop Distribution
Microsoft Private Cloud Fast Track
Microsoft Private Cloud Fast Track Microsoft Private Cloud Fast Track is a reference architecture designed to help build private clouds by combining Microsoft software with Nutanix technology to decrease
IBM BigInsights Has Potential If It Lives Up To Its Promise. InfoSphere BigInsights A Closer Look
IBM BigInsights Has Potential If It Lives Up To Its Promise By Prakash Sukumar, Principal Consultant at iolap, Inc. IBM released Hadoop-based InfoSphere BigInsights in May 2013. There are already Hadoop-based
EMC Backup and Recovery for Microsoft SQL Server 2008 Enabled by EMC Celerra Unified Storage
EMC Backup and Recovery for Microsoft SQL Server 2008 Enabled by EMC Celerra Unified Storage Applied Technology Abstract This white paper describes various backup and recovery solutions available for SQL
SummitStack in the Data Center
SummitStack in the Data Center Abstract: This white paper describes the challenges in the virtualized server environment and the solution that Extreme Networks offers a highly virtualized, centrally manageable
HDP Hadoop From concept to deployment.
HDP Hadoop From concept to deployment. Ankur Gupta Senior Solutions Engineer Rackspace: Page 41 27 th Jan 2015 Where are you in your Hadoop Journey? A. Researching our options B. Currently evaluating some
How To Store Data On An Ocora Nosql Database On A Flash Memory Device On A Microsoft Flash Memory 2 (Iomemory)
WHITE PAPER Oracle NoSQL Database and SanDisk Offer Cost-Effective Extreme Performance for Big Data 951 SanDisk Drive, Milpitas, CA 95035 www.sandisk.com Table of Contents Abstract... 3 What Is Big Data?...
Oracle Database 12c Plug In. Switch On. Get SMART.
Oracle Database 12c Plug In. Switch On. Get SMART. Duncan Harvey Head of Core Technology, Oracle EMEA March 2015 Safe Harbor Statement The following is intended to outline our general product direction.
Oracle Big Data SQL Technical Update
Oracle Big Data SQL Technical Update Jean-Pierre Dijcks Oracle Redwood City, CA, USA Keywords: Big Data, Hadoop, NoSQL Databases, Relational Databases, SQL, Security, Performance Introduction This technical
Non-Stop Hadoop Paul Scott-Murphy VP Field Techincal Service, APJ. Cloudera World Japan November 2014
Non-Stop Hadoop Paul Scott-Murphy VP Field Techincal Service, APJ Cloudera World Japan November 2014 WANdisco Background WANdisco: Wide Area Network Distributed Computing Enterprise ready, high availability
Block based, file-based, combination. Component based, solution based
The Wide Spread Role of 10-Gigabit Ethernet in Storage This paper provides an overview of SAN and NAS storage solutions, highlights the ubiquitous role of 10 Gigabit Ethernet in these solutions, and illustrates
