IBM System x reference architecture for Hadoop: MapR


May 2014

Beth L Hoffman and Billy Robinson (IBM)
Andy Lerner and James Sun (MapR Technologies)

Copyright IBM Corporation, 2014

Table of contents

Introduction
Business problem and business value
    Business problem
    Business value
Requirements
    Functional requirements
    Nonfunctional requirements
Architectural overview
Component model
Operational model
    Cluster nodes
    Networking
    Predefined configurations
    Number of nodes for MapR management services
    Deployment diagram
Deployment considerations
    Systems management
    Storage considerations
    Performance considerations
    Scaling considerations
    High availability considerations
    Migration considerations
Appendix 1: Bill of material
    Node
    Administration/Management network switch
    Data network switch
    Rack
    Cables
Resources
Trademarks and special notices

Introduction

This reference architecture is a predefined and optimized hardware infrastructure for MapR M7 Edition, a distribution of Hadoop with value-added capabilities produced by MapR Technologies. The reference architecture provides a predefined hardware configuration for implementing MapR M7 Edition on IBM System x hardware and IBM networking products. MapR M7 Edition is a complete Hadoop distribution supporting MapReduce, HBase, and Hadoop ecosystem workloads. MapReduce is a core component of Hadoop that provides an off-line, batch-oriented framework for high-throughput data access and distributed computation.

The predefined configuration provides a baseline configuration for a MapR cluster, which can be modified based on specific customer requirements, such as lower cost, improved performance, and increased reliability.

The intended audience of this document is IT professionals, technical architects, sales engineers, and consultants, to assist in planning, designing, and implementing the MapR solution on IBM System x. It is assumed that you have existing knowledge of Apache Hadoop components and capabilities. The Resources section provides links to Hadoop information.

Business problem and business value

This section describes the business problem associated with big data environments and the value that the MapR solution on IBM System x offers.

Business problem

Every day, around 2.5 quintillion bytes of data are created; 90% of the data in the world today has been created in the last two years alone. This data comes from everywhere: sensors used to gather climate information, posts to social media sites, digital pictures and videos, purchase transaction records, and cell phone global positioning system (GPS) signals, to name a few. This data is big data.

Big data spans three dimensions:

- Volume: Big data comes in one size: large. Enterprises are awash with data, easily amassing terabytes and even petabytes of information.
- Velocity: Often time-sensitive, big data must be used as it is streaming into the enterprise in order to maximize its value to the business.
- Variety: Big data extends beyond structured data to include unstructured data of all varieties: text, audio, video, click streams, log files, and more.

Big data is more than a challenge; it is an opportunity to find insight in new and emerging types of data, to make your business more agile, and to answer questions that, in the past, were beyond reach. Until now, there was no practical way to harvest this opportunity. Today, MapR uses the latest big data technologies, such as the massive MapReduce scale-out capabilities of Hadoop, to open the door to a world of possibilities.

Business value

MapR provides a Hadoop-based big data solution that is easy to manage, dependable, and fast. MapR eliminates the complexity of setting up and managing Hadoop. MapR provides alerts, alarms, and insights through an advanced graphical interface. The MapR Heatmap provides a clear view of cluster health and performance, and MapR volumes simplify data security, retention, placement, and quota management.

MapR provides Direct Access NFS, which allows users to mount the entire Hadoop cluster as an NFS volume, simplifying how an application can write to and read from a Hadoop cluster. MapR provides a high level of availability, including support for rolling upgrades, self-healing, and automated stateful failover. MapR also provides dependable data storage with full data protection and business continuity features, including snapshots and mirroring. MapR's enhanced security features ensure strong user authentication and authorization. Data confidentiality and integrity are also enforced, whether the data is in motion or at rest in the Hadoop cluster. Unique MapR functions improve MapReduce throughput.

MapR deployed on IBM System x servers with IBM networking components provides superior performance, reliability, and scalability. The reference architecture supports entry through high-end configurations and the ability to easily scale as the use of big data grows. A choice of infrastructure components provides flexibility in meeting varying big data analytics requirements.

Requirements

The functional and nonfunctional requirements for the MapR reference architecture are described in this section.

Functional requirements

The key functional requirements for a big data solution include:

- Support for a variety of application types, including batch and real-time analytics
- Support for industry-standard interfaces, so that existing applications can work with MapR
- Support for real-time streaming and processing of data
- Support for a variety of data types and databases
- Support for a variety of client interfaces
- Support for large volumes of data

MapR supports mission-critical and real-time big data analytics across different industries. MapR is used across financial services, retail, media, healthcare, manufacturing, telecommunications, and government organizations, and by leading Fortune 100 and Web 2.0 companies. The MapR platform for big data can be used for a variety of use cases, from batch applications that use MapReduce with data sources such as clickstreams, to real-time applications that use sensor data. The MapR platform for Apache Hadoop integrates a growing set of functions, including MapReduce, file-based applications, interactive SQL, NoSQL databases, search and discovery, and real-time stream processing. With MapR, data does not need to be moved to specialized silos for processing; data can be processed in place.
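To make the batch MapReduce pattern concrete, the following is a minimal word-count job written for Hadoop Streaming, which runs any executable that reads lines from stdin and emits tab-separated key/value pairs on stdout. This is an illustrative sketch: the input path, output path, and streaming jar location vary by distribution.

```python
#!/usr/bin/env python3
"""Minimal word count for Hadoop Streaming (illustrative paths).

Submit with something like (jar location varies by distribution):
  hadoop jar hadoop-streaming.jar \
      -input /wordcount/in -output /wordcount/out \
      -mapper "wordcount.py map" -reducer "wordcount.py reduce" \
      -file wordcount.py
"""
import sys


def mapper() -> None:
    # Emit one "word<TAB>1" pair per token; the framework sorts pairs
    # by key between the map and reduce phases (the shuffle).
    for line in sys.stdin:
        for word in line.split():
            print(f"{word}\t1")


def reducer() -> None:
    # Input arrives sorted by key, so all counts for a word are contiguous.
    current, count = None, 0
    for line in sys.stdin:
        word, _, n = line.rstrip("\n").partition("\t")
        if word != current and current is not None:
            print(f"{current}\t{count}")
            count = 0
        current = word
        count += int(n)
    if current is not None:
        print(f"{current}\t{count}")


if __name__ == "__main__":
    mapper() if sys.argv[1:] == ["map"] else reducer()
```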

Figure 1 shows the MapR key capabilities that meet the functional requirements stated earlier.

Figure 1: MapR key capabilities

MapReduce

MapR provides the required performance for MapReduce operations on Hadoop and publishes performance benchmarking results. The MapR architecture is built in C/C++ and harnesses distributed metadata with an optimized shuffle process, enabling MapR to deliver consistent high performance.

File-based applications

MapR is a 100% Portable Operating System Interface (POSIX) compliant system that fully supports random read-write operations. By supporting an industry-standard Network File System (NFS) interface, users can mount a MapR cluster and run any file-based application, written in any language, directly on the data residing in the cluster. All standard tools in the enterprise, including browsers, UNIX tools, spreadsheets, and scripts, can access the cluster directly without any modifications.

SQL

A number of applications support SQL access against data contained in MapR, including Hive, Hadapt, and others. MapR is also leading the development of Apache Drill, which brings American National Standards Institute (ANSI) SQL capabilities to Hadoop.

Database

MapR has removed the trade-offs that organizations face when looking to deploy a NoSQL solution. Specifically, MapR delivers ease of use, dependability, and performance advantages for HBase applications. MapR provides scalability, strong consistency, reliability, and continuous low latency with an architecture that does not require compactions or background consistency checks.

Search

MapR integrates enterprise-grade search. On a single platform, customers can now perform predictive analytics, full search and discovery, and advanced database operations.

The MapR enterprise-grade search capability works directly on Hadoop data, but can also index and search standard files without requiring any conversion or transformation.

Stream processing

MapR provides a simplified architecture for real-time stream computational engines such as Storm. Streaming data feeds can be written directly to the MapR platform for Hadoop for long-term storage and MapReduce processing.

Nonfunctional requirements

Customers require their big data solution to be easy, dependable, and fast. Here is a list of the key nonfunctional requirements:

- Easy
    - Ease of development
    - Easy management at scale
    - Advanced job management
    - Multitenancy
- Dependable
    - Data protection with snapshots and mirroring
    - Automated self-healing
    - Insight into software/hardware health and issues
    - High availability (HA) and business continuity (99.999% uptime)
- Fast
    - Superior performance
    - Scalability
- Secure
    - Strong authentication and authorization
    - Kerberos support
    - Data confidentiality and integrity

The MapR solution provides features and capabilities that fulfill the nonfunctional requirements requested by customers. The blue boxes in Figure 1 illustrate the key MapR capabilities that meet the nonfunctional requirements. MapR provides dependability, ease of use, and speed to Hadoop, NoSQL, database, and streaming applications in one unified big data platform. This full range of applications and data sources benefits from the MapR enterprise-grade platform and unified architecture for files and tables. The MapR platform provides high availability, data protection, and disaster recovery to support mission-critical applications. The MapR platform also makes it easier to use existing applications and solutions by supporting industry-standard interfaces, such as NFS. To support a diverse set of applications and users, MapR also provides multitenancy features and volume support.

These features include support for heterogeneous hardware within a cluster, and data and job placement control so that applications can be selectively run in a cluster to take advantage of faster processors or solid-state drives (SSDs).

The subsequent sections in this document describe the reference architecture that meets the business needs and the functional and nonfunctional requirements.

Architectural overview

Figure 2 shows the main features of the MapR IBM System x reference architecture. Users can log in to the MapR client from outside the firewall using Secure Shell (SSH) on port 22 to access the MapR solution in the corporate network. MapR provides several interfaces that allow administrators and users to perform administration and data functions, depending on their roles and access level. Hadoop application programming interfaces (APIs) can be used to access data. MapR APIs can be used for cluster management and monitoring. MapR data services, management services, and other services run on the nodes in the cluster. Storage is a component of each node in the cluster. Data can be ingested into MapR Hadoop storage either through the Hadoop APIs or through NFS, depending on the needs of the customer.

Figure 2: MapR architecture overview
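As an illustration of the NFS ingest path, the sketch below copies a local log file into the cluster through an NFS mount using nothing but standard file operations. The mount point, cluster name, and paths are hypothetical.

```python
import shutil

# With MapR Direct Access NFS mounted (for example:
#   mount -t nfs <nfs-gateway-node>:/mapr /mapr
# ), the cluster file system appears as an ordinary local path.
# The cluster name and paths below are hypothetical.
SOURCE = "/var/log/clickstream.log"
CLUSTER_DEST = "/mapr/my.cluster.com/data/ingest/clickstream.log"

# Any file-based tool works; a plain file copy is enough to ingest data.
shutil.copy(SOURCE, CLUSTER_DEST)

# Once copied, the same file is visible to MapReduce jobs and Hadoop APIs.
```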

Component model

MapR is a complete distribution that includes Spark, Shark, HBase, Pig, Hive, Mahout, Cascading, Sqoop, Flume, Storm, and more. The MapR distribution is 100% API-compatible with Hadoop (MapReduce, HDFS, and HBase). The M5 Edition includes advanced high availability and data protection features, such as JobTracker HA, No NameNode HA, snapshots, and mirroring. The M7 Edition supports enterprise mission-critical deployments. MapR M7 includes the following features:

MapR Control System and Heatmap: The MapR Control System (MCS) provides full visibility into cluster resources and activity. The MCS dashboard includes the MapR Heatmap, which provides visual insight into node health, service status, and resource utilization, organized by the cluster topology (for example, data centers and racks). Designed to manage large clusters with thousands of nodes, the MapR Heatmap shows the health of the entire cluster at a glance. Because the number of nodes, files, and volumes can be very high, filters and group actions are provided to select specific components and perform administrative actions directly. The Heatmap interfaces are designed for managing the smallest to the largest clusters, and command-line interface (CLI) and Representational State Transfer (REST) access is included as well.

MapR No NameNode high availability: The MapR Hadoop distribution is unique because it was designed for high availability. MapR is the only Hadoop distribution designed with no single point of failure. Other Hadoop distributions have a single primary NameNode, and when the NameNode goes down, the entire cluster becomes unavailable until the NameNode is restarted. In those cases where other distributions are configured with multiple NameNodes, the entire cluster becomes unavailable during the failover to a secondary NameNode. With MapR, file metadata is replicated and distributed, so that there is no data loss or downtime even in the face of multiple disk or node failures.

MapR JobTracker high availability: MapR JobTracker HA improves recovery time objectives and provides for a self-healing cluster. Upon failure, the MapR JobTracker automatically restarts on another node in the cluster. TaskTrackers automatically pause and then reconnect to the new JobTracker. Any currently running jobs or tasks continue without losing progress or failing.

MapR storage services: MapR stores data in a distributed shared system that eliminates contention and the expense of data transport and retrieval. Automatic, transparent client-side compression reduces network use and footprint on disk, while direct block device I/O provides throughput at hardware speed without additional resources. As an additional performance boost, with MapR, you can read files while they are still being written. The MapR No NameNode architecture scales linearly with the number of nodes, providing unlimited file support. Adding nodes increases the number of files supported, to more than a trillion files containing over 1,000 PB of data.

MapR Direct Access NFS: MapR Direct Access NFS makes Hadoop radically easier and less expensive to use by letting the user mount the Hadoop file system from a standard NFS client.

Unlike the write-once system found in other Hadoop distributions, MapR allows files to be modified and overwritten, and enables multiple concurrent reads and writes on any file. Users can browse files, automatically open associated applications with a mouse click, or drag files and directories into and out of the cluster. Additionally, standard command-line tools and UNIX applications and utilities (such as grep, tar, sort, and tail) can be used directly on data in the cluster. With other Hadoop distributions, the user must copy the data out of the cluster in order to use standard tools.

MapR job metrics: The MapR job metrics service provides in-depth access to the performance statistics of your cluster and the jobs that run on it. With MapR job metrics, you can examine trends in resource use, diagnose unusual node behavior, or examine how changes in your job configuration affect the job's execution.

MapR snapshots: MapR provides snapshots that are atomic and transactionally consistent. MapR snapshots provide protection from user and application errors, with flexible schedules to accommodate a range of recovery point objectives. MapR snapshots can be scheduled or performed on demand. Recovering from a snapshot is as easy as dragging the directory or files to the current directory. MapR snapshots offer high performance and space efficiency: no data is copied in order to create a snapshot. As a result, a snapshot of a petabyte volume can be performed in seconds. A snapshot operation does not have any impact on write performance, because MapR uses redirect-on-write to implement snapshots. All writes in MapR go to new blocks on disk, which means that a snapshot only needs to retain references to the old blocks and does not require copying data blocks.

MapR mirroring: MapR makes data protection easy and built in. Going far beyond replication, MapR mirroring means that you can set policies around your recovery time objectives (RTO) and mirror your data automatically within your cluster, between clusters (such as a production and a research cluster), or between sites.

MapR volumes: MapR volumes make cluster data both easy to access and easy to manage by grouping related files and directories into a single tree structure so they can be easily organized, managed, and secured. MapR volumes provide the ability to apply policies, including: replication factor, scheduled mirroring, scheduled snapshots, data placement and topology control, quotas and usage tracking, and administrative permissions.

MapR tables: MapR-FS enables you to create and manipulate tables in many of the same ways that you create and manipulate files in a standard UNIX file system. MapR tables are implemented directly within MapR-FS, yielding a familiar and open-standards API that provides a high-performance data store for tables. MapR-FS is written in C and optimized for performance; as a result, MapR-FS runs significantly faster than JVM-based Apache HBase. With M7, there are no region servers, additional processes, or any redundant layer between the application and the data residing in the cluster. M7's zero-administration approach includes automatic region splits and self-tuning, with no downtime required for any operation, including schema changes. MapR's implementation of the HBase API provides enterprise-grade high availability (HA), data protection, and disaster recovery features for tables on a distributed Hadoop cluster. MapR tables can be used as the underlying key-value store for Hive, or for any other application requiring a high-performance, high-availability key-value data store. Because MapR tables are API-compatible with HBase, many legacy HBase applications can continue to run without modification.
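The redirect-on-write snapshot mechanism described above can be illustrated with a toy block store: a snapshot copies only block references, and later writes allocate new blocks, which is why snapshot creation is fast and space-efficient regardless of volume size. This is a conceptual sketch, not MapR code.

```python
# Toy block store illustrating redirect-on-write snapshots.
class Volume:
    def __init__(self):
        self.blocks = {}      # block id -> immutable content
        self.index = {}       # file name -> block id
        self.snapshots = {}   # snapshot name -> copy of index (references only)
        self._next = 0

    def write(self, name, data):
        # Redirect-on-write: every write goes to a freshly allocated block.
        self.blocks[self._next] = data
        self.index[name] = self._next
        self._next += 1

    def snapshot(self, snap):
        # O(metadata): copy references, never block contents.
        self.snapshots[snap] = dict(self.index)

    def read(self, name, snap=None):
        idx = self.snapshots[snap] if snap else self.index
        return self.blocks[idx[name]]


v = Volume()
v.write("f", "v1")
v.snapshot("s1")
v.write("f", "v2")                      # new block; "s1" still references the old one
print(v.read("f"), v.read("f", "s1"))   # prints: v2 v1
```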

These features are implemented by a set of MapR components, as shown in the MapR component model in Figure 3.

Figure 3: MapR component model

MapR includes several open source projects, many of which are shown in the Hadoop ecosystem box:

- Apache Flume: A distributed, reliable, and highly available service for efficiently moving large amounts of data around a cluster.
- Apache Spark: A fast and general engine for large-scale data processing.
- Shark: An open source distributed SQL query engine for Hadoop data. It brings state-of-the-art performance and advanced analytics to Hive users.
- Apache Hadoop: A framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models.
- Apache HBase: The Hadoop database; a distributed, scalable, big data store.
- Apache HCatalog: A table and storage management service for data created using Apache Hadoop.
- Apache Hive: A data warehouse system for Hadoop that facilitates easy data summarization, ad hoc queries, and the analysis of large data sets stored in Hadoop-compatible file systems.
- Apache Mahout: A scalable machine learning library.
- Apache Oozie: A workflow coordination manager.
- Apache Pig: A language and runtime for analyzing large data sets, consisting of a high-level language for expressing data analysis programs and an infrastructure for evaluating those programs.

- Apache Sqoop: A tool designed for efficiently transferring bulk data between Apache Hadoop and structured data stores such as relational databases.
- Apache Whirr: A set of libraries for running cloud services.
- Apache ZooKeeper: A distributed service for maintaining configuration information, providing distributed synchronization, and providing group services.
- Cascading: An application framework for Java developers to quickly and easily develop robust data analytics and data management applications on Apache Hadoop.

MapR data services

All cluster nodes can run data (worker) services, which run the MapReduce tasks over a subset of the data. Cluster nodes also store the MapR File System (MapR-FS) data. In large clusters, some nodes may be dedicated to management services. Each node hosts the following data services:

- MapR-FS: Provides distributed file services.
- TaskTracker: Runs map and reduce tasks for MapReduce clusters.
- HBase RegionServer (optional): Maintains and serves table regions in an HBase cluster.

MapR management services

MapR management services can run on any node. In multirack configurations, these services should be on nodes spread across racks. Here is the list of MapR management services:

- Container Location Database (CLDB): Manages and maintains container location information and replication.
- JobTracker: A Hadoop service that farms out MapReduce tasks to specific nodes in a MapReduce cluster.
- ZooKeeper: Coordinates activity and keeps track of the locations of management services on the cluster.
- HBase Master (optional): Monitors all RegionServer instances in an HBase cluster and manages metadata changes. For high availability, more than one node can be configured to run HBase Master. In this case, only one HBase Master is active, while additional HBase Masters are available to take over HBase management if the active HBase Master fails.

Other optional MapR services

MapR offers two other optional services that can be run on multiple nodes in the cluster:

- NFS server (gateway): Provides NFS access to the distributed file system. The NFS server is often run on all nodes in the cluster to allow local mounting of the cluster file system from any node.
- WebServer: Provides the MapR Control System graphical user interface (GUI) and REST API.

MapR provides support for many client interfaces, several of which were described in the architecture overview or feature list. Open Database Connectivity (ODBC) and Java Database Connectivity (JDBC) can be used to access data in the MapR Hadoop cluster. A MapR CLI provides an additional way to manage the cluster and services.
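The WebServer's REST API can be scripted as well as browsed. The sketch below lists node health over REST; it assumes the MCS convention of endpoints that mirror maprcli commands (such as /rest/node/list) served on port 8443 with HTTP basic authentication. Verify the paths, port, credentials, and response fields against the MapR documentation for your release.

```python
import json
import ssl
import urllib.request

# Assumption: the MCS web server listens on port 8443 and exposes REST
# endpoints mirroring maprcli commands (e.g., /rest/node/list).
MCS_URL = "https://mcs-node.example.com:8443/rest/node/list"

password_mgr = urllib.request.HTTPPasswordMgrWithDefaultRealm()
password_mgr.add_password(None, MCS_URL, "mapr", "secret")
auth = urllib.request.HTTPBasicAuthHandler(password_mgr)

# MCS often uses a self-signed certificate; disable verification only in a lab.
ctx = ssl.create_default_context()
ctx.check_hostname = False
ctx.verify_mode = ssl.CERT_NONE

opener = urllib.request.build_opener(auth, urllib.request.HTTPSHandler(context=ctx))
with opener.open(MCS_URL) as resp:
    result = json.load(resp)

# The "data" array and field names below are assumptions to verify.
for node in result.get("data", []):
    print(node.get("hostname"), node.get("health"))
```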

Hadoop and the MapR solution are operating system independent. MapR supports many Linux operating systems; Red Hat Linux and SUSE Linux are supported with the IBM System x reference architecture. Details about the supported versions can be found in the MapR documentation.

Operational model

This section describes the operational model for the MapR reference architecture. The operational model focuses on a cluster of nodes to support a given amount of data. To illustrate the operational model for different sized customer environments, four models are provided for supporting different amounts of data. Throughout the document, these are referred to as the starter rack, half rack, full rack, and multirack configuration sizes. Note that the multirack size is three times larger than the full rack size; data amounts between them can therefore be extrapolated by using different multiples of the full rack operational model.

A predefined configuration for the MapR solution consists of cluster nodes, networking, power, and a rack. The predefined configurations can be implemented as is or modified based on specific customer requirements, such as lower cost, improved performance, and increased reliability. Key workload requirements, such as the data growth rate, sizes of data sets, and data ingest patterns, help in determining the proper configuration for a specific deployment. A best practice when designing a MapR cluster infrastructure is to conduct the necessary testing and proofs of concept against representative data and workloads to ensure that the proposed design achieves the necessary success criteria.

Cluster nodes

The MapR reference architecture is implemented on a set of nodes that make up a cluster. Nodes are implemented on System x3650 M4 BD servers with locally attached storage. MapR runs well on a homogeneous server environment, with no need for different hardware configurations for management and data services. Server nodes can run three different types of services:

- Data (worker) services for storing and processing data
- Management (control) services for coordinating and managing the cluster
- Miscellaneous services (optional) for file and web serving

Unlike other Hadoop distributions that require different server configurations for management nodes and data nodes, the MapR reference architecture requires only a single MapR node hardware configuration. Each node is then configured to run one or more of the mentioned services. In large clusters, a node can be dedicated to running only management services.

Each node in the cluster is an IBM System x3650 M4 BD server, made up of the components shown in Table 1. The Intel Xeon processor E v2 is recommended to provide sufficient performance. A minimum of 64 GB of memory is recommended for most MapReduce workloads, with 128 GB or more recommended for HBase, Spark, and memory-intensive MapReduce workloads. A choice of 3 TB or 4 TB drives is suggested, depending on the amount of data that needs to be stored.

For the hard disk drive (HDD) controller, JBOD (just a bunch of disks) is the best choice for a MapR cluster. It provides excellent performance and, when combined with the Hadoop default of 3x data replication, also provides significant protection against data loss. The use of Redundant Array of Independent Disks (RAID) with data disks is discouraged with MapR; MapR provides an automated way to set up and manage storage pools. (RAID can be used to mirror the OS, as described in a later section.) Nodes can be customized according to client needs.

Component: Predefined configuration
System: System x3650 M4 BD
Processor: 2 x Intel Xeon processor E v2 2.6 GHz 8-core
Memory (base): 64 GB (8 x 8 GB 1866 MHz RDIMM) minimum
Disk (OS) [a]: 2 x 3 TB NL SATA 3.5 inch (with 3 TB data drives) or 2 x 4 TB NL SATA 3.5 inch (with 4 TB data drives)
Disk (data) [b]: 12 x 3 TB NL SATA 3.5 inch (36 TB total) or 12 x 4 TB NL SATA 3.5 inch (48 TB total)
HDD controller: 6Gb JBOD controller
Hardware storage protection: None (JBOD). By default, MapR maintains a total of three copies of data stored within the cluster. The copies are distributed across data servers and racks for fault recovery.
Hardware management network adapter: Integrated 1GBaseT adapter
Data network adapter: Mellanox ConnectX-3 EN dual-port SFP+ 10GbE adapter

Table 1: Node predefined configuration

[a] OS drives may be of a smaller size than the data drives. OS drives are used for the operating system, the MapR application, and software monitoring. A separate RAID controller can be configured for the two OS drives.
[b] Data drives should all be of the same size: either all 3 TB or all 4 TB.

The reference architecture recommends the storage-rich System x3650 M4 BD model for several reasons:

- Storage capacity: The nodes are storage-rich. Each of the fourteen 3.5 inch drives has a raw capacity of up to 4 TB, for a total of 56 TB per node and over a petabyte per rack.
- Performance: This hardware supports the latest Intel Xeon processors based on the Intel Ivy Bridge microarchitecture.
- Flexibility: Server hardware uses embedded storage, resulting in simple scalability; just add more nodes.
- More PCIe slots: Up to two PCIe slots are available. They can be used for network adapter redundancy and increased network throughput.
- Better power efficiency: The server offers a common form factor (CFF) power supply unit (PSU), including 80 PLUS Platinum options.

Networking

The reference architecture specifies two networks: a data network and an administrative/management network. Figure 4 shows the networking configuration for MapR installed on a cluster with one rack.

Data network

The data network is a private cluster data interconnect among nodes, used for data access, moving data across nodes within the cluster, and ingesting data into the MapR file system. The MapR cluster typically connects to the customer's corporate data network. One top-of-rack switch is required for the data network used by MapR. Either a 1GbE or 10GbE switch can be used: a 1Gb Ethernet switch is sufficient for some workloads, while a 10Gb Ethernet switch can provide extra I/O bandwidth for added performance. This rack switch for the data network has two physical, aggregated links connected to each of the nodes in the cluster. The data network is a private virtual local area network (VLAN) or subnet.

The two Mellanox 10GbE ports of each node can be link-aggregated to the recommended G8264 rack switch for better performance and improved HA. Alternatively, MapR can automatically take advantage of multiple data networks provided by multiple switches. Even when only one switch is used, because it has ports available and cluster nodes have multiple data network interfaces, it can be configured with multiple subnets to allow multiple network connections to each node. MapR automatically uses additional network connections to increase bandwidth.

The recommended 10GbE switch is the IBM System Networking RackSwitch G8264. The recommended 1GbE switch is the IBM RackSwitch G8052. The enterprise-level IBM RackSwitch G8264 has the following characteristics:

- Up to sixty-four 1Gb/10Gb SFP+ ports
- Four 40Gb QSFP+ ports
- Support for 1.28 Tbps of non-blocking throughput
- Energy-efficient, cost-effective design
- Optimized for applications requiring high bandwidth and low latency

The customer can optionally use edge nodes to exchange data with the cluster. Edge nodes can also help control access to the cluster and can be configured to control access to the data network, the administration network, or both.

Hardware management network

The hardware management network is a 1GbE network used for in-band OS administration and out-of-band hardware management. In-band administrative services, such as SSH or Virtual Network Computing (VNC), running on the host operating system allow administration of cluster nodes. Out-of-band management, through the integrated management module II (IMM2) within the System x3650 M4 BD server, allows hardware-level management of cluster nodes, such as node deployment or basic input/output system (BIOS) configuration. Hadoop has no dependency on the IMM2.

Based on customer requirements, the administration links and management links can be segregated onto separate VLANs or subnets. The administrative/management network is typically connected directly to the customer's administrative network. When the in-band administrative services on the host operating system are used, MapR should be configured to use only the data network; by default, MapR uses all available network interfaces.

The reference architecture requires one 1Gb Ethernet top-of-rack switch for the hardware management network. The two 10Gb uplinks between the G8052 and G8264 top-of-rack switches (shown in Figure 4) are optional; they can be used in customer environments that require faster routing and access over the administration network to all the nodes in the cluster. Administration users are also able to access all the nodes in the cluster through the customer admin network, as shown in Figure 5.

This rack switch for the hardware management network is connected to each of the nodes in the cluster using two physical links: one for in-band OS administration and one for out-of-band IMM2 hardware management. On the nodes, the administration link should connect to port 1 on the integrated 1GBaseT adapter, and the management link should connect to the dedicated IMM2 port.

The recommended switch is the IBM RackSwitch G8052, with the following features:

- Forty-eight 1GbE RJ45 ports
- Four standard 10GbE SFP+ ports
- Low 130 W power rating and variable speed fans to reduce power consumption

Figure 4: MapR rack network configuration

Multirack network

The data network in the predefined reference architecture configuration consists of a single network topology. The single G8264 data network switch within each rack represents a single point of failure. This challenge can be addressed by building a redundant data network using an additional IBM RackSwitch G8264 top-of-rack switch per rack and appropriate additional IBM RackSwitch G8316 core switches per cluster. In this case, the second Mellanox 10GbE port can be connected to the second IBM RackSwitch G8264.

Figure 5 shows how the network is configured when the MapR cluster is installed across more than one rack. The data network is connected across racks by two aggregated 40GbE uplinks from each rack's G8264 switch to a core G8316 switch. A 40GbE switch is recommended for interconnecting the data network across multiple racks; the IBM System Networking RackSwitch G8316 is the recommended switch. A best practice is to have redundant core switches for each rack to avoid a single point of failure. Within each rack, the G8052 switch can optionally be configured with two uplinks to the G8264 switch to allow propagation of the administrative/management VLAN across cluster racks through the G8316 core switch.

For large clusters, the IBM System Networking RackSwitch G8332 is recommended because it provides a better cost per 40Gb port than the G8316. Figure 6 shows a large cluster example using the G8332 core switch.

Many other cross-rack network configurations are possible and may be required to meet the needs of specific deployments or to address clusters larger than three racks. If the solution is initially implemented as a multirack solution, or if the system grows by adding racks, the nodes that provide management services should be distributed across racks to maximize fault tolerance.

Figure 5: MapR cross-rack network configuration

Figure 6: MapR large cluster configuration

Predefined configurations

Four predefined configurations for the MapR reference architecture are highlighted in Table 2. The table shows the amount of space for data and the number of nodes that each predefined configuration provides. Storage space is described in two ways: the total amount of raw storage space when using 3 TB or 4 TB drives (raw storage), and the amount of space for the customer's data (available data space). Available data space assumes the use of Hadoop replication with three copies of the data, and 25% of capacity reserved for efficient file system operation and to allow time to increase capacity if needed. Available data space might increase significantly with MapR automatic compression. The estimates in Table 2 do not include additional space freed up by using compression, because compression rates can vary widely based on file contents.

                          Starter rack   Half rack   Full rack   Multirack
Storage space using 3 TB drives
  Raw storage                108 TB        360 TB      720 TB     2,160 TB
  Available data space        27 TB         90 TB      180 TB       540 TB
Storage space using 4 TB drives
  Raw storage                144 TB        480 TB      960 TB     2,880 TB
  Available data space        36 TB        120 TB      240 TB       720 TB
Number of nodes                   3            10          20           60

Table 2: Predefined configurations
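The available data space rows follow directly from those two stated assumptions (one third of raw capacity after three-way replication, minus the 25% reserve):

\[
\text{available} = \text{raw} \times \frac{1}{3} \times (1 - 0.25) = \frac{\text{raw}}{4},
\qquad \text{for example, } \frac{720~\text{TB}}{4} = 180~\text{TB for the full rack with 3 TB drives.}
\]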

Table 2 also shows the number of nodes required in the cluster to support each of the four predefined configurations. These are estimates for highly available clusters. Three nodes are required to support a customer deployment that has 36 TB of data, ten nodes are needed to support a customer deployment that has 120 TB of data, and so on. When estimating disk space within a MapR Hadoop cluster, consider the following:

- For improved fault tolerance and improved performance, the MapR file system replicates data blocks across multiple cluster data nodes. By default, the file system maintains three replicas.
- The compression ratio is an important consideration in estimating disk space and can vary greatly based on file contents. MapR provides automatic compression, and available data space might increase significantly with it. If the customer's data compression ratio is not available, assume a compression ratio of 2.5.
- To ensure efficient file system operation and to allow time to add more storage capacity to the cluster if necessary, reserve 25% of the total capacity of the cluster.

Assuming the default three replicas maintained by the MapR file system, the raw data disk space and the required number of nodes can be estimated using the following equations:

Total raw data disk space = (uncompressed user data) x (4 / compression rate)
Total required nodes = (total raw data disk space) / (raw data disk per node)

You should also consider future growth requirements when estimating disk space. Based on these sizing principles, Table 3 demonstrates an example for a cluster that needs to store 250 TB of uncompressed user data. The example shows that the MapR cluster needs 400 TB of raw disk to support 250 TB of uncompressed data. The 400 TB is for data storage and does not include OS disk space. Nine nodes, or nearly a half rack, would be required to support a deployment of this size.

Description                                      Value
Size of uncompressed user data                   250 TB
Compression rate                                 2.5x
Size of compressed data                          100 TB
Storage multiplication factor                    4
Raw data disk space needed for MapR cluster      400 TB
  - Storage needed for MapR-FS 3x replication    300 TB
  - Storage reserved for headroom                100 TB
Raw data disk per node (with 4 TB drives)        48 TB
Minimum number of nodes required                 9

Table 3: Example of storage sizing with 4 TB drives
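The two equations above translate directly into a small sizing helper. This is a sketch of the sizing rules in this section; the defaults assume the predefined node with twelve 4 TB data drives and the suggested 2.5x compression ratio.

```python
import math


def mapr_cluster_sizing(user_data_tb: float,
                        compression_rate: float = 2.5,
                        raw_disk_per_node_tb: float = 48.0):
    """Estimate raw disk and node count: 3x replication plus 25%
    headroom gives the multiplication factor of 4 used above."""
    raw_tb = user_data_tb * 4 / compression_rate
    nodes = math.ceil(raw_tb / raw_disk_per_node_tb)
    return raw_tb, nodes


# Reproduces the Table 3 example: 250 TB of uncompressed user data.
raw, nodes = mapr_cluster_sizing(250)
print(f"{raw:.0f} TB raw data disk, {nodes} nodes")  # 400 TB raw data disk, 9 nodes
```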

Number of nodes for MapR management services

The nodes run MapR data services, management services, and other optional services. The number of nodes recommended for running management services and data services varies based on the size of the configuration. MapR is very flexible in its ability to use any node for any management service. Depending on workloads and HA requirements, multiple nodes can be dedicated to a single management service, to multiple management services, or to both management and data services. The number of nodes running management services can be customized based on specific workloads and HA requirements. Table 4 shows the number of nodes that should run management services, depending on the size of the cluster.

Fewer than 40 nodes: Dedicated management nodes are not required. Management services run on nodes that also run data and optional services.
- CLDB on two or three nodes.
- JobTracker or HBase Master on two or three nodes.
- ZooKeeper on three nodes (run ZooKeeper on an odd number of nodes).
- Reduce the number of task slots on servers running both data and management services to ensure that processor and memory resources are available to the management services.
- WebServer and NFS server can also be run on nodes running management services.
- For faster failover recovery times, avoid running ZooKeeper and CLDB services on the same servers.

40 to 100 nodes: Dedicated management nodes are not required. Management services run on nodes that also run data and optional services.
- CLDB on two to four nodes.
- JobTracker or HBase Master on two or three nodes.
- ZooKeeper on three or five nodes (run ZooKeeper on an odd number of nodes).
- Reduce the number of task slots on servers running both data and management services to ensure that processor and memory resources are available to the management services.
- WebServer and NFS server can also be run on nodes running management services.
- Spread management services across racks and across nodes to avoid running an instance of each management service on a single node.
- For faster failover recovery times, avoid running ZooKeeper and CLDB services on the same servers.

More than 100 nodes: Dedicate nodes to management services. Do not run data or optional services on the nodes running management services.
- CLDB on three or more nodes.
- JobTracker or HBase Master on two or three nodes.
- ZooKeeper on five nodes (run ZooKeeper on an odd number of nodes).
- On very large clusters, dedicate nodes to running only CLDB.
- On very large clusters, dedicate nodes to running only ZooKeeper.
- Spread management services across racks and across nodes to avoid running an instance of each management service on a single node.

Table 4: Number of nodes running MapR management services

In clusters of up to 100 nodes, management services typically reside on nodes that also provide data services. For a very small cluster that does not require failover or high availability, all the management services and data services can run on one node. However, HA is recommended and requires at least three nodes running management services. Even if high availability is not required, MapR M7 provides snapshots, mirrors, and multiple NFS servers. Also, the HA features of M7 provide a mechanism for administrators to perform rolling upgrades of management processes without any downtime on the cluster.

A small cluster with fewer than 40 nodes should be set up to run the management services on three to five nodes for high availability. A medium-sized cluster should be set up to run the management services on at least five nodes, with ZooKeeper, CLDB, and JobTracker nodes distributed across racks. This environment provides failover and high availability for all critical services. For clusters of over 100 nodes, some nodes can be dedicated to management services. A large cluster can be set up to run the management services on a minimum of seven nodes, with these nodes distributed across racks. In a large cluster, isolate the CLDB service from other services by placing it on dedicated nodes. In addition, in large clusters, ZooKeeper services should be isolated from other services on dedicated nodes. To reduce recovery time upon node failure, avoid running CLDB and ZooKeeper on the same node.
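As a planning aid, the guidance in Table 4 can be encoded in a few lines. The cluster-size boundaries of 40 and 100 nodes are taken from the surrounding text, and the output is only a starting point to be tuned for workload and HA requirements.

```python
def management_service_plan(node_count: int) -> dict:
    """Rough MapR management-service placement per Table 4 (a sketch;
    the 40- and 100-node boundaries follow the guidance in the text)."""
    if node_count < 40:
        return {"dedicated_management_nodes": False,
                "cldb": "2-3 nodes",
                "jobtracker_or_hbase_master": "2-3 nodes",
                "zookeeper": "3 nodes (always an odd number)"}
    if node_count <= 100:
        return {"dedicated_management_nodes": False,
                "cldb": "2-4 nodes",
                "jobtracker_or_hbase_master": "2-3 nodes",
                "zookeeper": "3 or 5 nodes (always an odd number)"}
    return {"dedicated_management_nodes": True,
            "cldb": "3+ nodes (CLDB-only nodes on very large clusters)",
            "jobtracker_or_hbase_master": "2-3 nodes",
            "zookeeper": "5 nodes (dedicated on very large clusters)"}


print(management_service_plan(60))
```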

Deployment diagram

The reference architecture for the MapR solution requires two separate networks: one for hardware management and one for data. Each network requires a rack switch. An additional data network switch can be configured for high availability and increased throughput. One rack switch occupies 1U of space in the rack.

Figure 7 shows an overview of the architecture in three different one-rack cluster sizes without network redundancy: a starter rack, a half rack, and a full rack. Figure 8 shows a multirack cluster without network redundancy. The intent of the four predefined configurations is to ease initial sizing for customers and to show example starting points for different sized workloads. The reference architecture is not limited to these four cluster sizes.

The starter rack configuration consists of three nodes and a pair of rack switches. The half rack configuration consists of 10 nodes and a pair of rack switches. The full rack configuration (a rack fully populated) consists of 20 nodes and a pair of rack switches. The multirack configuration contains a total of 60 nodes: 20 nodes and a pair of switches in each rack. A MapR implementation can be deployed as a multirack solution to support larger workload and deployment requirements. The Networking section of this document shows the networking configuration across multiple nodes and multiple racks.

Figure 7: Starter rack, half rack, and full rack MapR predefined configurations

Figure 8: Multirack MapR predefined configuration

Deployment considerations

The predefined node configuration can be tailored to match customer requirements. Table 5 shows common ways to adjust the predefined configuration. The Value options are for customers who require a cost-effective solution. The Performance options are for customers who want top performance. The Enterprise options offer redundancy. A combination of options, such as Value and Enterprise or Performance and Enterprise, can be considered as well.

The Intel Xeon processor E v2 is recommended to provide sufficient performance; however, smaller or larger processors can be used. A minimum of 64 GB of memory is recommended for most MapReduce workloads, with 96 GB or more recommended for HBase and memory-intensive MapReduce workloads. A choice of 3 TB or 4 TB drives is suggested, depending on the amount of data that needs to be stored. IBM also offers 2 TB drives, which may meet the storage density requirements of some big data analytics workloads. SAS drives may be substituted for Serial Advanced Technology Attachment (SATA) drives, but IBM has found that SATA drives offer the same performance for less cost.

Node (processor):
- Value: 2 x Intel Xeon processor E v2 2.6 GHz 6-core, with 1600 MHz RDIMMs
- Enterprise: 2 x Intel Xeon processor E v2 2.6 GHz 8-core, with 1866 MHz RDIMMs
- Performance: 2 x Intel Xeon processor E v2 2.6 GHz 8-core or 2 x Intel Xeon processor E v2 10-core, with 1866 MHz RDIMMs

Memory (base):
- Value: 64 GB (8 x 8 GB)
- Enterprise: 64 GB (8 x 8 GB)
- Performance: 64 GB (8 x 8 GB) or 128 GB (16 x 8 GB)

Disk (OS):
- Value: 1 x 3 TB 3.5 inch
- Enterprise: 2 x 3 TB 3.5 inch (mirrored)
- Performance: 1 x 3 TB 3.5 inch

Disk (data), all options: 36 TB (12 x 3 TB NL SATA 3.5 inch) or 48 TB (12 x 4 TB NL SATA 3.5 inch)

HDD controller, all options: 6Gb JBOD controller

Available data space per node*, all options: 9 to 22 TB with 3 TB drives, or 12 to 30 TB with 4 TB drives

Data network switch:
- Value: 1GbE switch with 4 x 10GbE uplinks (IBM G8052)
- Enterprise: redundant switches
- Performance: 10GbE switch with 4 x 40GbE uplinks (IBM G8264)

Hardware management network switch, all options: 1GbE switch with 4 x 10GbE uplinks (IBM G8052)

Table 5: Additional hardware options

* Available data space assumes the use of Hadoop replication with three copies of the data, and 25% of capacity reserved for efficient file system operation and to allow time to increase capacity if needed. Available data space may increase significantly with MapR automatic compression. Because compression can vary widely by file contents, this estimate provides a range from no compression up to 2.5x compression; some data may compress even further.

Systems management

The mechanism for systems management within the MapR solution is different from other Apache Hadoop distributions. The standard Hadoop distribution places the management services on servers separate from the data servers. In contrast, MapR management services are distributed across the same System x nodes that are used for data services.

Storage considerations

Each server node in the reference architecture has internal, directly attached storage. External storage is not used in this reference architecture. In situations where higher storage capacity is required, the main design approach is to increase the amount of data disk space per node. Using 4 TB drives instead of 3 TB drives increases the total per-node data disk capacity from 36 TB to 48 TB, a 33% increase. Consider using the same size drives for the OS to simplify maintenance to one type of disk drive.

When increasing data disk capacity, you must be cognizant of the balance between performance and throughput. For some workloads, increasing the amount of user data stored per node can decrease disk parallelism and negatively impact performance. In this case, and when 4 TB drives provide insufficient capacity, higher capacity can be achieved by increasing the number of nodes in the cluster.

Performance considerations

There are two main approaches to increasing cluster performance: increasing node memory, and using a high-performance job scheduler and MapReduce framework. Often, improving performance comes at increased cost, and you need to consider the cost/benefit trade-offs of designing for higher performance. In the MapR predefined configuration, node memory can be increased to 96 GB by using twelve 8 GB RDIMMs. Even larger memory configurations might provide greater performance, depending on the workload.

Architecting for lower cost

Two key modifications can be made to lower the cost of a MapR reference architecture solution. When considering lower-cost options, it is important to ensure that customers understand the potential performance implications of a lower-cost design. A lower-cost version of the MapR reference architecture can be achieved by using lower-cost node processors and a lower-cost cluster data network infrastructure.

The node processors can be substituted with the Intel Xeon E v2 2.6 GHz 6-core processor. This processor requires 1600 MHz RDIMMs, which may also lower the per-node cost of the solution.

Using a lower-cost network infrastructure can significantly lower the cost of the solution, but can also have a substantial negative impact on intra-cluster data throughput and cluster ingest rates. To use a lower-cost network infrastructure, make the following substitutions to the predefined configuration:

- Within each node, replace the Mellanox 10GbE dual-port SFP+ network adapter with the additional ports on the integrated 1GBaseT adapter within the System x3650 M4 BD server.
- Within each rack, replace the IBM RackSwitch G8264 top-of-rack switch with the IBM RackSwitch G8052.
- Within each cluster, replace the IBM RackSwitch G8316 core switch with the IBM RackSwitch G8264.

Though the network wiring schema is the same as that described in the Networking section, the media types and link speeds within the data network change. The data network within a rack that connects the cluster nodes to the lower-cost G8052 top-of-rack switch is now based on two aggregated 1GBaseT links per node. The physical interconnect between the admin/management network and the data network within each rack is now based on two aggregated 1GBaseT links between the admin/management network G8052 switch and the lower-cost data network G8052 switch. Within a cluster, the racks are interconnected through two aggregated 10GbE links between the substitute G8052 data network switch in each rack and a lower-cost G8264 core switch.

Architecting for high ingest rates

Architecting for high ingest rates is not a trivial matter. It is important to have a full characterization of the ingest patterns and volumes. The following questions provide guidance on the key factors that affect ingest rates:

- On what days and at what times are the source systems available or not available for ingest?
- When a source system is available for ingest, for what duration does it remain available?
- Do other factors affect the day, time, and duration constraints on ingest?
- When ingests occur, what is the average and maximum size of ingest that must be completed?
- What factors affect ingest size?
- What is the format of the source data (structured, semi-structured, unstructured)?
- Are there any data transformation or cleansing requirements that must be met during ingest?

Scaling considerations

The Hadoop architecture is designed to be linearly scalable. When the capacity of the existing infrastructure is reached, the cluster can be scaled out by simply adding nodes. Typically, identically configured nodes are best, to maintain the same ratio of storage and compute capabilities. A MapR cluster is scaled by adding System x3650 M4 BD nodes and network switches, and optionally adding management services and optional services on those nodes. The MapR No NameNode architecture allows linear scalability to trillions of files and thousands of petabytes. As the capacity of existing racks is reached, new racks can be added to the cluster. Note, however, that some workloads may not scale completely linearly.

When designing a new MapR reference architecture implementation, future scale-out should be a key consideration in the initial design. There are two key aspects to consider: networking and management. Both of these aspects are critical to cluster operation and become more complex as the cluster infrastructure grows.

The cross-rack networking configuration described in Figure 8 is designed to provide robust network interconnection of racks within the cluster. As additional racks are added, the predefined networking topology remains balanced and symmetrical. If there are plans to scale the cluster beyond one rack, a best practice is to initially design the cluster with multiple racks, even if the initial number of nodes would fit within one rack. Starting with multiple racks can enforce proper network topology and prevent future reconfiguration and hardware changes. As racks are added over time, multiple G8316 switches may be required for greater scalability and balanced performance.

Also, as the number of nodes within the cluster increases, so do many of the tasks of managing the cluster, such as updating node firmware or operating systems. Building a cluster management framework as part of the initial design and proactively considering the challenges of managing a large cluster will pay off significantly in the long run. xCAT, an open source project that IBM supports, is a scalable distributed computing management and provisioning tool that provides a unified interface for hardware control, discovery, and operating system deployment.

Within the MapR reference architecture, the System x server integrated management modules (IMM2) and the cluster management network provide an out-of-band management framework that management tools, such as xCAT, can use to facilitate or automate the management of cluster nodes. Training is required to fully use the capabilities of xCAT. See the Resources section for more information about xCAT.

Proactive planning for future scale-out and the development of a cluster management framework as part of the initial cluster design provide a foundation for future growth that can minimize hardware reconfiguration and cluster management issues as the cluster grows.

High availability considerations

When implementing a MapR cluster on System x, consider availability requirements as part of the final hardware and software configuration. Typically, Hadoop is considered a highly reliable solution; MapR enhancements make it highly available. Hadoop and MapR best practices provide significant protection against data loss, and MapR ensures that failures are managed without causing an outage. Redundancy can be added to make a cluster even more reliable. Some consideration should be given to both hardware and software redundancy.

Networking considerations

If network redundancy is a requirement, use an additional switch in the data network. Optionally, a second redundant switch can be added to ensure high availability of the hardware management network. The hardware management network does not affect the availability of MapR-FS or Hadoop functionality, but it may affect the management of the cluster, so availability requirements must be considered.

MapR provides application-level network interface card (NIC) bonding for higher throughput and high availability. Customers can choose either MapR application-level bonding, or OS-level bonding with switch-based aggregation of some form matching the OS bonding configuration when using multiple NICs. Virtual Link Aggregation Groups (vLAG), an IBM BNT switch feature, can be used between redundant switches. If 1Gbps data network links are used, it is recommended to use more than one per node to increase throughput.

Hardware availability considerations

With no single point of failure, redundancy in server hardware components is not required for MapR. MapR automatically and transparently handles a hardware failure that results in the loss of any node in the cluster running any data or management service. MapR's default three-way replication of data ensures that no data is lost, because two additional replicas of the data are maintained on other nodes in the cluster. MapReduce tasks or HBase RegionServers from failed nodes are automatically started on other nodes in the cluster. Failure of a node running any management service is automatically and transparently recovered, as described for the following services.

All ZooKeeper services are available for read operations, with one acting as the leader for all writes. If the node running the leader fails, the remaining nodes elect a new leader. Most commonly, three ZooKeeper instances are used to allow HA operations. In some large clusters, five ZooKeeper instances are used to allow fully HA operations even during maintenance windows that affect ZooKeeper instances. The number of ZooKeeper instances that must run in a cluster depends on the cluster's high availability requirements, but it should always be an odd number. ZooKeeper requires a quorum of (N/2)+1 nodes to elect a leader, where N is the total number of ZooKeeper nodes. Running more than five ZooKeeper instances is not necessary.
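The quorum rule explains why an odd number of ZooKeeper instances is recommended; a quick check:

```python
def zookeeper_failures_tolerated(n: int) -> int:
    """A quorum of (n // 2) + 1 nodes must survive to elect a leader,
    so an n-node ensemble tolerates n - quorum failures."""
    return n - (n // 2 + 1)


# 3 nodes tolerate 1 failure and 5 tolerate 2, while an even count buys
# nothing extra (4 nodes still tolerate only 1): hence the odd-number rule.
for n in (3, 4, 5):
    print(n, zookeeper_failures_tolerated(n))
```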

All CLDB services are available for read operations, with one acting as the write master. If the node running the master CLDB service goes down, another running CLDB instance automatically becomes the master. A minimum of two instances is needed for high availability.

One JobTracker service is active. Other JobTracker instances are configured but not running. If the active JobTracker goes down, one of the configured instances automatically takes over without requiring any job to restart. A minimum of two instances is needed for high availability.

If HBase is run on the cluster, one HBase Master service is active. Other HBase Master instances are configured but not running. If the active HBase Master goes down, one of the configured instances automatically takes over HBase management. A minimum of two instances is needed for high availability. (MapR also offers M7, which provides a native NoSQL database through HBase APIs without requiring any HBase Master or RegionServer processes.)

All NFS servers are active simultaneously and can present an HA NFS service to nodes external to the cluster. To do this, specify virtual IP addresses for two or more NFS servers (see the sketch at the end of this section). Additionally, round-robin Domain Name System (DNS) can be used across the virtual IP addresses for load balancing in addition to high availability. For NFS access from within the cluster, NFS servers should run on all nodes in the cluster and each node should mount its local NFS server.

The MapR WebServer can run on any node in the cluster to host the MapR Control System. The web server also provides a REST interface to all MapR management and monitoring functions. For HA, multiple active web servers can be run, with users connecting to any of them for cluster management and monitoring. Note that even with no web server running, all monitoring and management capabilities are available through the MapR command line interface.

Within racks, switches and nodes have redundant power feeds, with each power feed connected to a separate PDU.
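As a minimal sketch of configuring NFS virtual IP addresses with the MapR command line (the address range and netmask are hypothetical; check the exact parameters against the MapR documentation for your release):

    # Assign a hypothetical range of virtual IPs to the cluster's NFS servers
    maprcli virtualip add -virtualip 10.10.30.100 -virtualipend 10.10.30.103 -netmask 255.255.255.0
    # Verify the virtual IP assignments
    maprcli virtualip list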

Storage availability

RAID disk configuration is not necessary and should be avoided in MapR clusters; RAID negatively affects performance. MapR provides automated setup and management of storage pools. The three-way replication provided by MapR-FS offers higher durability than RAID configurations because even multiple node failures might not compromise data integrity. If the default 3x replication is not sufficient for availability requirements, the replication factor can be increased on a file, volume, or cluster basis. Replication levels higher than 5 are not normally used.

Mirroring of MapR volumes within a single cluster can be used to achieve very high replication levels for higher durability or higher read bandwidth. Mirrors can also be used between clusters. MapR mirrors efficiently by copying only the changes to the mirror. Mirrors are useful for load balancing or disaster recovery. MapR also provides manual or scheduled snapshots of volumes to protect against human error and programming defects. Snapshots are useful for rolling back to a known data set.
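Replication and snapshots are managed per volume through the MapR command line. A minimal sketch follows; the volume and snapshot names are hypothetical:

    # Raise the replication factor of a hypothetical volume from the default 3 to 4
    maprcli volume modify -name project-data -replication 4
    # Take a manual snapshot of the same volume
    maprcli volume snapshot create -volume project-data -snapshotname before-reload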

Software availability considerations

Operating system availability

One of the hard disk drives on each node can be used to mirror the operating system, with RAID1 used to mirror the two drives. If OS mirroring is not used, the disk that would have been used for mirroring is available for MapR data storage. Using the same disk for OS and data is not recommended because it can compromise performance.

NameNode availability

The MapR Hadoop distribution is unique because it was designed with a No NameNode architecture for high availability; MapR is the only Hadoop distribution designed with no single point of failure. Other Hadoop distributions have a single primary NameNode, and when the NameNode goes down, the entire cluster becomes unavailable until the NameNode is restarted. Where other distributions are configured with multiple NameNodes, the entire cluster still becomes unavailable during the failover to a secondary NameNode. With MapR, the file metadata is replicated, distributed, and persistent, so there is no data loss or downtime even in the face of multiple disk or node failures.

JobTracker availability

MapR JobTracker HA improves recovery time objectives and provides for a self-healing cluster. Upon failure, the MapR JobTracker automatically restarts on another node in the cluster. TaskTrackers automatically pause and then reconnect to the new JobTracker, and any currently running jobs or tasks continue without losing progress or failing.

NFS availability

You can easily set up a pool of NFS nodes with HA and failover using virtual IP addresses. If one node fails, its virtual IP addresses are automatically reassigned to the next NFS node in the pool. It is also common to place an NFS server on every node where NFS access to the cluster is needed.

Migration considerations

If data or applications must be migrated to MapR, consideration needs to be given to the type and amount of data to be migrated and to the source of that data. Most data types can be migrated into MapR-FS, but it is necessary to understand the migration requirements to verify viability. Standard Hadoop tools such as distcp can be used to migrate data from other Hadoop distributions. For data in a POSIX file system, NFS-mount the MapR cluster and use standard Linux commands to copy the files into the MapR Hadoop cluster. Either Sqoop or database import/export tools combined with MapR NFS can be used to move data between databases and MapR Hadoop. Also consider whether applications need to be modified to take advantage of Hadoop functionality. Because the MapR read/write file system can be mounted by a standard NFS client, much of the application migration effort required by other Hadoop distributions can often be avoided.
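The three migration paths might look like the following sketch; the host names, cluster name, paths, and database coordinates are all hypothetical:

    # 1. Copy data from another Hadoop distribution with distcp, run on the MapR cluster
    hadoop distcp hdfs://source-namenode:8020/data maprfs:///migrated/data
    # 2. NFS-mount the MapR cluster and copy POSIX files with standard tools
    mount -o hard,nolock mapr-node01:/mapr /mapr
    cp -r /staging/files /mapr/my.cluster.com/migrated/
    # 3. Import a database table with Sqoop
    sqoop import --connect jdbc:mysql://db-host/sales --table orders --target-dir /migrated/orders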

Appendix 1: Bill of material

This section contains a bill of materials list for each of the core components of the reference architecture. The bill of materials includes the part numbers, component descriptions, and quantities. Table 6 shows how many core components are required for each of the predefined configuration sizes.

Table 6: Mapping between predefined configuration sizes and the bill of materials. For each predefined configuration size (small, medium, large, and very large), the table gives the required quantity of each core component: node, administration/management network switch, data network switch, rack, and cables.

Node

Part number  Description  Quantity
5466AC1  -SB- IBM System x3650 M4 BD  1
A4T6  1U Riser Card, Two PCIe x8 Slots  1
A4T7  1U Riser Card, One PCIe x8 Slot (For Slotless RAID only)  1
A47H  Solarflare SFN5162F 2x10GbE SFP+ Performant Adapter for IBM System x
5977  Select Storage devices - no IBM-configured RAID required  1
A3W9F  IBM 4TB 7.2K 6Gbps NL SATA 3.5 inch G2HS HDD  14
A4RW  -SB- IBM System x 900W High Efficiency Platinum AC Power Supply  1
A4WC  -SB- System Documentation and Software-US English

A4S4  Intel Xeon Processor E v2 8C 2.6GHz 20MB Cache 1866MHz 95W
A4SA  Addl Intel Xeon Processor E v2 8C 2.6GHz 20MB Cache 1866MHz 95W
A3QG  8GB (1x8GB, 1Rx4, 1.5V) PC CL13 ECC DDR3 1866MHz LP RDIMM
A3MW  N2115 SAS/SATA HBA for IBM System x  1
A4RQ  -SB- x3650 M4 Planar  1
A4RG  X3650 M4 BD Chassis ASM w/o Planar
m, 10A/ V, C13 to IEC 320-C14 Rack Power Cable  1
A4RR  -SB- 3.5" Hot Swap BP Bracket Assembly, 12x 3.5"  1
A4RS  3.5" Hot Swap Cage Assembly, Rear, 2 x 3.5" (Cage Only)
Rack Installation >1U Component  1
A4RH  -SB- BIOS GBM  1
A4RJ  -SB- 1U RISER CAGE - SLOT 2  1
A4RK  -SB- 1U BUTTERFLY RISER CAGE - SLOT 1  1
A4RT  -SB- Rear Backplane, 2x3.5" HDD Extension To Front Array (14HDD Array)
A4RP  -SB- Label GBM  1
A50F  -SB- 2x2 HDD BRACKET  1
A48R  2U Bracket for Solarflare Dual Port 10GbE SFP+ Adapter  1
A4TJ  -SB- Cable, IPASS, 760mm for 6Gb cards  1
A207  Rail Kit for x3650 M4 BD, x3630 M4 and x3530 M4  1
A20P  Package for x3650 M4 BD and x3630 M4  1
A2M3  Shipping Bracket for x3650 M4 BD and x3630 M4  1

Table 7: Sample node bill of materials

Administration / Management network switch

Part number  Description  Quantity
7309HC1  IBM System Networking RackSwitch G8052 (Rear to Front)
m, 10A/ V, C13 to IEC 320-C14 Rack Power Cable  2
A3KP  IBM System Networking Adjustable 19" 4 Post Rail Kit
Rack Installation of 1U Component  1

Table 8: Sample administration / management network switch bill of materials

Data network switch

Part number  Description  Quantity
7309HC3  IBM System Networking RackSwitch G8264 (Rear to Front)
m, 10A/ V, C13 to IEC 320-C14 Rack Power Cable  2
A3KP  IBM System Networking Adjustable 19" 4 Post Rail Kit
Rack Installation of 1U Component  1

Table 9: Sample data network switch bill of materials

Rack

Part number  Description  Quantity
1410RC4  E U rack cabinet
DPI Single-phase 60A/208V C13 Enterprise PDU (US)
Cluster 1350 Ship Group
Rack Assembly - 42U Rack
Cluster Hardware and Fabric Verification - 1st Rack  1

Table 10: Sample rack bill of materials

Cables

Part number  Description  Quantity
A1PJ  3m IBM Passive DAC SFP+ Cable
IntraRack CAT5E Cable Service  2

Table 11: Sample cables bill of materials

This bill of materials information is for the United States; part numbers and descriptions may vary in other countries. Other sample configurations are available from your IBM sales team. Components are subject to change without notice.

Resources

IBM System x3650 M4 BD (MapR Node)
- On ibm.com: ibm.com/systems/x/hardware/rack/x3650m4bd/
- On IBM Redbooks: ibm.com/redbooks/abstracts/tips0889.html

IBM RackSwitch G8052 (1GbE Switch)
- On ibm.com: ibm.com/systems/networking/switches/rack/g8052
- On IBM Redbooks: ibm.com/redbooks/abstracts/tips0813.html

IBM RackSwitch G8264 (10GbE Switch)
- On ibm.com: ibm.com/systems/networking/switches/rack/g8264
- On IBM Redbooks: ibm.com/redbooks/abstracts/tips0815.html

MapR:
- MapR main website
- MapR products
- MapR M7 overview
- MapR No NameNode Architecture
- MapR Resources
- MapR products and differentiation
- MapR starting point
- Planning your MapR Deployment
- MapR CLDB
- MapR ZooKeeper

Open source software:
- Hadoop
- Flume
- HBase
- Hive
- Oozie
- Pig
- ZooKeeper

Other resources:
- xCAT

Trademarks and special notices

Copyright IBM Corporation 2014.

References in this document to IBM products or services do not imply that IBM intends to make them available in every country.

IBM, the IBM logo, and ibm.com are trademarks or registered trademarks of International Business Machines Corporation in the United States, other countries, or both. If these and other IBM trademarked terms are marked on their first occurrence in this information with a trademark symbol (® or ™), these symbols indicate U.S. registered or common law trademarks owned by IBM at the time this information was published. Such trademarks may also be registered or common law trademarks in other countries. A current list of IBM trademarks is available on the web in "Copyright and trademark information" at ibm.com/legal/copytrade.shtml.

Java and all Java-based trademarks and logos are trademarks or registered trademarks of Oracle and/or its affiliates.

Intel, Intel Inside (logos), MMX, and Pentium are trademarks of Intel Corporation in the United States, other countries, or both.

UNIX is a registered trademark of The Open Group in the United States and other countries.

Linux is a trademark of Linus Torvalds in the United States, other countries, or both.

Other company, product, or service names may be trademarks or service marks of others.

Information is provided "AS IS" without warranty of any kind.

All customer examples described are presented as illustrations of how those customers have used IBM products and the results they may have achieved. Actual environmental costs and performance characteristics may vary by customer.

Information concerning non-IBM products was obtained from a supplier of these products, published announcement material, or other publicly available sources and does not constitute an endorsement of such products by IBM. Sources for non-IBM list prices and performance numbers are taken from publicly available information, including vendor announcements and vendor worldwide homepages. IBM has not tested these products and cannot confirm the accuracy of performance, capability, or any other claims related to non-IBM products. Questions on the capability of non-IBM products should be addressed to the supplier of those products.

All statements regarding IBM future direction and intent are subject to change or withdrawal without notice, and represent goals and objectives only. Contact your local IBM office or IBM authorized reseller for the full text of the specific Statement of Direction.

Some information addresses anticipated future capabilities. Such information is not intended as a definitive statement of a commitment to specific levels of performance, function, or delivery schedules with respect to any future products. Such commitments are only made in IBM product announcements. The information is presented here to communicate IBM's current investment and development activities as a good faith effort to help with our customers' future planning.

Performance is based on measurements and projections using standard IBM benchmarks in a controlled environment. The actual throughput or performance that any user will experience will vary depending upon considerations such as the amount of multiprogramming in the user's job stream, the I/O configuration, the storage configuration, and the workload processed. Therefore, no assurance can be given that an individual user will achieve throughput or performance improvements equivalent to the ratios stated here.

Photographs shown are of engineering prototypes. Changes may be incorporated in production models.

Any references in this information to non-IBM websites are provided for convenience only and do not in any manner serve as an endorsement of those websites. The materials at those websites are not part of the materials for this IBM product and use of those websites is at your own risk.
