Redpaper

Steven Hurley
James C. Wang

IBM System x Reference Architecture for Hadoop: IBM InfoSphere BigInsights Reference Architecture

Introduction

The IBM System x reference architecture is a predefined and optimized hardware infrastructure for IBM InfoSphere BigInsights 2.1, which is a distribution of Apache Hadoop with added-value capabilities that are specific to IBM. The reference architecture provides a predefined hardware configuration for implementing InfoSphere BigInsights 2.1 on System x hardware. The reference architecture can be implemented in two ways, to support MapReduce workloads or Apache HBase workloads:

- MapReduce is a core component of Hadoop that provides a job scheduler and management framework for batch-oriented, high-throughput data access and distributed computation.
- Apache HBase is a schemaless NoSQL database that is built upon Hadoop to provide high-throughput random data reads and writes and data caching.

The predefined configuration is a baseline configuration for an InfoSphere BigInsights cluster and includes modifications for an InfoSphere BigInsights cluster that is running HBase. The predefined configurations can be modified based on specific customer requirements, such as lower cost, improved performance, and increased reliability.

Business problem and business value

This section describes the business problem that is associated with big data environments and the value that InfoSphere BigInsights offers.

© Copyright IBM Corp. 2013, 2014. All rights reserved. ibm.com/redbooks
Business problem

Every day, we create 2.5 quintillion bytes of data. It is so much that 90% of the data in the world today was created in the last two years alone. This data comes from everywhere, such as sensors used to gather climate information, posts to social media sites, digital pictures and videos, purchase transaction records, and cell phone GPS signals. This data is big data. Big data spans three dimensions:

- Volume. Big data comes in one size: large. Enterprises are awash with data, easily amassing terabytes and even petabytes of information.
- Velocity. Often time-sensitive, big data must be used as it streams into the enterprise to maximize its value to the business.
- Variety. Big data extends beyond structured data to include unstructured data of all varieties: text, audio, video, click streams, log files, and more.

Big data is more than a challenge. It is an opportunity to find insight in new and emerging types of data, to make your business more agile, and to answer questions that, in the past, were beyond reach. Until now, there was no practical way to harvest this opportunity. Today, IBM's platform for big data uses such technologies as the real-time analytics processing capabilities of stream computing and the massive MapReduce scale-out capabilities of Hadoop to open the door to a world of possibilities. As part of the IBM platform for big data, IBM InfoSphere Streams allows you to capture and act on all of your business data, all of the time, just in time.

Business value

IBM InfoSphere BigInsights brings the power of Apache Hadoop to the enterprise. Hadoop is the open source software framework that is used to reliably manage large volumes of structured and unstructured data. InfoSphere BigInsights enhances this technology to withstand the demands of your enterprise, adding administrative, workflow, provisioning, and security features, along with best-in-class analytical capabilities from IBM Research. The result is a more developer-friendly and user-friendly solution for complex, large-scale analytics.

How can businesses process tremendous amounts of raw data in an efficient and timely manner to gain actionable insights? By using InfoSphere BigInsights, organizations can run large-scale, distributed analytics jobs on clusters of cost-effective server hardware. This infrastructure can be used to tackle large data sets by breaking up the data into chunks and coordinating the processing of the data across a massively parallel environment. When the raw data is stored across the nodes of a distributed cluster, queries and analysis of the data can be handled efficiently, with dynamic interpretation of the data format at read time. The bottom line is that businesses can finally embrace massive amounts of untapped data and mine that data for valuable insights in a more efficient, optimized, and scalable way.

Reference architecture use

The System x Reference Architecture for Hadoop: InfoSphere BigInsights represents a well-defined starting point for architecting a BigInsights hardware and software solution and can be modified to meet client requirements.
When reviewing the potential of using System x with InfoSphere BigInsights, use this reference architecture paper as part of an overall assessment process with a customer. When working on a big data proposal with a client, you can go through several phases and activities, as outlined in the following list and in Table 1:

- Discover the client's technical requirements and usage (hardware, software, data center, workload, user data, and high availability).
- Analyze the client's requirements and current environment.
- Exploit with proposals based on IBM hardware and software.

Table 1: Client technical discovery, analysis, and exploitation

New applications
- Discover: Determine data storage requirements, including user data size and compression ratio. Determine high availability requirements. Determine customer corporate networking requirements, such as networking infrastructure and IP addressing. Determine whether data node OS disks require mirroring. Determine disaster recovery requirements, including backup/recovery and multisite disaster recovery requirements. Determine cooling requirements, such as airflow and BTU requirements. Determine workload characteristics, such as MapReduce or HBase.
- Analyze: Identify a cluster management strategy, such as node firmware and OS updates. Identify a cluster rollout strategy, such as node hardware and software deployment.
- Exploit: Propose an InfoSphere BigInsights cluster as the solution to big data problems. Use the IBM System x M4 architecture for easy scalability of storage and memory.

Existing applications
- Discover: Determine data storage requirements and existing shortfalls. Determine memory requirements and existing shortfalls. Determine throughput requirements and existing bottlenecks.
- Analyze: Identify system utilization inefficiencies.
- Exploit: Propose a nondisruptive and lower-risk solution. Propose a Proof-of-Concept (PoC) for the next server deployment. Propose an InfoSphere BigInsights cluster as a solution to big data problems. Use the System x M4 architecture for easy scalability of storage and memory.

Data center health
- Discover: Determine server sprawl. Determine electrical, cooling, and space headroom.
- Analyze: Identify inefficiency concerns.
- Exploit: Propose a scalable InfoSphere BigInsights cluster. Propose lowering data center costs with energy-efficient System x servers.

Requirements

The hardware and software requirements for the System x Reference Architecture for Hadoop: InfoSphere BigInsights are embedded throughout this IBM Redpaper publication within the appropriate sections.
InfoSphere BigInsights predefined configuration

This section describes the predefined configuration for the InfoSphere BigInsights reference architecture.

Architectural overview

From an infrastructure design perspective, Hadoop has two key aspects: the Hadoop Distributed File System (HDFS) and MapReduce. An IBM InfoSphere BigInsights reference architecture solution has three server roles:

- Management nodes. Implemented on System x3550 M4 servers, these nodes run the InfoSphere BigInsights daemons that are related to managing the cluster and coordinating the distributed environment.
- Data nodes. Implemented on System x3650 M4 BD servers, these nodes run the daemons that are related to storing data and accomplishing work within the distributed environment.
- Edge nodes. These nodes act as a boundary between the InfoSphere BigInsights cluster and the outside (client) environment.

The number of each type of node that is required within an InfoSphere BigInsights cluster depends on the client requirements. Such requirements might include the size of the cluster, the size of the user data, the data compression ratio, workload characteristics, and data ingest.

HDFS is the file system in which Hadoop stores data. HDFS provides a distributed file system that spans all the nodes within a Hadoop cluster, linking the file systems on many local nodes to make one big file system with a single namespace. HDFS has three associated daemons:

- NameNode. Runs on a management node and is responsible for managing the HDFS namespace and access to the files stored in the cluster.
- Secondary NameNode. Typically runs on a management node and is responsible for maintaining periodic checkpoints for recovery of the HDFS namespace if the NameNode daemon fails. The Secondary NameNode is a distinct daemon and is not a redundant instance of the NameNode daemon.
- DataNode. Runs on all data nodes and is responsible for managing the storage that is used by HDFS across the BigInsights Hadoop cluster.

InfoSphere BigInsights 2.1 comes with two options for MapReduce: Hadoop MapReduce v1, which is part of the Apache Hadoop open source project, and IBM Adaptive MapReduce. IBM Adaptive MapReduce is a low-latency job scheduler that is capable of running distributed application services on a scalable, shared, heterogeneous grid, and it supports sophisticated workload management capabilities beyond those of standard Hadoop MapReduce.

MapReduce is the distributed computing and high-throughput data access framework through which Hadoop understands jobs and assigns work to servers within the BigInsights Hadoop cluster. Apache Hadoop MapReduce has two associated daemons:
- JobTracker. Runs on a management node and is responsible for submitting, tracking, and managing MapReduce jobs.
- TaskTracker. Runs on all data nodes and is responsible for completing the actual work of a MapReduce job, reading data that is stored within HDFS and running computations against that data.

Additionally, InfoSphere BigInsights has an administrative console that helps administrators maintain servers, manage services and HDFS components, and manage data nodes within the InfoSphere BigInsights cluster. The InfoSphere BigInsights console runs on a management node.

Component model

Figure 1 illustrates the component model for the InfoSphere BigInsights reference architecture: the management nodes host the HDFS services (NameNode and Secondary NameNode), the MapReduce JobTracker, and the BigInsights Console, and each data node hosts a DataNode and a TaskTracker daemon.

Figure 1: InfoSphere BigInsights Reference Architecture component model

Regarding networking, the reference architecture specifies two networks for a MapReduce implementation: a data network, and an administrative and management network. All networking is based on IBM RackSwitch switches. For more information about networking, see "Networking configuration".

To facilitate easy sizing, the predefined configuration for the reference architecture comes in three sizes:

- Starter rack configuration. Consists of three data nodes, the required number of management nodes, and the required IBM RackSwitch switches.
- Half rack configuration. Consists of nine data nodes, the required number of management nodes, and the required IBM RackSwitch switches.
- Full rack configuration. Consists of up to 20 data nodes, the required number of management nodes, and the required IBM RackSwitch switches.
The configuration is not limited to these sizes, and any number of data nodes is supported. For more information about the number of data nodes per rack in full-rack and multirack configurations, see "Rack considerations".

Cluster node and networking configuration and sizing

This section describes the predefined configurations for management nodes, data nodes, and networking for an InfoSphere BigInsights solution.

Management node configuration and sizing

Management nodes host the following HDFS, MapReduce, and BigInsights management daemons:

- NameNode
- Secondary NameNode
- JobTracker
- BigInsights Console

The management node is based on the IBM System x3550 M4 server. Table 2 lists the predefined configuration of a management node.

Table 2: Management node predefined configuration

- System: System x3550 M4
- Processor: 2 x Intel Xeon E5-2650 v2 2.6 GHz 8-core
- Memory (base): 128 GB = 8 x 16 GB 1866 MHz RDIMM
- Disk (OS and application): 1, 2, or 3 x 3.5-inch NL SATA (same capacity as data nodes) (a)
- HDD controller: ServeRAID M5110 SAS/SATA Controller
- Hardware storage protection: RAID hardware mirroring of two disk drives
- User space (per server): None
- Administration/management network adapter: Integrated 1GBaseT Adapter
- Data network adapter: 2 x Mellanox ConnectX-3 EN Dual-port SFP+ 10GbE Adapters

(a) The recommended default number of drives is two, to provide fault tolerance that is based on RAID hardware mirroring of the two drives.

An InfoSphere BigInsights Hadoop MapReduce cluster requires between one and four management nodes, depending on the client's environment. Table 3 specifies the number of required management nodes. In this table, the columns that contain node information represent InfoSphere BigInsights Hadoop services that are housed across cluster management nodes.
Table 3: MapReduce cluster required management nodes

- Development environment: 1 management node. Node 1: NameNode (a), JobTracker, BigInsights Console.
- Production/test environment: 3 management nodes (b). Node 1: NameNode. Node 2: JobTracker, Secondary NameNode. Node 3: BigInsights Console.
- Production/test environment with highly available NameNode: 4 management nodes (b). Node 1: NameNode (active or standby). Node 2: NameNode (active or standby). Node 3: JobTracker. Node 4: BigInsights Console.

(a) In a single management node configuration, place the Secondary NameNode on a data node to enable recoverability of the HDFS namespace if a failure of the management node occurs.
(b) For fault recoverability in multirack production and test environments where no UPS is utilized, whenever possible, avoid placing management node 1 and management node 2 in the same rack. If a UPS is utilized, the recommendation is to distribute management nodes such that power to all management nodes is provided via the UPS source, to allow management-related data to be synced down to local disk or to HA NFS.

Data node configuration and sizing

Data nodes house the Hadoop HDFS and MapReduce daemons: DataNode and TaskTracker. The data node is based on the IBM System x3650 M4 BD storage-rich server. The System x3650 M4 BD is a purpose-built big data storage server that is engineered to provide the optimal blend of performance, uptime, and abundant, low-cost storage. Table 4 describes the predefined configuration for a data node.

Table 4: Data node predefined configuration

- System: System x3650 M4 BD
- Processor: 2 x Intel Xeon E5-2650 v2 2.6 GHz 8-core
- Memory (base): 64 GB = 8 x 8 GB 1866 MHz RDIMM
- Disk (OS) (a): 3 TB drives: 1 or 2 x 3 TB NL SATA 3.5-inch. 4 TB drives: 1 or 2 x 4 TB NL SATA 3.5-inch.
- Disk (data) (b): 3 TB drives: 12 x 3 TB NL SATA 3.5-inch (36 TB total). 4 TB drives: 12 x 4 TB NL SATA 3.5-inch (48 TB total).
- HDD controller: N2215 12 Gb JBOD Controller
- Hardware storage protection: None (JBOD). By default, HDFS maintains a total of three copies of data that is stored within the cluster. The copies are distributed across data servers and racks for fault recovery.
- Management network adapter: Integrated 1GBaseT Adapter
- Data network adapter: Mellanox ConnectX-3 EN Dual-port SFP+ 10GbE Adapter

(a) OS drives are recommended to be the same size as the data drives. If two OS drives are used, the drives can be configured in a just-a-bunch-of-disks (JBOD) or RAID hardware mirroring configuration. Available space on the OS drives can also be used for more HDFS storage, more MapReduce shuffle/sort space, or both.
(b) All data drives should be of the same size, either 3 TB or 4 TB.

When you estimate disk space within an InfoSphere BigInsights Hadoop cluster, consider the following points:

- For improved fault tolerance and improved performance, HDFS replicates data blocks across multiple cluster data nodes. By default, HDFS maintains three replicas.
- During MapReduce processing, intermediate shuffle/sort data is written by Mappers to storage and pulled by Reducers, potentially between data nodes, during the reduce phase. If a MapReduce job requires more than the available shuffle file space, the job will terminate. As a rule of thumb, reserve 25% of total disk space for the local file system as shuffle file space. The actual space that is required for shuffle/sort is workload-dependent. In the unusual situation where the 25% rule of thumb is insufficient, available space on the OS drives can be used to provide more shuffle/sort space.
- The compression ratio is an important consideration in estimating disk space. Within Hadoop, both the user data and the shuffle/sort data can be compressed. Assume 35% compression if customer-specific compression data is not available.

Note: A 35% compression is an estimate based on measurements taken in a controlled environment. Compression results vary based on the data and compression libraries used. IBM cannot guarantee compression results or compressed data storage amounts. Improved estimates can be calculated by testing customer data with appropriate compression libraries.

Assuming that the default three replicas are maintained by HDFS, the total cluster data space and the required number of data nodes can be estimated by using the following equations:

Total Data Disk Space = 4 x (Uncompressed Raw User Data) x (1 - Compression Ratio)
Total Required Data Nodes = (Total Data Disk Space) / (Data Disk Space per Server)

When you estimate disk space, also consider future growth requirements.
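To make the arithmetic concrete, the following minimal Python sketch applies these two equations. The function and variable names are illustrative, not part of the reference architecture, and it assumes the factor of 4 reconstructed above (three HDFS replicas divided by the 75% of disk left after the 25% shuffle/sort reserve):

```python
import math

def size_cluster(raw_user_data_tb, data_disk_space_per_server_tb,
                 compression=0.35, replicas=3, shuffle_reserve=0.25):
    """Estimate total disk space and data node count for a BigInsights cluster.

    raw_user_data_tb: uncompressed raw user data, in TB.
    compression: assumed compression savings (0.35 = data is 35% smaller).
    replicas / shuffle_reserve: HDFS replica count and shuffle/sort disk reserve.
    """
    compressed_tb = raw_user_data_tb * (1 - compression)
    # Three replicas, with 25% of total disk held back for shuffle/sort:
    # total = replicas * compressed / (1 - shuffle_reserve) = 4 x compressed
    total_disk_tb = replicas * compressed_tb / (1 - shuffle_reserve)
    data_nodes = math.ceil(total_disk_tb / data_disk_space_per_server_tb)
    return total_disk_tb, data_nodes

# Example: 500 TB of raw data on data nodes with 12 x 3 TB drives (36 TB each)
total, nodes = size_cluster(500, 36)
print(f"~{total:.0f} TB of disk across {nodes} data nodes")  # ~1300 TB, 37 nodes
```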
Networking configuration

The reference architecture specifies two networks:

- Data network. The data network is a private 10 GbE cluster data interconnect among data nodes that is used for data access, moving data across nodes within the cluster, and ingesting data into HDFS. The InfoSphere BigInsights cluster typically connects to the client's corporate data network by using one or more edge nodes. These edge nodes can be System x3550 M4 servers, other System x servers, or other client-specified servers. Edge nodes act as interface nodes between the InfoSphere BigInsights cluster and the outside client environment (for example, data ingested from a corporate network into the cluster). Not every rack has an edge node connection to a client network. Data can be ingested into the cluster via edge nodes or via parallel ingest.
- Administrative/management network. The administrative/management network is a 1 GbE network that is used for in-band OS administration and out-of-band hardware management. In-band administrative services, such as Secure Shell (SSH) or Virtual Network Computing (VNC), that run on the host operating system allow administration of cluster nodes. Out-of-band management, by using the Integrated Management Module II (IMM2) within the x3550 M4 and x3650 M4 BD, allows hardware-level management of cluster nodes, such as node deployment or BIOS configuration. Hadoop has no dependency on IMM2. Based on client requirements, the administration and management links can be segregated onto separate VLANs or subnets. The administrative/management network is typically connected directly into the client's administrative network.

Figure 2 shows a predefined InfoSphere BigInsights cluster network, with edge nodes bridging the corporate data network and the private data network, and with the administration and IMM network connected to the corporate administration network.

Figure 2: Predefined cluster network

Table 5 shows the IBM rack switches that are used in the reference architecture.

Table 5: IBM rack switches

- 1 GbE top-of-rack switch for the administration/management network (two physical links to each node: one link for in-band OS administration and one link for out-of-band IMM2 hardware management) (a): IBM System Networking RackSwitch G8052
- 10 GbE top-of-rack switch for the data network (two physical 10 GbE links to each node, aggregated) (b): IBM System Networking RackSwitch G8264
- 40 GbE switch for interconnecting the data network across multiple racks (40 GbE links interconnecting each G8264 top-of-rack switch; link aggregation depends on the number of core switches and interconnect topology) (b): IBM System Networking RackSwitch G8316 (16 x 40 GbE ports) or G8332 (32 x 40 GbE ports) (c)

(a) The administrative links and management links can be segregated onto separate VLANs or subnets.
(b) To avoid a single point of failure, use redundant top-of-rack (TOR) and core switches.
(c) Using the G8332 32-port 40 GbE switch allows aggregating more racks per core switch.
Figure 3 shows the predefined networking connections within a rack: each node has 1 Gb administration and IMM links to the G8052, two link-aggregated 10 Gb data links to the G8264, and the G8052 uplinks to the G8264 so that the administrative/management VLAN can reach other racks.

Figure 3: Networking predefined configuration

The networking predefined configuration has the following characteristics:

- The administration/management network is typically connected to the client's administration network.
- Management and data nodes each have two administration/management network links: one link for in-band OS administration and one link for out-of-band IMM2 hardware management. On the x3550 M4 management nodes, the administration link should connect to port 1 on the integrated 1GBaseT adapter, and the management link should connect to the dedicated IMM2 port. On the x3650 M4 BD data nodes, the administration link should connect to port 1 on the integrated 1GBaseT adapter, and the management link should connect to the dedicated IMM2 port.
- The data network is a private VLAN or subnet. The two Mellanox 10 GbE ports of each data node are link-aggregated to the G8264 for better performance and improved high availability.
- The cluster administration/management network is connected to the corporate network. Each node has two links to the G8052 RackSwitch at the top of the rack: one for the administration network and one for the IMM2. Within each rack, the G8052 has two uplinks to the G8264 to allow propagation of the administrative/management VLAN across cluster racks by using the G8316 core switch.
- Not every rack has an edge node connection to the client's corporate data network. For more information about edge nodes, see "Customizing the predefined configurations".
- Given the importance of their role within the cluster, System x3550 M4 management nodes have two Mellanox dual-port 10 GbE networking cards for fault tolerance. The first port on each Mellanox card should connect back to the G8264 switch at the top of the rack. The second port on each Mellanox card is available to connect into the client's data network in cases where the node functions as an edge node for data ingest and access.
Figure 4 shows the rack-level connections in greater detail: the G8264 data switch (one required, two for HA) carries the private data network and reserves uplink ports for scale-out, the G8052 carries the corporate-addressed administration/IMM network, and edge nodes (one required, two or more for HA and parallelism) bridge the customer data network into the cluster.

Figure 4: Big data rack connections

The data network is connected across racks by two aggregated 40 GbE uplinks from each rack's G8264 switch to a core G8316 switch.
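As a rough check on how many racks a single core switch can aggregate, the short Python sketch below divides the core port count by the uplinks that each rack consumes. The function name is illustrative; the port counts come from Table 5, and the two uplinks per rack come from the description above:

```python
def max_racks_per_core(core_ports, uplinks_per_rack=2):
    # Each rack's G8264 consumes a fixed number of core switch ports.
    return core_ports // uplinks_per_rack

print(max_racks_per_core(16))  # G8316: 16 x 40 GbE ports -> 8 racks
print(max_racks_per_core(32))  # G8332: 32 x 40 GbE ports -> 16 racks
```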
Figure 5 shows the cross-rack networking by using the core switch: each rack's G8264 uplinks to the core G8316, the G8052 switches carry the administration/IMM network, and edge nodes connect selected racks to the customer data network.

Figure 5: Cross rack networking

Edge node considerations

The edge node acts as a boundary between the InfoSphere BigInsights cluster and the outside (client) environment. The edge node is used for data ingest, which refers to routing data into the cluster through the data network of the reference architecture. Edge nodes can be System x3550 M4 servers, other System x servers, or other client-provided servers. Table 6 provides a predefined edge node configuration of the reference architecture for InfoSphere BigInsights.

Table 6: Edge node predefined configuration

- System: System x3550 M4
- Processor: 2 x Intel Xeon E5-2650 v2 2.6 GHz 8-core
- Memory (base): 128 GB = 8 x 16 GB 1866 MHz RDIMM
- Disk (OS): 2 x 600 GB 2.5-inch SAS
- Disk (application): 2 x 600 GB 2.5-inch SAS
- HDD controller: ServeRAID M5110 SAS/SATA Controller
- Hardware storage protection: OS storage on 2 x 600 GB drives that are mirrored by using RAID hardware mirroring. Application storage on 2 x 600 GB drives in a JBOD or RAID hardware mirroring configuration.
- Administration/management network adapter: Integrated 1GBaseT Adapter
- Data network adapter: 2 x Mellanox ConnectX-3 EN Dual-port SFP+ 10GbE Adapters

With the design of the System x3550 M4 management node, the same configuration can be used as an edge node. When you use this configuration as an edge node, the first port on each Mellanox dual-port 10GbE network adapter connects back to the G8264 switch at the top of the node's home rack. The second port on each Mellanox dual-port 10GbE network adapter connects to the client's data network. This edge node design serves as a ready-made platform for extract, transform, and load (ETL) tools, such as IBM InfoSphere DataStage.

Although a BigInsights cluster can have multiple edge nodes, depending on applications and workload, not every cluster rack needs to be connected to an edge node. However, every data node within the BigInsights cluster must have a cluster data network IP address that is routable from within the corporate data network. As gateways into the BigInsights cluster, edge nodes must be properly sized to ensure that they do not become a bottleneck for accessing the cluster, for example, during high-volume ingest periods.

Important: The number of edge nodes and the edge node server physical attributes that are required depend on ingest volume and velocity. Because of physical space constraints within a rack, adding an edge node to a rack can displace a data node.

In low volume/velocity ingest situations (< 1 GB/hr), the InfoSphere BigInsights console management node can be used as an edge node. InfoSphere DataStage and InfoSphere Data Click servers can also function as edge nodes. When using InfoSphere DataStage or other ETL software, consult an appropriate ETL specialist for server selection. In Proof-of-Concept (PoC) situations, the edge node can be used to isolate both cluster networks (data and administrative/management) from the customer corporate network.
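The warning about edge node bottlenecks can be made concrete with a back-of-the-envelope check. This hedged Python sketch (the names, the per-node effective bandwidth, and the headroom fraction are illustrative assumptions, not part of the reference architecture) estimates how many edge nodes a target peak ingest rate needs:

```python
import math

def edge_nodes_needed(peak_ingest_gbps, per_node_effective_gbps=10.0,
                      headroom=0.5):
    """Estimate edge node count for a target peak ingest rate.

    per_node_effective_gbps: assumed usable bandwidth of one edge node's
    10 GbE link into the cluster data network.
    headroom: fraction of link capacity deliberately left unused.
    """
    usable = per_node_effective_gbps * (1 - headroom)
    return max(1, math.ceil(peak_ingest_gbps / usable))

# Example: sustain 12 Gb/s of peak ingest while keeping 50% headroom
print(edge_nodes_needed(12))  # -> 3 edge nodes
```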
Power considerations

Within racks, switches and management nodes have redundant power feeds, with each power feed connected to a separate power distribution unit (PDU). Data nodes have a single power feed, and the data node power feeds should be connected so that all power feeds within the rack are balanced across the PDUs. Figure 6 shows the power connections within a full rack with three management nodes and four 30 A PDUs.

Figure 6: Power connections

Rack considerations

Within a rack, data nodes occupy 2U of space, and management nodes and rack switches occupy 1U of space. A one-rack InfoSphere BigInsights implementation comes in three sizes: starter rack, half rack, and full rack. These three sizes allow for easy ordering. However, reference architecture sizing is not rigid and supports any number of data nodes with the appropriate number of management nodes. Table 7 describes the node counts.
Table 7: Rack configuration node counts

- Starter rack: 3 data nodes (a)(c); 1, 3, or 4 management nodes (b)
- Half rack: 9 data nodes; 1, 3, or 4 management nodes
- Full rack with management nodes: 18 data nodes (d); 1, 3, or 4 management nodes
- Full data node rack, no management nodes: 20 data nodes; 0 management nodes

(a) Maximum number of data nodes per full rack is based on network switches, management nodes, and data nodes. Adding edge nodes to the rack can displace additional data nodes.
(b) The number of management nodes depends on the development or production/test environment type. For more information about selecting the correct number of management nodes, see "Management node configuration and sizing".
(c) The starter rack can be expanded to a full rack by adding more data and management nodes.
(d) A full rack with one or two management nodes can accommodate up to 19 data nodes.

An InfoSphere BigInsights implementation can be deployed as a multirack solution. If the system is initially implemented as a multirack solution, or if the system grows by adding more racks, distribute the cluster management nodes across racks to maximize fault tolerance.

In the reference architecture for InfoSphere BigInsights, a fully populated predefined rack with one G8264 switch and one G8052 switch can support up to 20 data nodes. However, the total number of data nodes that a rack can accommodate can vary based on the number of top-of-rack switches and management nodes that are required for the rack within the overall solution design. The number of data nodes can be calculated by the following equation:

Maximum number of data nodes = (42U - (# 1U Switches + # 1U Management Nodes)) / 2

Edge nodes: This calculation does not consider edge nodes. Based on the client's choice of edge node, proportions can vary. Every two 1U edge nodes displace one data node, and every 2U edge node displaces one data node.
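The rack equation above is easy to script; the following minimal Python sketch (the function name is illustrative) reproduces the node counts in Table 7:

```python
def max_data_nodes(switches_1u, mgmt_nodes_1u, rack_units=42):
    # Data nodes are 2U; switches and management nodes are 1U each.
    return (rack_units - (switches_1u + mgmt_nodes_1u)) // 2

print(max_data_nodes(2, 3))  # full rack, 3 management nodes -> 18
print(max_data_nodes(2, 1))  # full rack, 1 management node  -> 19
print(max_data_nodes(2, 0))  # data nodes only               -> 20
```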
Figure 7 shows an example of a starter rack, a half-rack, and a full-rack configuration, each with a G8052 and G8264 switch pair, three management nodes, 30 A PDUs, and an increasing number of data nodes.

Figure 7: Sample starter, half-rack, and full-rack configurations
Figure 8 shows an example of scale-out rack configurations: a full rack with one management node, and a full rack that contains data nodes only.

Figure 8: Sample scale-out configurations

InfoSphere BigInsights HBase predefined configuration

This section describes the predefined configuration for the InfoSphere BigInsights HBase reference architecture.

Architectural overview

HBase is a schemaless NoSQL database that is implemented within the Hadoop environment and is included in InfoSphere BigInsights. HBase has its own set of daemons that run on management nodes and data nodes. The HBase daemons are in addition to the management node and data node daemons of HDFS and MapReduce, as described in "InfoSphere BigInsights predefined configuration".
HBase has two more daemons that run on master nodes:

- HMaster. The HBase master daemon. It is responsible for monitoring the HBase cluster and is the interface for all metadata changes.
- ZooKeeper. A centralized daemon that enables synchronization and coordination across the HBase cluster.

HBase has one daemon that runs on all data nodes: the HRegionServer daemon. The HRegionServer daemon is responsible for managing and serving HBase regions. Within HBase, a region is the basic unit of distribution of an HBase table, allowing a table to be distributed across multiple servers within a cluster.

Use care when considering running MapReduce workloads in a cluster that is also running HBase. MapReduce jobs can use significant resources and can have a negative impact on HBase query performance and service-level agreements (SLAs). Some utilities, such as IBM BigSQL, are able to effectively collocate MapReduce and HBase workloads within the same cluster. We recommend giving careful consideration before running MapReduce jobs (beyond those related to HBase utilities) on a cluster that requires low-latency responses to HBase queries.

Because HBase is implemented within Hadoop, the reference architecture implementation for HBase has the same three server roles as described in "InfoSphere BigInsights predefined configuration":

- Management nodes. Based on the System x3550 M4 server, management nodes house the following HDFS, MapReduce, and HBase services: NameNode, Secondary NameNode, JobTracker, HMaster, and ZooKeeper.
- Data nodes. Based on the System x3650 M4 BD server, data nodes house the following HDFS, MapReduce, and HBase services: DataNode, TaskTracker, and HRegionServer.
- Edge nodes.

Within a BigInsights cluster running HBase, there is a specific number of master nodes and a variable number of data nodes, which are based on customer requirements.
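ZooKeeper stays available only while a majority (quorum) of its instances survives, which is why the management node tables later in this paper recommend odd ensemble sizes. This small Python sketch (illustrative, not from the paper) shows the standard majority rule:

```python
def quorum(ensemble_size):
    # ZooKeeper needs a strict majority of instances to form a quorum.
    return ensemble_size // 2 + 1

for n in (3, 4, 5):
    print(f"{n} ZooKeepers: quorum {quorum(n)}, "
          f"tolerates {n - quorum(n)} failure(s)")
# 3 and 4 instances both tolerate only 1 failure, so the even
# fourth instance adds cost without adding failure tolerance.
```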
Component model

Figure 9 illustrates the component model for the InfoSphere BigInsights HBase reference architecture: the management nodes host the NameNode, Secondary NameNode, JobTracker, HMaster, and BigInsights Console services, with ZooKeeper instances distributed across them, and each data node hosts the HRegionServer, DataNode, and TaskTracker daemons.

Figure 9: InfoSphere BigInsights HBase reference architecture component model

Implementing HBase requires a few modifications to the predefined configuration that is described in "InfoSphere BigInsights predefined configuration". For considerations specific to HBase for the management nodes and data nodes, see "Cluster node configuration". Networking configuration, edge node considerations, and power considerations for the InfoSphere BigInsights HBase predefined configuration are identical to those of the InfoSphere BigInsights predefined configuration. For more information, see "Networking configuration" and "Power considerations".

Cluster node configuration

This section describes the predefined configurations for management nodes and data nodes for an InfoSphere BigInsights HBase solution. The networking configuration is the same as the configuration that is described in "Networking configuration".

Management node configuration and sizing

Management nodes house the following HDFS, MapReduce, HBase, and BigInsights management services: NameNode, Secondary NameNode, JobTracker, HMaster, ZooKeeper, and BigInsights Console. The management node is based on the IBM System x3550 M4 server. Table 8 describes the predefined configuration of a management node.
Table 8: Management node predefined configuration

- System: x3550 M4
- Processor: 2 x Intel Xeon E5-2650 v2 2.6 GHz 8-core
- Memory (base): 128 GB = 8 x 16 GB 1866 MHz RDIMM
- Disk (OS and application): 1, 2, or 3 x 3.5-inch SATA (same capacity as data nodes) (a)
- HDD controller: ServeRAID M5110 SAS/SATA Controller
- Hardware storage protection: RAID hardware mirroring of two disk drives
- User space (per server): None
- Administration/management network adapter: Integrated 1GBaseT Adapter
- Data network adapter: 2 x Mellanox ConnectX-3 EN Dual-port SFP+ 10GbE Adapters

(a) The recommended default number of drives is two, to provide fault tolerance based on RAID hardware mirroring of the two drives.

An InfoSphere BigInsights Hadoop cluster that is running HBase requires 1 - 6 management nodes, depending on the cluster size. Table 9 specifies the number of required management nodes. The columns that contain node information represent BigInsights Hadoop daemons that are housed across cluster management nodes.

Table 9: Required management nodes

- Starter cluster: 1 management node. Node 1: NameNode (a), JobTracker, HMaster, BigInsights Console, ZooKeeper.
- Fewer than 20 data nodes: 4 management nodes (b). Node 1: NameNode, ZooKeeper (c). Node 2: JobTracker, HMaster, ZooKeeper. Node 3: Secondary NameNode, HMaster, ZooKeeper. Node 4: BigInsights Console.
- 20 or more data nodes: 6 management nodes (d). Node 1: NameNode, ZooKeeper. Node 2: Secondary NameNode (e), ZooKeeper. Node 3: JobTracker, ZooKeeper. Node 4: HMaster, ZooKeeper. Node 5: HMaster, ZooKeeper. Node 6: BigInsights Console.

(a) In a single management node configuration, to enable recoverability of the HDFS metadata if a failure of the management node occurs, place the Secondary NameNode on a data node.
(b) For HBase fault tolerance and HDFS fault recovery if a management node failure occurs, do not place management nodes 1 and 2 in the same rack as management nodes 3 and 4.
(c) There is no fixed approach to the number of ZooKeepers, and more than five instances is certainly possible. However, we recommend an odd number of ZooKeeper instances. In some failure modes, odd numbers of ZooKeeper instances permit the ZooKeeper quorum to be established with a smaller number of surviving instances.
(d) For HBase fault tolerance and HDFS fault recovery if a management node failure occurs, do not place management nodes 1 and 2 in the same rack, and do not place management nodes 4 and 5 in the same rack. If a UPS is utilized, the recommendation is to distribute management nodes such that power to all management nodes is provided via the UPS source, to allow management-related data to be synced down to local disk or to HA NFS.
(e) For HDFS NameNode high availability, the Secondary NameNode can be substituted with a second HDFS NameNode service. Typically, place the active NameNode on the node with the fewest total management services running.

Data node configuration and sizing

Data nodes house the following Hadoop services: DataNode, TaskTracker, and HRegionServer. The data node is based on the System x3650 M4 BD storage-rich server. This data node differs from the base InfoSphere BigInsights predefined configuration in that HBase data nodes have greater memory capacity. Table 10 describes the predefined configuration for a data node.

Table 10: Data node predefined configuration

- System: x3650 M4 BD
- Processor: 2 x Intel Xeon E5-2650 v2 2.6 GHz 8-core
- Memory (base): 128 GB = 16 x 8 GB 1866 MHz RDIMM
- Disk (OS) (a): 1 TB drives: 1 or 2 x 1 TB NL SATA 3.5-inch. 2 TB drives: 1 or 2 x 2 TB NL SATA 3.5-inch.
- Disk (data) (b)(c): 1 TB drives: 6 to 12 x 1 TB NL SATA 3.5-inch (12 TB total). 2 TB drives: 6 to 12 x 2 TB NL SATA 3.5-inch (24 TB total).
- HDD controller: N2215 12 Gb JBOD Controller
- Hardware storage protection: None (JBOD). By default, HDFS maintains a total of three copies of data that is stored within the cluster. The copies are distributed across data servers and racks for fault recovery.
- Administration/management network adapter: Integrated 1GBaseT Adapter
- Data network adapter: Mellanox ConnectX-3 EN Dual-port SFP+ 10GbE Adapter

(a) The OS drives are recommended to be the same size as the data drives. If two OS drives are used, the drives can be configured in either a JBOD or RAID hardware mirroring configuration. Available space on the OS drives can also be used for extra HDFS storage, extra MapReduce shuffle/sort space, or both.
(b) All data drives should be of the same size, either 1 TB or 2 TB.
(c) There is a direct relationship between HBase RegionServer JVM heap size and disk capacity, whereby the maximum effective disk space that is usable by an HBase RegionServer depends on the JVM heap size. For more information, see the HBase blog entry entitled "HBase region server memory sizing".

When you estimate disk space within a BigInsights HBase cluster, keep in mind the following considerations:

- For improved fault tolerance and improved performance, HDFS replicates data blocks across multiple cluster data nodes. By default, HDFS maintains three replicas.
- Reserve approximately 25% of total available disk space for shuffle/sort space.
- The compression ratio is an important consideration in estimating disk space. Within Hadoop, both the user data and the shuffle data can be compressed. Assume 35% compression if customer-specific compression data is not available.
Note: A 35% compression is an estimate based on measurements taken in a controlled environment. Compression results vary based on the data and compression libraries used. IBM cannot guarantee compression results or compressed data storage amounts. Improved estimates can be calculated by testing customer data with appropriate compression libraries.

- Add an extra 30 - 50% for HBase HFile storage and compaction.

Assuming that the default three replicas are maintained by HDFS and allowing for the HFile storage requirements, the upper bound of the total cluster data space and the required number of data nodes can be estimated by using the following equations:

Total Data Disk Space = 4 x (Uncompressed Raw User Data) x (1 - Compression Ratio) x 150%
Total Required Data Nodes = (Total Data Disk Space) / (Data Disk Space per Server)

When you estimate disk space, also consider future growth requirements.
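A hedged Python sketch of this HBase sizing follows (illustrative names; it assumes the same reconstructed factor of 4 for replicas plus the shuffle/sort reserve, and it expresses the 30 - 50% HFile allowance as a multiplier):

```python
import math

def size_hbase_cluster(raw_user_data_tb, data_disk_space_per_server_tb,
                       compression=0.35, hfile_overhead=0.50):
    """Upper-bound disk space and data node count for a BigInsights HBase cluster."""
    compressed_tb = raw_user_data_tb * (1 - compression)
    # 4x covers three HDFS replicas plus the 25% shuffle/sort reserve;
    # hfile_overhead (0.30-0.50) covers HFile storage and compaction.
    total_disk_tb = 4 * compressed_tb * (1 + hfile_overhead)
    data_nodes = math.ceil(total_disk_tb / data_disk_space_per_server_tb)
    return total_disk_tb, data_nodes

# Example: 100 TB of raw data on HBase data nodes with 12 x 2 TB drives (24 TB each)
total, nodes = size_hbase_cluster(100, 24)
print(f"~{total:.0f} TB of disk across {nodes} data nodes")  # ~390 TB, 17 nodes
```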
Rack considerations

Within a rack, each data node occupies 2U, and each management node or switch occupies 1U. The HBase implementation can be deployed in a single-rack or multirack configuration. Table 11 outlines the rack considerations.

Important: If the system is initially implemented as a multirack solution or if the system grows by adding more racks, distribute the cluster management nodes across the racks to maximize fault tolerance.

Table 11: Rack considerations

- Starter rack: 1 rack; 3 data nodes (b); 1 management node per cluster.
- Fewer than 20 data nodes: 1 or more racks; maximum data nodes per rack as calculated below (a); 4 management nodes per cluster.
- 20 or more data nodes: 2 or more racks (c); maximum data nodes per rack as calculated below (a); 6 management nodes per cluster.

(a) The maximum number of data nodes per full rack is based on network switches, management nodes, and data nodes. Adding edge nodes to the rack can displace extra data nodes.
(b) A starter rack can be expanded to a full rack by adding more data and management nodes.
(c) The actual maximum depends on the number of racks that are implemented. To maximize fault tolerance, distribute management nodes across racks. Every two management nodes within a rack displace one data node.

In the reference architecture for the InfoSphere BigInsights solution, a fully populated predefined rack with one G8264 switch and one G8052 switch can support up to 20 data nodes. However, the total number of data nodes that a rack can accommodate can vary based on the number of top-of-rack switches and management nodes that are required for the rack within the overall solution design. The number of data nodes can be calculated as follows:

Maximum number of data nodes = (42U - (# 1U Switches + # 1U Management Nodes)) / 2

Edge nodes: This calculation does not consider edge nodes. Based on the client's choice of edge node, proportions can vary. Every two 1U edge nodes displace one data node, and every 2U edge node displaces one data node.

Deployment considerations

This section describes the deployment considerations for the InfoSphere BigInsights and the InfoSphere BigInsights HBase reference architectures.

Scalability

The Hadoop architecture is linearly scalable, although some workloads might not scale linearly. When the capacity of the existing infrastructure is reached, the cluster can be scaled out by adding more data nodes and, if necessary, management nodes. As the capacity of existing racks is reached, new racks can be added to the cluster.

When you design a new InfoSphere BigInsights reference architecture implementation, future scale-out is a key consideration in the initial design. You must consider the two key aspects of networking and management. Both of these aspects are critical to cluster operation and become more complex as the cluster infrastructure grows. The networking model that is described in "Networking configuration" is designed to provide robust network interconnection of racks within the cluster. As more racks are added, the predefined networking topology remains balanced and symmetrical. If there are plans to scale the cluster beyond one rack, initially design the cluster with multiple racks, even if the initial number of nodes might fit within one rack. Starting with multiple racks will enforce proper network topology and prevent future reconfiguration and hardware changes.

Also, as the number of nodes within the cluster increases, many of the tasks of managing the cluster also increase, such as updating node firmware or operating systems. Building a cluster management framework as part of the initial design and proactively considering the challenges of managing a large cluster will pay off significantly in the end. Platform Cluster Manager, Standard Edition and the Extreme Cloud Administration Toolkit (xCAT), an open source project that IBM supports, are scalable distributed computing management and provisioning tools that provide a unified interface for hardware control, discovery, and operating system deployment. In contrast to the command-line scripting environment that xCAT provides, Platform Cluster Manager, Standard Edition provides a robust and easy-to-use GUI-based tool that accelerates time to value for deploying, managing, and monitoring a clustered hardware infrastructure. Within the InfoSphere BigInsights reference architecture, the System x server IMM2 and the cluster management network provide an out-of-band management framework that management tools, such as Platform Cluster Manager or xCAT, can use to facilitate or automate the management of cluster nodes.

Proactive planning for future scale-out and the development of a cluster management framework as part of the initial cluster design provide a foundation for future growth that minimizes hardware reconfigurations and cluster management issues as the cluster grows.

Availability

When you implement an IBM InfoSphere BigInsights cluster on System x servers, consider the availability requirements as part of the final hardware and software configuration.
Typically, Hadoop is considered a highly reliable solution. Hadoop and InfoSphere BigInsights best practices provide significant protection against data loss, and failures can generally be managed without causing an outage. Redundancy can be added to make a cluster even more reliable. Consider both hardware and software redundancy.

Customizing the predefined configurations

The predefined configuration provides a baseline configuration for an InfoSphere BigInsights cluster and includes modifications for an InfoSphere BigInsights cluster that is running HBase. The predefined configurations represent a baseline that can be implemented as is or modified based on specific client requirements, such as lower cost, improved performance, and increased reliability.

When you consider modifying the predefined configuration, you must understand key aspects of how the cluster will be used. In terms of data, you must understand the current and future total data to be managed, the size of a typical data set, and whether access to the data will be uniform or skewed. In terms of ingest, you must understand the volume of data to be ingested and the ingest patterns, such as regular cycles over specific time periods and bursts in ingest. Consider also the data access and processing characteristics of common jobs and whether query-like frameworks, such as IBM BigSQL, are used. When designing an InfoSphere BigInsights cluster infrastructure, we recommend conducting the necessary testing and proofs of concept against representative data and workloads to ensure that the proposed design will achieve the necessary success criteria.

The following sections provide information about customizing the predefined configuration. When considering customizations to the predefined configuration, work with a systems architect who is experienced in designing InfoSphere BigInsights cluster infrastructures.

Designing for high availability

Designing for high availability entails assessing potential failure points and planning so that potential failure points do not impact the operation of the cluster. Whenever you address enhanced high availability, you must understand and consider the trade-offs between the cost of an outage and the cost of adding redundant hardware components. Within an InfoSphere BigInsights cluster, several single points of failure exist:

- A typical Hadoop HDFS is implemented with a single NameNode service instance. A couple of options exist to address this issue. InfoSphere BigInsights 2.1 supports an active/standby redundant NameNode configuration as an alternative to the standard NameNode/Secondary NameNode configuration. For more information about a redundant NameNode service within an InfoSphere BigInsights cluster, see "Management node configuration and sizing". Also, IBM General Parallel File System (GPFS) FPO, included within InfoSphere BigInsights 2.1 Enterprise Edition, overcomes the potential for NameNode failures by not depending on stand-alone services to manage the distributed file system metadata. GPFS-FPO also has the added benefits of being POSIX-compliant, having more robust tools for online management of underlying storage, point-in-time snapshot capabilities, and off-site replication. For more information about GPFS-FPO, see "GPFS-FPO considerations".
- Even with HDFS maintaining three copies of each data block, a disk drive failure can impact cluster functioning and performance.
Nonetheless, the potential for disk failures can be addressed by using RAID 5 or RAID 6 to manage data storage within each data node. The JBOD controller that is specified within the predefined configuration can be substituted with the ServeRAID M5110. Implementing RAID 5 reduces the total available data disk storage by approximately 8.3%, and implementing RAID 6 reduces the total available data disk storage by approximately 16.6%. The use of RAID within Hadoop clusters is atypical and should be considered only for enterprise clients who are sensitive to disk failures, because the use of RAID can impact performance.
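The capacity cost of these RAID choices is simple arithmetic over the 12-drive data node; a minimal Python sketch (illustrative, assuming one RAID group across all 12 data drives):

```python
def usable_fraction(drives=12, parity_drives=0):
    # RAID 5 sacrifices one drive of capacity; RAID 6 sacrifices two.
    return (drives - parity_drives) / drives

for name, parity in (("JBOD", 0), ("RAID 5", 1), ("RAID 6", 2)):
    lost = 1 - usable_fraction(parity_drives=parity)
    print(f"{name}: {lost:.1%} of raw data disk capacity lost")
# JBOD: 0.0%, RAID 5: 8.3%, RAID 6: 16.7% (the ~16.6% cited above)
```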
The data network in the predefined reference architecture configuration consists of a single network topology, and the single G8264 data network switch within each rack represents a single point of failure. This challenge can be addressed by building a redundant data network that uses an extra IBM RackSwitch G8264 top-of-rack switch per rack and appropriate extra IBM RackSwitch G8316 or G8332 core switches per cluster. Figure 10 shows the network cabling within a rack for a redundant data network. In a redundant data network configuration, one 10 GbE link from each node connects back to one G8264 switch, and the other connects back to the second (redundant) G8264. Figure 11 shows how multiple racks are aggregated by using redundant IBM System Networking G8316 or G8332 switches.

Figure 10: Redundant networking predefined configuration
Figure 11 shows the cross-rack networking by using the core switches.

Figure 11 Cross rack networking (the figure shows the G8264 data network switches in each rack uplinked to redundant G8316 or G8332 core switches, the G8052 administration/management switches, the edge nodes, and the uplinks to the customer data and administration networks; the data network uses private IP addresses, and the admin/IMM network uses corporate IP addresses)

Designing for high performance

To increase cluster performance, you can increase data node memory or use a high-performance job scheduler, such as IBM Platform Symphony, within the MapReduce framework. Often, improving performance comes at increased cost. Therefore, you must consider the cost/benefit trade-offs of designing for higher performance. In the InfoSphere BigInsights predefined configuration, data node memory can be increased to 128 GB by using sixteen 8 GB RDIMMs. However, in the HBase predefined configuration, data node memory is already set to 128 GB. The maximum memory that can be placed within the x3650 M4 BD data node is sixteen 16 GB RDIMMs, which is a total of 256 GB.

The impact of MapReduce shuffle file and other temporary file I/O on data node performance can be workload dependent. In some cases, data node performance can be increased by utilizing solid-state disk (SSD) for MapReduce shuffle files and other temporary files. The System x Reference Architecture for Hadoop: InfoSphere BigInsights data nodes utilize the N2215 12 Gbps HBA. This HBA provides expanded bandwidth to exploit the performance-enhancing characteristics of placing MapReduce shuffle files and other temporary files on SSD.
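To make the memory and RAID figures from the preceding sections concrete, the following minimal Python sketch reproduces the arithmetic. The 12-drive RAID group and the DIMM populations are illustrative assumptions for this sketch, not a tool that ships with the reference architecture.

    # Minimal sanity checks for the memory and RAID figures discussed above.
    # The 12-drive RAID group and DIMM counts are illustrative assumptions.

    # Data node memory population (the x3650 M4 BD has 16 DIMM slots).
    print(16 * 8, "GB with sixteen 8 GB RDIMMs")     # 128 GB
    print(16 * 16, "GB with sixteen 16 GB RDIMMs")   # 256 GB maximum

    # Data disk capacity lost to parity for a single 12-drive RAID group.
    drives = 12
    for level, parity in (("RAID 5", 1), ("RAID 6", 2)):
        overhead = parity / drives
        print(f"{level}: {overhead:.1%} of data disk capacity lost")  # ~8.3%, ~16.7%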
When considering the use of SSD, it is important to ensure consistency in SSD to HDD capacity proportions across all BigInsights cluster data nodes.

Designing for lower cost

Two key modifications can be made to lower the cost of an InfoSphere BigInsights reference architecture solution. When you consider lower-cost options, ensure that clients understand the potential lower performance implications of a lower-cost design. A lower-cost version of the InfoSphere BigInsights reference architecture can be achieved by using lower-cost data node processors and a lower-cost cluster data network infrastructure. The data node processors can be substituted with lower-cost 6-core processors from the Intel Xeon E5 v2 family. These processors use 1333 MHz RDIMMs, which can also lower the per-node cost of the solution.

Using a lower-cost network infrastructure can significantly lower the cost of the solution, but can also have a substantial negative impact on intracluster data throughput and cluster ingest rates. To use a lower-cost network infrastructure, use the following substitutions in the predefined configuration:

- Within each node (data nodes and management nodes), substitute the Mellanox 10 GbE dual-port SFP+ network adapter with the extra ports on the integrated 1GBaseT adapters within the x3550 M4 and x3650 M4 BD.
- Within each rack, substitute the IBM RackSwitch G8264 top-of-rack switch with the IBM RackSwitch G8052.
- Within each cluster, substitute the IBM RackSwitch G8316 or G8332 core switch with the IBM RackSwitch G8264.

Although the network wiring schema is the same as the schema that is described in Networking configuration on page 18, the media types and link speeds within the data network are different. The data network within a rack that connects data nodes and management nodes to the lower-cost option G8052 top-of-rack switch is now based on two aggregated 1GBaseT links per node (management node and data node). The physical interconnect between the administration/management networks and the data networks within each rack is now based on two aggregated 1GBaseT links between the administration/management network G8052 switch and the lower-cost data network G8052 switch. Within a cluster, the racks are interconnected by using two aggregated 10 GbE links between the substitute G8052 data network switch in each rack and a lower-cost G8264 core switch.

Designing for high ingest rates

Ingesting data into an InfoSphere BigInsights cluster is accomplished by using edge nodes that are connected to the cluster data network switches within each rack (IBM RackSwitch G8264). For more information about cluster networking, see Networking configuration on page 18. For more information about edge nodes, see Edge node considerations on page 21.

Designing for high ingest rates is not a trivial matter. You must have a full characterization of the ingest patterns and volumes. For example, you must know on which days and at what times the source systems are available or not available for ingest. You must know when a source system is available for ingest and the duration for which the system remains available. You must also know what other factors impact the day, time, and duration ingest constraints. In addition, when ingests occur, consider the average and maximum size of ingest that must be completed, the factors that impact ingest size, and the format of the source data (structured, semi-structured, or unstructured).
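As a first-order feel for what such a characterization implies, the short Python sketch below estimates the aggregate bandwidth and the number of 10 GbE edge nodes needed to land a given volume within an ingest window. The volume, window, and efficiency values are illustrative assumptions only; substitute measured values from the characterization described above.

    import math

    # Illustrative assumptions for one ingest window.
    ingest_tb = 20          # data to land in the window (TB)
    window_hours = 4        # source system availability window (hours)
    link_gbps = 10          # one 10 GbE data network port per edge node
    efficiency = 0.6        # assumed sustained fraction of line rate

    # Required aggregate throughput in Gbps (1 TB = 8,000 Gb, decimal units).
    required_gbps = ingest_tb * 8000 / (window_hours * 3600)
    edge_nodes = math.ceil(required_gbps / (link_gbps * efficiency))

    print(f"required aggregate ingest rate: {required_gbps:.1f} Gbps")
    print(f"10 GbE edge nodes needed (at {efficiency:.0%} efficiency): {edge_nodes}")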
Also determine whether any data transformation or cleansing requirements must be achieved during the ingest. If the client is using or planning to use ETL software for ingest, such as IBM InfoSphere DataStage, consult the appropriate ETL specialist, such as an IBM DataStage architect, to help size the appropriate edge node configuration. The key to successfully addressing a high ingest rate is to ensure that the number and physical attributes of edge nodes are sufficient for the throughput and processing needs of both ingest and ETL.

Designing for higher per data node storage capacity or archiving

In situations where higher capacity is required, the main design approach is to increase the amount of data disk space per data node. Using 4 TB drives instead of 3 TB drives increases the total per data node data disk capacity from 36 TB to 48 TB, which is a 33% increase. If 4 TB drives are used as data disk drives, increase the OS drives to 4 TB. When you increase data disk capacity, you must be cognizant of the balance between performance and throughput. For some workloads, increasing the amount of user data that is stored per node can decrease disk parallelism and negatively impact performance. A worked check of this arithmetic appears after the options table below.

The value, enterprise, and performance options

Table 12 highlights potential modifications of the predefined configuration of the InfoSphere BigInsights reference architecture for data nodes that provide a value option, the predefined configuration, and a performance option. These options represent three potential modification scenarios and are intended as example modifications. Any modification of the predefined configurations should be based on client requirements and is not limited to these examples.

Table 12 System x reference architecture for InfoSphere BigInsights x3650 M4 BD data node options

Processor:
- Value option: 2 x Intel Xeon E5 v2 2.2 GHz 6-core
- Predefined option: 2 x Intel Xeon E5-2650 v2 2.6 GHz 8-core
- Performance option: 2 x Intel Xeon E5-2680 v2 2.8 GHz 10-core

Memory (base):
- Value option: 48 GB = 6 x 8 GB 1866 MHz RDIMMs
- Predefined option: 64 GB = 8 x 8 GB 1866 MHz RDIMMs
- Performance option: 128 GB = 16 x 8 GB 1866 MHz RDIMMs (a)

Disk (data and OS):
- Value option: MapReduce: 13 or 14 x 3 TB NL SATA 3.5-inch; HBase: 6 to 12 data drives and 1 or 2 OS drives x 1 TB NL SATA 3.5-inch
- Predefined option: MapReduce: 13 or 14 x 3 TB or 4 TB NL SATA 3.5-inch; HBase: 6 to 12 data drives and 1 or 2 OS drives x 1 TB or 2 TB NL SATA 3.5-inch
- Performance option: MapReduce: 13 or 14 x 3 TB or 4 TB NL SATA 3.5-inch; HBase: 6 to 12 data drives and 1 or 2 OS drives x 1 TB or 2 TB NL SATA 3.5-inch

HDD controller: N2215 12 Gb JBOD Controller (all options)
Hardware storage protection: None (JBOD) (all options)
Data network: 1 GbE (value option); 10 GbE (predefined and performance options)

a. The HBase predefined configuration for data nodes already specifies 96 GB of base memory.
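The per-node capacity arithmetic from the storage capacity section above is easy to verify. The drive count in this minimal Python sketch is an assumption based on the predefined data node disk layout.

    # Per data node raw data-disk capacity with 3 TB versus 4 TB drives.
    # Twelve data drives per node is an assumption for this sketch.
    DATA_DRIVES_PER_NODE = 12

    cap_3tb = DATA_DRIVES_PER_NODE * 3   # 36 TB per node
    cap_4tb = DATA_DRIVES_PER_NODE * 4   # 48 TB per node
    print(f"3 TB drives: {cap_3tb} TB per node")
    print(f"4 TB drives: {cap_4tb} TB per node "
          f"({(cap_4tb - cap_3tb) / cap_3tb:.0%} increase)")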
GPFS-FPO considerations

GPFS is an enterprise-class, distributed, single-namespace file system for high-performance computing environments that is scalable, offers high performance, and is reliable. GPFS-FPO (File Placement Optimizer) is based on a shared-nothing architecture so that each node on the file system can function independently and be self-sufficient within the cluster. Typically, GPFS-FPO can be a substitute for HDFS, removing the need for the HDFS NameNode, Secondary NameNode, and DataNode services. However, in performance-sensitive environments, placing GPFS metadata on higher-speed drives may improve performance of the GPFS file system.

GPFS-FPO has significant and beneficial architectural differences from HDFS. HDFS is a Java-based file system that runs on top of the operating system file system and is not POSIX-compliant. GPFS-FPO is a POSIX-compliant, kernel-level file system that provides Hadoop a single-namespace, distributed file system with some performance, manageability, and reliability advantages over HDFS. As a kernel-level file system, GPFS is free from the overhead that is incurred by HDFS as a secondary file system running within a JVM on top of the operating system's file system. As a POSIX-compliant file system, files that are stored in GPFS-FPO are visible to authorized users and applications by using standard file access/management commands and APIs. An authorized user can list, copy, move, or delete files in GPFS-FPO by using traditional operating system file management commands without logging in to Hadoop. Additionally, GPFS-FPO has significant advantages over HDFS for backup and replication. GPFS-FPO provides point-in-time snapshot backup and off-site replication capabilities that significantly enhance cluster backup and replication capabilities.

When substituting GPFS-FPO for HDFS as the cluster file system, the HDFS NameNode and Secondary NameNode daemons are not required on cluster management nodes, and the HDFS DataNode daemon is not required on cluster data nodes. From an infrastructure design perspective, including GPFS-FPO can reduce the number of management nodes that are required. Because GPFS-FPO distributes metadata across the cluster, no dedicated name service is needed. Management nodes within the InfoSphere BigInsights predefined configuration or BigInsights HBase predefined configuration that are dedicated to running the HDFS NameNode or Secondary NameNode services can be eliminated from the design. The reduced number of required management nodes can provide sufficient space to allow for more data nodes within a rack.

For more information about implementing GPFS-FPO in an InfoSphere BigInsights solution, see the white paper entitled Deploying a big data solution using IBM GPFS-FPO, which can be found at the following link:
_DC_ZQ_USEN&htmlfid=DCW0305USEN&attachment=DCW0305USEN.PDF
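To illustrate the POSIX compliance point, the short Python sketch below manipulates files on a GPFS-FPO file system with ordinary operating system calls; the mount point and file names are hypothetical. The equivalent operations against HDFS would have to go through the Hadoop client (for example, hadoop fs -ls).

    import os
    import shutil

    GPFS_MOUNT = "/gpfs/bigdata"  # hypothetical GPFS-FPO mount point

    # Standard OS file APIs work directly against GPFS-FPO; no Hadoop login
    # or hadoop fs CLI is needed to list, copy, or remove files.
    for name in sorted(os.listdir(GPFS_MOUNT)):
        print(name)

    src = os.path.join(GPFS_MOUNT, "landing", "input.csv")
    dst = os.path.join(GPFS_MOUNT, "archive", "input.csv")
    shutil.copy(src, dst)   # ordinary file copy by an authorized user
    os.remove(src)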
Platform Symphony considerations

InfoSphere BigInsights is built on Apache Hadoop, an open source software framework that supports data-intensive, distributed applications. By using open source Hadoop, and extending it with advanced analytic tools and other added-value capabilities, InfoSphere BigInsights helps organizations of all sizes more efficiently manage the vast amounts of data that consumers and businesses create every day.

At its core, Hadoop is a distributed computing environment (DCE) that manages the execution of distributed jobs and tasks on a cluster. As with any DCE, the Hadoop software must provide facilities for resource management, scheduling, remote execution, and exception handling. Although Hadoop provides basic capabilities in these areas, these are problems that IBM Platform Computing has been working to perfect for 20 years. IBM Platform Symphony is a low-latency scheduling solution that supports true multitenancy and sophisticated workload management capabilities. Platform Symphony Advanced Edition includes a Hadoop-compatible Java MapReduce API that is optimized for low-latency MapReduce workloads. Higher-level Hadoop applications, such as Pig, Hive, Jaql, and other BigInsights components, run directly on the MapReduce framework of Platform Symphony. Hadoop components, such as the Hadoop MapReduce Version 1 JobTracker and TaskTracker, have been reimplemented as Platform Symphony applications. They take advantage of the fast middleware, resource sharing, and fine-grained scheduling capabilities of Platform Symphony.

When Platform Symphony is deployed with InfoSphere BigInsights, Platform Symphony replaces the open source MapReduce layer in the Hadoop framework. Platform Symphony itself is not a Hadoop distribution. Platform Symphony replaces the MapReduce scheduling layer in the InfoSphere BigInsights software environment to provide better performance and multitenancy in a way that is transparent to InfoSphere BigInsights and InfoSphere BigInsights users. Clients who are deploying InfoSphere BigInsights or other big data application environments can realize significant advantages by using Platform Symphony as a grid manager:

- Better application performance
- Opportunities to reduce costs through better infrastructure sharing
- The ability to guarantee application availability and quality of service
- Ensured responsiveness for interactive workloads
- Simplified management by using a single management layer for multiple clients and workloads

Platform Symphony will be especially beneficial to InfoSphere BigInsights clients who are running heterogeneous workloads that benefit from low-latency scheduling. The resource sharing and cost savings opportunities that are provided by Platform Symphony extend to all types of workloads.

Platform Cluster Manager Standard Edition considerations

IBM Platform Cluster Manager Standard Edition is cluster management software for deploying, monitoring, and managing scale-out compute clusters. It uses xCAT as the embedded provisioning engine but hides the complexity of the open source tool. It also includes a scalable, flexible, and robust monitoring agent technology that shares the same code base with IBM Platform Resource Scheduler, IBM Platform LSF, and IBM Platform Symphony. The monitoring and management web GUI is powerful and intuitive. Platform Cluster Manager Standard Edition provides the following capabilities to the BigInsights environment:

- Bare metal provisioning: Platform Cluster Manager can quickly deploy the operating system, device drivers, and BigInsights software components automatically to a bare metal cluster node.
- OS updates: Update and patch the operating system on cluster nodes from the management console. If the updates do not change the OS kernel, there is no need to reboot the node, which is less disruptive for the production environment.
- Hardware management and control: Provides power control, firmware updates, server LED control, as well as various consoles to the nodes (BMC, SSH, VNC, and so on).
- System monitoring: Monitors both the hardware and system performance for all nodes in the BigInsights environment from its monitoring console. Custom monitoring metrics can be easily added by changing the configuration files. After the custom metrics are added, an administrator can define alerts that use these metrics, as well as alert-triggered actions. This automates the system management of the BigInsights cluster.
- GPFS monitoring: Platform Cluster Manager has built-in GPFS monitoring. GPFS is one of the optional distributed file systems supported by BigInsights. GPFS capacity and bandwidth are monitored from the management console to allow system administrators to correlate system and storage performance information for troubleshooting and capacity planning.

IBM Platform Cluster Manager Standard Edition simplifies the deployment and management of the BigInsights infrastructure environment. Furthermore, using Platform Cluster Manager Standard Edition does not require any changes to your BigInsights software configuration.

Hardware deployment and management node (optional)

Within a BigInsights environment, the deployment and ongoing management of the hardware infrastructure is a non-trivial challenge. Deploying many nodes requires configuring node hardware in a consistent manner, along with the ability to apply node-specific hardware parameters, such as IP addresses and host names. After deployment is complete, maintaining consistent operating system, driver, and firmware levels or efficiently making changes to the hardware configuration (for example, changing hardware tuning parameters) requires a robust toolset as clusters grow to hundreds of nodes and larger.

Using a separate hardware deployment and management node within the cluster provides a platform that is independent of the wider cluster where tasks such as initial cluster deployment, scale-out node deployment, and ongoing hardware management can be performed. Such a node is ideal for housing hardware management tools such as Platform Cluster Manager Standard Edition and xCAT, as well as the cluster hardware deployment and configuration state data that these tools maintain. Table 13 shows the recommended configuration for a hardware deployment and management node.

Table 13 Optional hardware deployment and management node: recommended configuration
- Server: System x3550 M4
- Processor: 1 x Intel Xeon E5 v2 2.5 GHz 4-core
- Memory (base): 16 GB (1 x 16 GB) 1866 MHz RDIMM
- Disk: 2 x 1 TB 3.5-inch NL SATA
- HDD controller: ServeRAID M1115 for System x
- Storage protection: Hardware mirroring (RAID 1)
- Network (Admin and IMM networks): Integrated 1 Gb Ethernet ports (a)
a. The x3550 M4 comes standard with two of its four integrated 1 Gb Ethernet ports activated. These ports should be used to connect the hardware deployment and management node to both the cluster in-band administration network and the out-of-band Integrated Management Module (IMM) network. Access to the IMM network is necessary for low-level hardware management tasks, such as applying firmware updates or modifying BIOS parameters.

In environments where fault tolerance and high availability are required for the hardware deployment and management tools, the use of two hardware deployment and management nodes is recommended.

General-purpose big data nodes

This document has focused on using the predefined x3550 M4 management node and x3650 M4 BD data node described herein as core components of an InfoSphere BigInsights solution. However, these predefined nodes can often be used to support other big data workloads that are often associated with an InfoSphere BigInsights solution. The predefined data node, which is based on the System x3650 M4 BD server, offers storage-rich capacity, memory efficiency, and outstanding uptime, providing an ideal, purpose-built platform for big data workloads. The x3650 M4 BD is a purpose-built platform for big data workloads requiring high throughput and high capacity, such as databases like DB2, ETL tools like InfoSphere DataStage, or search and discovery tools like InfoSphere Data Explorer. The predefined management node, which is based on the System x3550 M4 server, offers high memory capacity and high throughput for management services, such as GPFS-FPO metadata servers and Platform Symphony management servers, or data-in-motion analytics tools, such as InfoSphere Streams. The x3650 M4 BD and the x3550 M4 can provide perfect platforms for many of the tools and services that are often a part of a complete solution.

Predefined configuration bill of materials

This section describes the predefined configuration bill of materials for IBM InfoSphere BigInsights.

InfoSphere BigInsights predefined configuration bill of materials

This section provides ordering information for the InfoSphere BigInsights predefined configuration bill of materials. This bill of materials is provided as a sample of a full rack configuration. It is intended as an example only. Actual configurations will vary based on geographic region, cluster size, and specific client requirements.
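Before walking through the parts lists, a hedged sanity check of the headline quantities is useful: a full rack holds 18 data nodes, and each x3650 M4 BD data node is assumed to carry 14 3.5-inch drives (12 front plus 2 rear) and six 8 GB RDIMMs in this sample, which matches the 252-drive and 108-DIMM line items in the tables that follow. The drives-per-node and DIMMs-per-node values are assumptions inferred from those line items.

    # Hedged sanity check of the sample bill of materials quantities below.
    DATA_NODES = 18        # data nodes in a full rack
    DRIVES_PER_NODE = 14   # assumed 12 front + 2 rear 3.5-inch bays per node
    DIMMS_PER_NODE = 6     # assumed 6 x 8 GB RDIMMs per node in this sample

    assert DATA_NODES * DRIVES_PER_NODE == 252   # 3 TB NL SATA HDD line item
    assert DATA_NODES * DIMMS_PER_NODE == 108    # 8 GB RDIMM line item
    print("BOM drive and DIMM quantities are internally consistent")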
Data node

Table 14 lists the parts information for 18 data nodes.

Table 14 Data node bill of materials

Part number | Description | Quantity
5466 | IBM System x3650 M4 BD | 18
AT7 | PCIe Riser Card 2 (1 x8 LP for Slotless RAID) | 18
AT6 | PCIe Riser Card 1 for slot 1 (1 x8 FH/HL + 1 x8 LP Slots) | 18
A2ZQ | Mellanox ConnectX-3 EN Dual-port SFP+ 10GbE Adapter | 18
– | Select Storage devices; RAID configured by IBM is not required | 18
A22S | IBM 3TB 7.2K 6 Gbps NL SATA 3.5-inch G2HS HDD | 252
ARW | IBM System x 900W High Efficiency Platinum AC Power Supply | 18
AWC | System Documentation and Software, US English | 18
AS | Intel Xeon Processor E5-2650 v2 8C 2.6 GHz 20 MB Cache 1866 MHz 95W | 18
AAS | Additional Intel Xeon Processor E5-2650 v2 8C 2.6 GHz 20 MB Cache 1866 MHz 95W | 18
A3QG | 8 GB (1 x8 GB, 1Rx4, 1.5V) PC3-14900 CL13 ECC DDR3 1866 MHz LP RDIMM | 108
A3YY | N2215 SAS/SATA HBA for IBM System x | 18
ARQ | System x3650 M4 BD Planar | 18
ARG | System x3650 M4 BD Chassis ASM without Planar | 18
– | m, 10A/100-250V, C13 to IEC 320-C14 Rack Power Cable | 18
ARR | 3.5-inch Hot Swap BP Bracket Assembly, 12x | 18
ARS | 3.5-inch Hot Swap Cage Assembly, Rear, 2x | 18
– | Rack Installation >1U Component | 18
ARH | BIOS GBM | 18
ARJ | L COPT, 1U RISER CAGE - SLOT 2 | 18
ARK | L COPT, 1U BUTTERFLY RISER CAGE - SLOT 1 | 18
ARN | x3650 M4 BD Agency Label | 18
ARP | Label GBM | 18
A50F | 2x2 HDD BRACKET | 18
A207 | Rail Kit for x3650 M4 BD, x3630 M4, and x3530 M4 | 18
A2M3 | Shipping Bracket for x3650 M4 BD and x3630 M4 | 18
Management node

Table 15 lists the parts information for three management nodes.

Table 15 Management node

Part number | Description | Quantity
79FT | System x3550 M4 | 3
AH3 | System x3550 M4 2.5-inch Base Without Power Supply | 3
– | Select Storage devices; RAID configured by IBM is not required | 3
AMZ | ServeRAID M1115 SAS/SATA Controller for System x | 3
A22S | IBM 3TB 7.2K 6Gbps NL SATA 3.5" G2HS HDD | 6
AFD | IBM 3.5" Hot Swap Filler | 3
A228 | IBM System x Gen-III Slides Kit | 3
A229 | IBM System x Gen-III CMA | 3
AHH | x3550 M4 3.5" HS Assembly Kit | 3
AML | IBM Integrated Management Module Advanced Upgrade | 3
AH5 | System x 750W High Efficiency Platinum AC Power Supply | 3
AHL | System x3550 M4 PCIe Gen-III Riser Card 2 (1 x16 FH/HL Slot) | 3
A2ZQ | Mellanox ConnectX-3 EN Dual-port SFP+ 10GbE Adapter | 6
AHJ | System x3550 M4 PCIe Riser Card 1 (1 x16 LP Slot) | 3
AHP | System Documentation and Software, US English | 3
A3QL | 16 GB (1 x16 GB, 2Rx4, 1.5V) PC3-14900 CL13 ECC DDR3 1866 MHz LP RDIMM | 12
AH5 | System x 750W High Efficiency Platinum AC Power Supply | 3
– | m, 10A/100-250V, C13 to IEC 320-C14 Rack Power Cable | 6
A2U6 | IBM System x Advanced Lightpath Kit | 3
A3WR | Intel Xeon Processor E5-2650 v2 8C 2.6 GHz 20 MB Cache 1866 MHz 95W | 3
A3X9 | Additional Intel Xeon Processor E5-2650 v2 8C 2.6 GHz 20 MB Cache 1866 MHz 95W with Fan | 3
– | Rack Installation of 1U Component | 3
A3XM | System x3550 M4 Planar | 3
AHB | System x3550 M4 System Level Code | 3
AHD | System x3550 M4 Agency Label GBM | 3

Edge node

Table 16 on page 35 lists the parts information for one edge node.
Table 16 Edge node

Part number | Description | Quantity
79FT | System x3550 M4 | 1
AH3 | System x3550 M4 2.5-inch Base Without Power Supply | 1
5977 | Select Storage devices; RAID configured by IBM is not required | 1
AMZ | ServeRAID M1115 SAS/SATA Controller for System x | 1
A2XD | IBM 600 GB 10K 6 Gbps SAS 2.5-inch SFF G2HS HDD | 4
A228 | IBM System x Gen-III Slides Kit | 1
A229 | IBM System x Gen-III CMA | 1
AHG | System x3550 M4 4x 2.5-inch HDD Assembly Kit | 1
AML | IBM Integrated Management Module Advanced Upgrade | 1
AH5 | System x 750W High Efficiency Platinum AC Power Supply | 1
AHL | System x3550 M4 PCIe Gen-III Riser Card 2 (1 x16 FH/HL Slot) | 1
A2ZQ | Mellanox ConnectX-3 EN Dual-port SFP+ 10GbE Adapter | 2
AHJ | System x3550 M4 PCIe Riser Card 1 (1 x16 LP Slot) | 1
AHP | System Documentation and Software, US English | 1
A3QL | 16 GB (1 x16 GB, 2Rx4, 1.5V) PC3-14900 CL13 ECC DDR3 1866 MHz LP RDIMM | 8
AH5 | System x 750W High Efficiency Platinum AC Power Supply | 1
– | m, 10A/100-250V, C13 to IEC 320-C14 Rack Power Cable | 2
A2U6 | IBM System x Advanced Lightpath Kit | 1
A3WR | Intel Xeon Processor E5-2650 v2 8C 2.6 GHz 20 MB Cache 1866 MHz 95W | 1
A3X9 | Additional Intel Xeon Processor E5-2650 v2 8C 2.6 GHz 20 MB Cache 1866 MHz 95W with Fan | 1
2305 | Rack Installation of 1U Component | 1
A3XM | System x3550 M4 Planar | 1
AHB | System x3550 M4 System Level Code | 1
AHD | System x3550 M4 Agency Label GBM | 1
Administration/management network switch

Table 17 lists the parts information for the administration/management network switch.

Table 17 Administration/management network switch

Part number | Description | Quantity
7309HC | IBM System Networking RackSwitch G8052 (Rear to Front) | 1
– | m, 10A/100-250V, C13 to IEC 320-C14 Rack Power Cable | 2
ADK | IBM 19-inch Flexible Post Rail Kit | 1
2305 | Rack Installation of 1U Component | 1

Data network switch

Table 18 lists the parts information for the data network switch.

Table 18 Data network switch

Part number | Description | Quantity
7309HC3 | IBM System Networking RackSwitch G8264 (Rear to Front) | 1
– | m, 10A/100-250V, C13 to IEC 320-C14 Rack Power Cable | 2
ADK | IBM 19-inch Flexible Post Rail Kit | 1
2305 | Rack Installation of 1U Component | 1

Rack

Table 19 lists the parts information for the rack.

Table 19 Rack

Part number | Description | Quantity
0RC | e1350 42U rack cabinet | 1
602 | DPI Single-phase 30A/208V C13 Enterprise PDU (US) | 4
2202 | Cluster 1350 Ship Group | 1
230 | Rack Assembly - 42U Rack | 1
230 | Cluster Hardware & Fabric Verification - 1st Rack | 1
27 | 1U black plastic filler panel | –

Cables

Table 20 lists the parts information for the cables.

Table 20 Cables

Part number | Description | Quantity
– | m Molex Direct Attach Copper SFP+ Cable | –
– | m Molex Direct Attach Copper SFP+ Cable | –
– | m Molex Direct Attach Copper SFP+ Cable | –
– | IntraRack CAT5E Cable Service | 2
InfoSphere BigInsights HBase predefined configuration bill of materials

This section provides ordering information for the InfoSphere BigInsights HBase predefined configuration bill of materials. This bill of materials is provided as a sample of a full rack configuration. It is intended as an example only. Actual configurations vary based on geographic region, cluster size, and specific client requirements.

Data node

Table 21 lists the parts information for 18 data nodes.

Table 21 Data node

Part number | Description | Quantity
5466 | IBM System x3650 M4 BD | 18
AT7 | PCIe Riser Card 2 (1 x8 LP for Slotless RAID) | 18
AT6 | PCIe Riser Card 1 for slot 1 (1 x8 FH/HL + 1 x8 LP Slots) | 18
A2ZQ | Mellanox ConnectX-3 EN Dual-port SFP+ 10GbE Adapter | 18
– | Select Storage devices; RAID configured by IBM is not required | 18
A22T | IBM 2TB 7.2K 6 Gbps NL SATA 3.5-inch G2HS HDD | 252
ARW | IBM System x 900W High Efficiency Platinum AC Power Supply | 18
AWC | System Documentation and Software, US English | 18
AS | Intel Xeon Processor E5-2650 v2 8C 2.6 GHz 20 MB Cache 1866 MHz 95W | 18
AAS | Additional Intel Xeon Processor E5-2650 v2 8C 2.6 GHz 20 MB Cache 1866 MHz 95W | 18
A3QG | 8 GB (1 x8 GB, 1Rx4, 1.5V) PC3-14900 CL13 ECC DDR3 1866 MHz LP RDIMM | 108
A3YY | N2215 SAS/SATA HBA for IBM System x | 18
ARQ | System x3650 M4 BD Planar | 18
ARG | System x3650 M4 BD Chassis ASM without Planar | 18
– | m, 10A/100-250V, C13 to IEC 320-C14 Rack Power Cable | 18
ARR | 3.5-inch Hot Swap BP Bracket Assembly, 12x | 18
ARS | 3.5-inch Hot Swap Cage Assembly, Rear, 2x | 18
– | Rack Installation >1U Component | 18
ARH | BIOS GBM | 18
ARJ | L COPT, 1U RISER CAGE - SLOT 2 | 18
ARK | L COPT, 1U BUTTERFLY RISER CAGE - SLOT 1 | 18
ARN | x3650 M4 BD Agency Label | 18
ARP | Label GBM | 18
A50F | 2x2 HDD BRACKET | 18
A207 | Rail Kit for x3650 M4 BD, x3630 M4, and x3530 M4 | 18
A2M3 | Shipping Bracket for x3650 M4 BD and x3630 M4 | 18

Management node

Table 22 lists the parts information for four management nodes.

Table 22 Management node

Part number | Description | Quantity
79FT | System x3550 M4 | 4
AH3 | System x3550 M4 2.5-inch Base Without Power Supply | 4
5977 | Select Storage devices; RAID configured by IBM is not required | 4
AMZ | ServeRAID M1115 SAS/SATA Controller for System x | 4
A22S | IBM 3TB 7.2K 6Gbps NL SATA 3.5" G2HS HDD | 8
AFD | IBM 3.5" Hot Swap Filler | 4
A228 | IBM System x Gen-III Slides Kit | 4
A229 | IBM System x Gen-III CMA | 4
AHH | x3550 M4 3.5" HS Assembly Kit | 4
AML | IBM Integrated Management Module Advanced Upgrade | 4
AH5 | System x 750W High Efficiency Platinum AC Power Supply | 4
AHL | System x3550 M4 PCIe Gen-III Riser Card 2 (1 x16 FH/HL Slot) | 4
A2ZQ | Mellanox ConnectX-3 EN Dual-port SFP+ 10GbE Adapter | 8
AHJ | System x3550 M4 PCIe Riser Card 1 (1 x16 LP Slot) | 4
AHP | System Documentation and Software, US English | 4
A3QL | 16 GB (1 x16 GB, 2Rx4, 1.5V) PC3-14900 CL13 ECC DDR3 1866 MHz LP RDIMM | 32
AH5 | System x 750W High Efficiency Platinum AC Power Supply | 4
– | m, 10A/100-250V, C13 to IEC 320-C14 Rack Power Cable | 8
A2U6 | IBM System x Advanced Lightpath Kit | 4
A3WR | Intel Xeon Processor E5-2650 v2 8C 2.6 GHz 20 MB Cache 1866 MHz 95W | 4
A3X9 | Additional Intel Xeon Processor E5-2650 v2 8C 2.6 GHz 20 MB Cache 1866 MHz 95W with Fan | 4
2305 | Rack Installation of 1U Component | 4
A3XM | System x3550 M4 Planar | 4
AHB | System x3550 M4 System Level Code | 4
AHD | System x3550 M4 Agency Label GBM | 4
Administration/management network switch

Table 23 lists the parts information for the administration/management network switch.

Table 23 Administration/management network switch

Part number | Description | Quantity
7309HC | IBM System Networking RackSwitch G8052 (Rear to Front) | 1
– | m, 10A/100-250V, C13 to IEC 320-C14 Rack Power Cable | 2
ADK | IBM 19-inch Flexible Post Rail Kit | 1
2305 | Rack Installation of 1U Component | 1

Data network switch

Table 24 lists the parts information for the data network switch.

Table 24 Data network switch

Part number | Description | Quantity
7309HC3 | IBM System Networking RackSwitch G8264 (Rear to Front) | 1
– | m, 10A/100-250V, C13 to IEC 320-C14 Rack Power Cable | 2
ADK | IBM 19-inch Flexible Post Rail Kit | 1
2305 | Rack Installation of 1U Component | 1

Rack

Table 25 lists the parts information for the rack.

Table 25 Rack

Part number | Description | Quantity
0RC | e1350 42U rack cabinet | 1
602 | DPI Single-phase 30A/208V C13 Enterprise PDU (US) | 4
2202 | Cluster 1350 Ship Group | 1
230 | Rack Assembly - 42U Rack | 1
230 | Cluster Hardware and Fabric Verification - 1st Rack | 1

Cables

Table 26 lists the parts information for the cables.

Table 26 Cables

Part number | Description | Quantity
– | m Molex Direct Attach Copper SFP+ Cable | 36
37 | m Molex Direct Attach Copper SFP+ Cable | 8
– | m Molex Direct Attach Copper SFP+ Cable | 10
– | IntraRack CAT5E Cable Service | –
References

For more information, see the following references:

- IBM General Parallel File System (GPFS)
  - IBM Internet
  - IBM Information Center: ibm.cluster.infocenter.doc/infocenter.html
- IBM InfoSphere BigInsights 2.1
  - IBM Internet
  - IBM Knowledge Center
- IBM Integrated Management Module II (IMM2) and Open Source xCAT
  - IBM IMM2 User's Guide: ftp://ftp.software.ibm.com/systems/support/system_x_pdf/88y7599.pdf
  - IMM and IMM2 Support on IBM System x and BladeCenter Servers, TIPS089
  - SourceForge xCAT Wiki
  - xCAT 2 Guide for the CSM System Administrator, REDP-37
  - IBM Support for xCAT: .../systems/software/xcat/support.html
- IBM Platform Computing
  - IBM Internet
  - IBM Platform Computing Integration Solutions, SG
  - Implementing IBM InfoSphere BigInsights on System x, SG
  - Integration of IBM Platform Symphony and IBM InfoSphere BigInsights, REDP
  - SWIM Benchmark: ...phony/highperfhadoop.html
- IBM RackSwitch G8052 (1GbE Switch)
  - IBM Internet
  - IBM System Networking RackSwitch G8052, TIPS083
- IBM RackSwitch G8264 (10GbE Switch)
  - IBM Internet
  - IBM System Networking RackSwitch G8264, TIPS085
- IBM RackSwitch G8316 (16-port 40GbE Switch)
  - IBM Internet
  - IBM System Networking RackSwitch G8316, TIPS082
- IBM RackSwitch G8332 (32-port 40GbE Switch)
  - IBM Internet
  - IBM System Networking RackSwitch G8332, TIPS39
- IBM System x3550 M4 (Management Node, Edge Node, Deployment Node)
  - IBM Internet
  - IBM System x3550 M4, TIPS085
- IBM System x3650 M4 BD (Data Node)
  - IBM Internet
  - IBM System x3650 M4 BD, TIPS02
- IBM System x Reference Architecture for Hadoop: InfoSphere BigInsights
  - IBM Internet
  - Implementing IBM InfoSphere BigInsights on System x, SG
Open source software
- Hadoop
- Avro
- Flume
- HBase
- Hive
- Lucene
- Oozie
- Pig
- ZooKeeper

Authors

This paper was produced by a team of specialists from around the world working at the IBM International Technical Support Organization (ITSO), Raleigh Center.

Steven Hurley is responsible for BigInsights on System x solution enablement for the Worldwide Big Data Systems Center and Technical Sales Readiness for IBM Systems and Technology Group analytics offerings. Within the Big Data Systems Center, Steven oversees the coordination of end-to-end InfoSphere BigInsights on System x hardware and software solutions and provides guidance to clients and sales teams regarding solution architecture and deployment services. Having over 17 years of experience within IT, Steve has held multiple technical and leadership roles in his career.

James C. Wang is an IBM Senior Certified Consulting IT Specialist who is working as a lead solution architect of the Worldwide Big Data Systems Center. He has 29 years of experience at IBM in server systems availability and performance. He has worked in various leadership roles in Smart Analytics technical sales support, the Worldwide Design Center for IT Optimization and Business Flexibility, the Very Large Database Competency Center, and the IBM pSeries Benchmark Center. James is responsible for leading the BDSC team, providing System x reference architectures for Big Data offerings, developing technical sales education and sales support material, and providing technical sales support.

Thanks to the following people for their contributions to this project:

David Watts
IBM ITSO, Raleigh Center
Bruce Brown, Big Data Sales Acceleration Architect
Benjamin Chang, Consulting IT Specialist, System x, Global Techline
Neeta Garimella, Big Data and Cloud Leader, GPFS Development
Belinda Harrison, Program Manager, Systems and Technology Group (STG), Big Data Systems Center
Yonggang Hu, Chief Architect, Application Middleware, IBM Platform Computing
Zane Hu, Architect, IBM Platform Symphony
Ray Perry, STG Lab Services
Gord Sissons, Product Marketing Manager, IBM Platform Computing
Scott Strattner, Network Architect, STG Poughkeepsie Benchmark Center
Stewart Tate, STSM, Information Management Performance Benchmarks and Solutions Development
Dave Willoughby, STSM, System x Optimized Solutions Hardware Architect
Joanna Wong, Executive IT Specialist

Now you can become a published author, too!

Here's an opportunity to spotlight your skills, grow your career, and become a published author, all at the same time! Join an ITSO residency project and help write a book in your area of expertise, while honing your experience using leading-edge technologies. Your efforts will help to increase product acceptance and customer satisfaction, as you expand your network of technical contacts and relationships. Residencies run from two to six weeks in length, and you can participate either in person or as a remote resident working from your home base.

Find out more about the residency program, browse the residency index, and apply online at:
ibm.com/redbooks/residencies.html
Stay connected to IBM Redbooks

- Find us on Facebook:
- Follow us on Twitter:
- Look for us on LinkedIn:
- Explore new IBM Redbooks publications, residencies, and workshops with the IBM Redbooks weekly newsletter:
- Stay current on recent Redbooks publications with RSS Feeds:
Notices

This information was developed for products and services offered in the U.S.A. IBM may not offer the products, services, or features discussed in this document in other countries. Consult your local IBM representative for information on the products and services currently available in your area. Any reference to an IBM product, program, or service is not intended to state or imply that only that IBM product, program, or service may be used. Any functionally equivalent product, program, or service that does not infringe any IBM intellectual property right may be used instead. However, it is the user's responsibility to evaluate and verify the operation of any non-IBM product, program, or service. IBM may have patents or pending patent applications covering subject matter described in this document. The furnishing of this document does not grant you any license to these patents. You can send license inquiries, in writing, to: IBM Director of Licensing, IBM Corporation, North Castle Drive, Armonk, NY U.S.A.

The following paragraph does not apply to the United Kingdom or any other country where such provisions are inconsistent with local law: INTERNATIONAL BUSINESS MACHINES CORPORATION PROVIDES THIS PUBLICATION "AS IS" WITHOUT WARRANTY OF ANY KIND, EITHER EXPRESS OR IMPLIED, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF NON-INFRINGEMENT, MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE. Some states do not allow disclaimer of express or implied warranties in certain transactions, therefore, this statement may not apply to you.

This information could include technical inaccuracies or typographical errors. Changes are periodically made to the information herein; these changes will be incorporated in new editions of the publication. IBM may make improvements and/or changes in the product(s) and/or the program(s) described in this publication at any time without notice. Any references in this information to non-IBM websites are provided for convenience only and do not in any manner serve as an endorsement of those websites. The materials at those websites are not part of the materials for this IBM product and use of those websites is at your own risk. IBM may use or distribute any of the information you supply in any way it believes appropriate without incurring any obligation to you.

Any performance data contained herein was determined in a controlled environment. Therefore, the results obtained in other operating environments may vary significantly. Some measurements may have been made on development-level systems and there is no guarantee that these measurements will be the same on generally available systems. Furthermore, some measurements may have been estimated through extrapolation. Actual results may vary. Users of this document should verify the applicable data for their specific environment.

Information concerning non-IBM products was obtained from the suppliers of those products, their published announcements or other publicly available sources. IBM has not tested those products and cannot confirm the accuracy of performance, compatibility or any other claims related to non-IBM products. Questions on the capabilities of non-IBM products should be addressed to the suppliers of those products.

This information contains examples of data and reports used in daily business operations. To illustrate them as completely as possible, the examples include the names of individuals, companies, brands, and products.
All of these names are fictitious and any similarity to the names and addresses used by an actual business enterprise is entirely coincidental.

COPYRIGHT LICENSE:

This information contains sample application programs in source language, which illustrate programming techniques on various operating platforms. You may copy, modify, and distribute these sample programs in any form without payment to IBM, for the purposes of developing, using, marketing or distributing application programs conforming to the application programming interface for the operating platform for which the sample programs are written. These examples have not been thoroughly tested under all conditions. IBM, therefore, cannot guarantee or imply reliability, serviceability, or function of these programs.

Copyright International Business Machines Corporation 2013, 2014. All rights reserved.

Note to U.S. Government Users Restricted Rights -- Use, duplication or disclosure restricted by GSA ADP Schedule Contract with IBM Corp.
This document REDP was created or updated on June 5, 2014.

Send us your comments in one of the following ways:
- Use the online Contact us review Redbooks form found at: ibm.com/redbooks
- Send your comments in an email to:
- Mail your comments to: IBM Corporation, International Technical Support Organization, Dept. HYTD, Mail Station P, South Road, Poughkeepsie, NY, U.S.A.

Redpaper

Trademarks

IBM, the IBM logo, and ibm.com are trademarks or registered trademarks of International Business Machines Corporation in the United States, other countries, or both. These and other IBM trademarked terms are marked on their first occurrence in this information with the appropriate symbol (® or ™), indicating US registered or common law trademarks owned by IBM at the time this information was published. Such trademarks may also be registered or common law trademarks in other countries. A current list of IBM trademarks is available on the Web.

The following terms are trademarks of the International Business Machines Corporation in the United States, other countries, or both:

BigInsights, BladeCenter, DataStage, GPFS, IBM, InfoSphere, pSeries, RackSwitch, Redbooks, Redpaper, Redbooks (logo), Symphony, System x

The following terms are trademarks of other companies:

Intel, Intel Xeon, Intel logo, Intel Inside logo, and Intel Centrino logo are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States and other countries.

Java, and all Java-based trademarks and logos are trademarks or registered trademarks of Oracle and/or its affiliates.

Other company, product, or service names may be trademarks or service marks of others.
Big Data Introduction
Big Data Introduction Ralf Lange Global ISV & OEM Sales 1 Copyright 2012, Oracle and/or its affiliates. All rights Conventional infrastructure 2 Copyright 2012, Oracle and/or its affiliates. All rights
CS2510 Computer Operating Systems
CS2510 Computer Operating Systems HADOOP Distributed File System Dr. Taieb Znati Computer Science Department University of Pittsburgh Outline HDF Design Issues HDFS Application Profile Block Abstraction
CS2510 Computer Operating Systems
CS2510 Computer Operating Systems HADOOP Distributed File System Dr. Taieb Znati Computer Science Department University of Pittsburgh Outline HDF Design Issues HDFS Application Profile Block Abstraction
Oracle s Big Data solutions. Roger Wullschleger. <Insert Picture Here>
s Big Data solutions Roger Wullschleger DBTA Workshop on Big Data, Cloud Data Management and NoSQL 10. October 2012, Stade de Suisse, Berne 1 The following is intended to outline
OnX Big Data Reference Architecture
OnX Big Data Reference Architecture Knowledge is Power when it comes to Business Strategy The business landscape of decision-making is converging during a period in which: > Data is considered by most
IBM System x reference architecture solutions for big data
IBM System x reference architecture solutions for big data Easy-to-implement hardware, software and services for analyzing data at rest and data in motion Highlights Accelerates time-to-value with scalable,
BigData. An Overview of Several Approaches. David Mera 16/12/2013. Masaryk University Brno, Czech Republic
BigData An Overview of Several Approaches David Mera Masaryk University Brno, Czech Republic 16/12/2013 Table of Contents 1 Introduction 2 Terminology 3 Approaches focused on batch data processing MapReduce-Hadoop
Getting Started with Hadoop. Raanan Dagan Paul Tibaldi
Getting Started with Hadoop Raanan Dagan Paul Tibaldi What is Apache Hadoop? Hadoop is a platform for data storage and processing that is Scalable Fault tolerant Open source CORE HADOOP COMPONENTS Hadoop
Big Data Storage Options for Hadoop Sam Fineberg, HP Storage
Sam Fineberg, HP Storage SNIA Legal Notice The material contained in this tutorial is copyrighted by the SNIA unless otherwise noted. Member companies and individual members may use this material in presentations
!"#$%&' ( )%#*'+,'-#.//"0( !"#$"%&'()*$+()',!-+.'/', 4(5,67,!-+!"89,:*$;'0+$.<.,&0$'09,&)"/=+,!()<>'0, 3, Processing LARGE data sets
!"#$%&' ( Processing LARGE data sets )%#*'+,'-#.//"0( Framework for o! reliable o! scalable o! distributed computation of large data sets 4(5,67,!-+!"89,:*$;'0+$.
Dell Cloudera Solution Reference Architecture v2.1.0
Dell Cloudera Solution Reference Architecture v2.1.0 A Dell Reference Architecture Guide November 2012 Next Generation Cloud Solutions Table of Contents Tables 3 Figures 4 Overview 5 Summary 5 Abbreviations
Deploying Cloudera CDH (Cloudera Distribution Including Apache Hadoop) with Emulex OneConnect OCe14000 Network Adapters
Deploying Cloudera CDH (Cloudera Distribution Including Apache Hadoop) with Emulex OneConnect OCe14000 Network Adapters Table of Contents Introduction... Hardware requirements... Recommended Hadoop cluster
How To Write An Article On An Hp Appsystem For Spera Hana
Technical white paper HP AppSystem for SAP HANA Distributed architecture with 3PAR StoreServ 7400 storage Table of contents Executive summary... 2 Introduction... 2 Appliance components... 3 3PAR StoreServ
Storage Architectures for Big Data in the Cloud
Storage Architectures for Big Data in the Cloud Sam Fineberg HP Storage CT Office/ May 2013 Overview Introduction What is big data? Big Data I/O Hadoop/HDFS SAN Distributed FS Cloud Summary Research Areas
Big Data With Hadoop
With Saurabh Singh [email protected] The Ohio State University February 11, 2016 Overview 1 2 3 Requirements Ecosystem Resilient Distributed Datasets (RDDs) Example Code vs Mapreduce 4 5 Source: [Tutorials
INTRODUCTION TO APACHE HADOOP MATTHIAS BRÄGER CERN GS-ASE
INTRODUCTION TO APACHE HADOOP MATTHIAS BRÄGER CERN GS-ASE AGENDA Introduction to Big Data Introduction to Hadoop HDFS file system Map/Reduce framework Hadoop utilities Summary BIG DATA FACTS In what timeframe
Data-Intensive Computing with Map-Reduce and Hadoop
Data-Intensive Computing with Map-Reduce and Hadoop Shamil Humbetov Department of Computer Engineering Qafqaz University Baku, Azerbaijan [email protected] Abstract Every day, we create 2.5 quintillion
Cloud Storage. Parallels. Performance Benchmark Results. White Paper. www.parallels.com
Parallels Cloud Storage White Paper Performance Benchmark Results www.parallels.com Table of Contents Executive Summary... 3 Architecture Overview... 3 Key Features... 4 No Special Hardware Requirements...
Hadoop Hardware @Twitter: Size does matter. @joep and @eecraft Hadoop Summit 2013
Hadoop Hardware : Size does matter. @joep and @eecraft Hadoop Summit 2013 v2.3 About us Joep Rottinghuis Software Engineer @ Twitter Engineering Manager Hadoop/HBase team @ Twitter Follow me @joep Jay
GraySort and MinuteSort at Yahoo on Hadoop 0.23
GraySort and at Yahoo on Hadoop.23 Thomas Graves Yahoo! May, 213 The Apache Hadoop[1] software library is an open source framework that allows for the distributed processing of large data sets across clusters
HADOOP MOCK TEST HADOOP MOCK TEST I
http://www.tutorialspoint.com HADOOP MOCK TEST Copyright tutorialspoint.com This section presents you various set of Mock Tests related to Hadoop Framework. You can download these sample mock tests at
HP Reference Architecture for Hortonworks Data Platform on HP ProLiant SL4540 Gen8 Server
Technical white paper HP Reference Architecture for Hortonworks Data Platform on HP Server HP Converged Infrastructure with the Hortonworks Data Platform for Apache Hadoop Table of contents Executive summary...
Introduction to Hadoop HDFS and Ecosystems. Slides credits: Cloudera Academic Partners Program & Prof. De Liu, MSBA 6330 Harvesting Big Data
Introduction to Hadoop HDFS and Ecosystems ANSHUL MITTAL Slides credits: Cloudera Academic Partners Program & Prof. De Liu, MSBA 6330 Harvesting Big Data Topics The goal of this presentation is to give
Implement Hadoop jobs to extract business value from large and varied data sets
Hadoop Development for Big Data Solutions: Hands-On You Will Learn How To: Implement Hadoop jobs to extract business value from large and varied data sets Write, customize and deploy MapReduce jobs to
White Paper. Cisco and Greenplum Partner to Deliver High-Performance Hadoop Reference Configurations
White Paper Cisco and Greenplum Partner to Deliver High-Performance Hadoop Reference Configurations Contents Next-Generation Hadoop Solution... 3 Greenplum MR: Hadoop Reengineered... 3 : The Exclusive
HP Reference Architecture for Cloudera Enterprise
Technical white paper HP Reference Architecture for Cloudera Enterprise HP Converged Infrastructure with Cloudera Enterprise for Apache Hadoop Table of contents Executive summary 2 Cloudera Enterprise
Intro to Map/Reduce a.k.a. Hadoop
Intro to Map/Reduce a.k.a. Hadoop Based on: Mining of Massive Datasets by Ra jaraman and Ullman, Cambridge University Press, 2011 Data Mining for the masses by North, Global Text Project, 2012 Slides by
Virtualizing Apache Hadoop. June, 2012
June, 2012 Table of Contents EXECUTIVE SUMMARY... 3 INTRODUCTION... 3 VIRTUALIZING APACHE HADOOP... 4 INTRODUCTION TO VSPHERE TM... 4 USE CASES AND ADVANTAGES OF VIRTUALIZING HADOOP... 4 MYTHS ABOUT RUNNING
Hadoop and ecosystem * 本 文 中 的 言 论 仅 代 表 作 者 个 人 观 点 * 本 文 中 的 一 些 图 例 来 自 于 互 联 网. Information Management. Information Management IBM CDL Lab
IBM CDL Lab Hadoop and ecosystem * 本 文 中 的 言 论 仅 代 表 作 者 个 人 观 点 * 本 文 中 的 一 些 图 例 来 自 于 互 联 网 Information Management 2012 IBM Corporation Agenda Hadoop 技 术 Hadoop 概 述 Hadoop 1.x Hadoop 2.x Hadoop 生 态
HP Reference Architecture for Cloudera Enterprise on ProLiant DL Servers
Technical white paper HP Reference Architecture for Cloudera Enterprise on ProLiant DL Servers HP Converged Infrastructure with Cloudera Enterprise for Apache Hadoop Table of contents Executive summary...
docs.hortonworks.com
docs.hortonworks.com Hortonworks Data Platform : Cluster Planning Guide Copyright 2012-2014 Hortonworks, Inc. Some rights reserved. The Hortonworks Data Platform, powered by Apache Hadoop, is a massively
BIG DATA TRENDS AND TECHNOLOGIES
BIG DATA TRENDS AND TECHNOLOGIES THE WORLD OF DATA IS CHANGING Cloud WHAT IS BIG DATA? Big data are datasets that grow so large that they become awkward to work with using onhand database management tools.
Chase Wu New Jersey Ins0tute of Technology
CS 698: Special Topics in Big Data Chapter 4. Big Data Analytics Platforms Chase Wu New Jersey Ins0tute of Technology Some of the slides have been provided through the courtesy of Dr. Ching-Yung Lin at
