OnX Big Data Reference Architecture
Knowledge is Power when it comes to Business Strategy

The landscape of business decision-making is shifting during a period in which:

> Data is considered by most to be an organization's most important asset.
> Data volume is increasing at a remarkable rate, with total volume expected to reach 35 zettabytes by 2020, a figure 44 times the total volume of data handled in 2009.
> Unstructured and non-traditional data sources, such as customer behavior in social media networks and data from connected machines, are contributing the majority of this data growth.

Business Analytics, the science of turning this data into usable information, is playing a major role in supporting business decisions. In successful organizations, analytics are no longer controlled by technical teams. Rather, business users are driving analytics, with technical teams focusing primarily on timely data delivery and the underlying infrastructure to support the analytics.

The Concept of Big Data

There is no standard definition for Big Data. Typically, an organization considers a data volume "Big" if it is bigger than anything that has been historically managed within that organization. Data is growing rapidly for every industry, business and customer, and traditional infrastructure and data management architecture cannot meet the demands of today's data growth. In addition, most of the incoming data arrives in an unstructured format from a wide range of sources, including customer behavior on the web, social networks, geographic location, syndicated data, government data and machine data. Organizations must get a handle on managing the data volume and on understanding and leveraging all of the data to make informed, timely business decisions, or risk falling behind.
Big Data Characteristics

Big Data can be characterized by 4 V's: Volume, Variety, Velocity and Veracity.

> Volume: As discussed earlier, data volume is growing at an exponential rate and will continue to do so. Technology must be in place to enable business users to make use of this surge in data volume.

> Variety: Also as discussed earlier, data no longer arrives solely from internal applications with well-known, rigid structures. It is now generated from a variety of sources, including unstructured ones, and business users must make sense of this data to enhance their decision making. There are two aspects of Variety to consider: Syntax and Semantics. In the past, these have determined the extent to which data could be reliably Extracted, Transformed, and Loaded (ETL) into a database and then analyzed. While modern ETL tools are very capable of dealing with data that arrives in virtually any syntax, they are less capable of dealing with semantically rich data, such as free text. Because of this, most organizations have been restricted to analyzing a narrower range of data. The ability to include data of all syntaxes and semantics is perhaps one of the major appeals of Big Data technology.

> Velocity: Data increasingly has to be received and acted upon in near real time. Delays in execution will inevitably limit the effectiveness of campaigns, limit interventions or lead to sub-optimal processes. For example, a discount offer to a customer based on their web browsing at a public terminal will not be successful once they log out of that terminal.

> Veracity: Much Big Data is unstructured and arrives from various sources, some of questionable reliability. If you can't trust the data, you have a veracity problem. Data needs to be clean and intact in order to be accurately leveraged.
Sometimes it's important to supplement the incoming data with additional information gathered from other sources, to get a complete and accurate understanding of the data rather than depending on incoming data alone.
Big Data Technology and Infrastructure

Big Data: The Rebel with a (MapReduce) Cause

Big Data architecture doesn't follow some of the basic principles of data architecture and data management. One of the fundamental principles of data architecture involves understanding how data will be used before it's captured. With Big Data technologies, business users are often interested in capturing data at the point of occurrence, with the expectation that its utility will be understood at a future point in time.

Also, data is no longer brought to a single processing engine and kept in one place. Big Data technology breaks the entire workload down into small, manageable chunks and spreads them across multiple servers, with the result sets accumulated afterwards to provide an answer. This operation is called Hadoop MapReduce: the first step Maps a job to all servers, and a later step Reduces the result set to a desired number of targets.

Hadoop provides significant improvements in Velocity (time-to-market) and Veracity (reliability) compared to traditional data architecture. By default, data is replicated three times across the platform, so that there is minimal potential for data loss.

The Hadoop Alley-Oop: Cost Benefits

Big Data technologies provide a very appealing cost advantage over traditional data infrastructure. Hadoop, the open-source software framework for large-scale data processing, costs merely hundreds of dollars per TB of data, compared to thousands of dollars per TB in traditional storage architecture. Because its processing power is spread across nodes that each handle smaller chunks of data on less costly commodity platforms, Hadoop technology provides exceptional value to both business and IT.
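The map-then-reduce flow described in this section can be illustrated with a small, single-process sketch. This is not Hadoop code and does not use any Hadoop API; the function names (map_phase, shuffle, reduce_phase, word_count) and the sample chunks are hypothetical, chosen only to show how a workload is split into chunks, processed independently, and then combined into one result.

```python
from collections import defaultdict

def map_phase(chunk):
    """Map step: emit a (word, 1) pair for every word in one chunk."""
    return [(word.lower(), 1) for word in chunk.split()]

def shuffle(mapped_chunks):
    """Group the emitted values by key across all chunks."""
    groups = defaultdict(list)
    for pairs in mapped_chunks:
        for key, value in pairs:
            groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reduce step: collapse each key's values into a single total."""
    return {key: sum(values) for key, values in groups.items()}

def word_count(chunks):
    # On a real cluster each chunk would be handled by a different server;
    # here the chunks are simply processed one after another.
    mapped = [map_phase(chunk) for chunk in chunks]
    return reduce_phase(shuffle(mapped))

# Hypothetical input, already split into three chunks.
chunks = ["big data big value", "data at scale", "big clusters"]
counts = word_count(chunks)
print(counts["big"], counts["data"])  # prints "3 2"
```

Because each chunk is mapped independently, the map step can run on as many servers as there are chunks, which is the property the section attributes to Hadoop MapReduce.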
Storage type     Cost per gigabyte   $1M gets        IOPS         Throughput
SAN Storage      $2-$10              0.5 petabytes   200,000      1 GB/sec
SAN Storage      $1-$5               1 petabyte      400,000      2 GB/sec
Local Storage    $0.05               20 petabytes    10,000,000   800 GB/sec
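The capacity figures in this comparison follow from dividing the budget by the cost per gigabyte. The sketch below reproduces that arithmetic, assuming the low end of each quoted price range and decimal units (1 petabyte = 1,000,000 gigabytes); the function name capacity_pb and the constants are illustrative and not taken from the source.

```python
BUDGET_USD = 1_000_000
GB_PER_PB = 1_000_000  # assumption: decimal units, 1 PB = 1,000,000 GB

def capacity_pb(budget_usd, usd_per_gb):
    """Return how many petabytes a budget buys at a given price per GB."""
    return budget_usd / usd_per_gb / GB_PER_PB

# Low end of each price range quoted in the comparison.
prices = [
    ("SAN storage at $2/GB", 2.00),
    ("SAN storage at $1/GB", 1.00),
    ("Local storage at $0.05/GB", 0.05),
]

for label, usd_per_gb in prices:
    print(f"{label}: {capacity_pb(BUDGET_USD, usd_per_gb):.1f} PB")
```

At the high end of the SAN price ranges ($10 and $5 per gigabyte), the same budget would buy proportionally less capacity.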
Reference Architecture

OnX has existing partnerships with most major names in the industry and has built and delivered Hadoop platforms on various infrastructure architectures, including IBM, HP and Oracle, among others. With the growing number of vendors in the Big Data space and increasing industry acceptance of the Hadoop platform, it was important for OnX to focus on a Big Data reference architecture.

The OnX Big Data reference architecture is built on Cisco UCS servers, with StackIQ as the cluster manager and the MapR Hadoop distribution. Using this reference architecture, OnX implemented a Hadoop cluster for a major financial customer to support its customer loyalty program and product recommendations. The cluster was scaled up to over 1,200 servers to meet the customer's requirements, and the customer saved millions of dollars in infrastructure costs through the implementation.

Reference Architecture Details

The OnX Big Data Reference Architecture has been configured to support two types of business requirements: High Performance and High Storage Capacity.
Hadoop Distribution with MapR

The Apache Hadoop solution delivered by MapR Technologies introduces a completely new way of handling big data. Unlike traditional databases that store structured data, Hadoop enables the distribution and analysis of both structured and unstructured data on a single data infrastructure.

MapR has the strongest foundation of the available Hadoop distributions. Although it supports Apache Hadoop standards and the underlying technology, it has significant performance advantages over other distributions. The fundamental difference of the MapR file system is shown below.

[Figure: storage stack comparison. Other Distributions: HBase, JVM, DFS, JVM, ext3, Disks. MapR M5 Edition: HBase, JVM, MapR, Disks. Unified edition: Unified, Disks.]

MapR has now unified tables and files into a single data platform, so there is no separate HBase infrastructure. The environment is much simpler to manage because the various redundant components are eliminated. A uniform data management layer across files and tables provides a consistent data protection layer. Additionally, recovery from node failures takes seconds, there is 100% data locality, and HBase can read directly from snapshots. Furthermore, files and tables share the same namespace, volumes, and directories.
The underlying clusters are managed by StackIQ, which is based on the Rocks open-source cluster management software. It provides a fully integrated Big Data platform that is extremely easy to manage. The StackIQ Hadoop provisioning tool installs and configures all the required software and services, taking bare metal to a working Hadoop cluster. The installation can be completed in minutes, even for large Hadoop clusters involving hundreds of nodes. The cluster management software provides a GUI for managing clusters, including adding and removing nodes, upgrading individual nodes, and applying patches to an entire cluster or to individual nodes.

[Figure: The Complete Big Data Management Platform. Components: MapReduce, MapR-FS, Monitoring, Network Mgmt, Disk Mgmt, Operating System, UCS Manager. Stages: Configure, Deploy, Manage, Scale.]
Summary

The reference architecture enables organizations to implement a starter package for Big Data, with a highly flexible architecture that can later be scaled to meet an organization's data and analytics needs. Both the High Performance and High Capacity setups can start with fewer servers, as Small (4-node) or Medium (8-node) packages, and be scaled up to the full 16-node configuration later when there is demand for additional storage and/or performance.

                   Small                        Medium                       Large
High Performance   4 nodes UCS C240 (2.9 GHz)   8 nodes UCS C240 (2.9 GHz)   16 nodes UCS C240 (2.9 GHz)
                   21 TB per node               21 TB per node               21 TB per node
                   84 TB total capacity         168 TB total capacity        336 TB total capacity
High Capacity      4 nodes UCS C240 (2.6 GHz)   8 nodes UCS C240 (2.6 GHz)   16 nodes UCS C240 (2.6 GHz)
                   36 TB per node               36 TB per node               36 TB per node
                   144 TB total capacity        288 TB total capacity        576 TB total capacity

Ready to learn more? Contact your local OnX Account Executive, or call 1-866-906-4669. www.onx.com