Data Mining in the Swamp
Taming Unruly Data with Cloud Computing
By John Brothers

Business Intelligence is all about making better decisions from the data you have. However, all too often, the data you have is difficult to process with typical BI tools. These failures generally come in two areas:

1. The data is too voluminous to be properly digested by your BI system.
2. The data records are messy, inconsistent and difficult to join together.

Each of these problems is commonplace, and relatively easy to solve in isolation. High-volume datasets can be mastered by simply (if expensively) throwing more hardware and software at the problem: larger servers, cluster licenses, faster networks, bigger memories, faster disks, etc. Messy data can be cleansed with the appropriate use of script and SQL logic to make records consistent and well defined.

But what do you do when datasets are both large and messy? As we have learned from a recent project with a large financial institution, large amounts of data that are difficult to correlate can bring down even a state-of-the-art BI system. But large volumes of messy data are a fact of life; indeed, they are probably the bulk of the data in the enterprise. Add to that the complexity and cost of trying to tame it, and the business case for analyzing it is overwhelmed.

Enter the Cloud

One of the primary concepts in cloud computing is low-cost scalability: systems that can grow to handle larger volumes of users and data by adding more low-cost hardware. Google's entire infrastructure is built on this approach of distributing work out to thousands of inexpensive servers, instead of relying on centralized "supercomputers" to provide the horsepower.

The scalability strategy that Google uses is called MapReduce. The MapReduce model provides a conceptual framework for dividing work up into small, manageable sets that can be distributed across 1 or 10 or 100 or 1,000 or more servers, which can all work in parallel. This technology can be used with BI to meet the challenge of large-scale, messy data, but you can't use Google's infrastructure to run your own MapReduce system. Luckily, there's Hadoop, an open-source implementation of the Google MapReduce system. Even though it's technically still in Beta, Hadoop is in use at many large organizations, including:

- Amazon
- Yahoo
- Facebook
- Adobe
- The New York Times
- AOL
- Twitter
- Rackspace
Introducing Hadoop

In 2004, Google published papers describing their Google File System and MapReduce algorithms. Doug Cutting, a Yahoo employee and open-source evangelist, partnered with a friend to create Hadoop[1], an open-source implementation of GFS and MapReduce. In essence, Hadoop was a software system that could handle arbitrarily large amounts of data using a distributed file system, and distribute that data to be worked on by an arbitrary number of workers, using MapReduce. Adding more storage or more workers is simply a matter of connecting new machines to the network; there is no need for larger devices, specialized disks or specialized networking.

The two main parts of Hadoop are:

1. HDFS
2. MapReduce

[Figure: the Name Node, the Hadoop Distributed File System, and the Map/Reduce workers]

[1] Named after his son's stuffed elephant.
HDFS

HDFS (the Hadoop Distributed File System) is a system for managing files that runs "on top of" standard computers and standard operating systems. When a file is loaded into HDFS, the master Name Node invisibly breaks it into large chunks and stores them in multiple places (for redundancy) on the native file systems of the computers in the cluster. There is no requirement that the disks be the same size, or that any computer match the others in the cluster. When a file is retrieved from HDFS, the Name Node fetches the chunks from the appropriate machines, reassembles them and delivers the file to the caller[2].

[Figure: a data file and the Name Node - the data file is broken up into 64MB chunks, and the chunks are replicated 3 times and scattered amongst the workers]

[2] This is a simplification of the actual process, which is a lot more technically sophisticated.
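To make this concrete, here is a minimal sketch of loading a file into HDFS from Java using Hadoop's FileSystem API. The Name Node address and the file paths are illustrative assumptions, not details from this paper; in practice the same operation is often done from the command line (for example, hadoop fs -put).

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsLoadExample {
    public static void main(String[] args) throws Exception {
        // Point the client at the cluster's Name Node (address is a placeholder)
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode.example.com:8020");

        FileSystem fs = FileSystem.get(conf);

        // Copy a local file into HDFS; behind the scenes the Name Node decides
        // how the large chunks are placed and replicated across the workers
        fs.copyFromLocalFile(new Path("/local/data/accounts.csv"),
                             new Path("/data/raw/accounts.csv"));

        fs.close();
    }
}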
MapReduce

MapReduce is a three-step process that provides a structure for analyzing data and manipulating it in a scalable way. The three steps are:

1. Map
2. Shuffle/Sort
3. Reduce

Map

Raw data is translated, standardized and manipulated, usually in fairly "lightweight" ways. The output of the Map step is a key-value pair: the key is some sort of unique or nearly unique[3] identifier, and the value is whatever data is needed later (in the Reduce step).

Shuffle/Sort

The "secret sauce" of Hadoop is the distributed sort, where the records are all sorted by key (using either a default alphabetical sort or another comparator of your choice). Once records are sorted by key, all the records with the same key are sent to the same Reduce processor; essentially, this is a way to group data intelligently.

Reduce

In the last step, these groups of records with the same keys are handed one by one to the Reduce task. Sometimes the Reduce step will cache all of the records so it can operate on the entire group at one time (often to perform aggregations). Other translations and manipulations may occur here; for example, the data might be output in a format that's easier to import into a database. Finally, the resulting data is written back out to HDFS, and the job is done.

[Figure: Map - Map workers process the local chunks of the data file(s) and create sortable keys for the records so the records can be grouped.]

[3] In many cases, you want the same key to be used for multiple records, for grouping purposes.
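As a concrete illustration of the Map step described above, here is a minimal mapper sketch using Hadoop's Java API. It assumes comma-separated account records with the account id in the first field and an amount in the third; the class name, field positions and formats are assumptions for illustration, not taken from the paper.

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Map step sketch: lightly clean each raw record and emit a key-value pair.
// The key (a standardized account id) is what the shuffle/sort will group on;
// the value carries only what the Reduce step will need later.
public class AccountMapper extends Mapper<LongWritable, Text, Text, Text> {
    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        String[] fields = line.toString().split(",");
        if (fields.length < 3) {
            return; // skip records that are too malformed to use
        }
        String accountId = fields[0].trim().toUpperCase(); // standardize the key
        String amount = fields[2].trim();                  // keep only what Reduce needs
        context.write(new Text(accountId), new Text(amount));
    }
}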
[Figure: Shuffle/Sort - the shuffle/sort groups the records by key and redistributes them amongst the workers, so they can be reduced.]

[Figure: Reduce - the Reduce step processes the new local record groups, then outputs the reduced records to HDFS.]
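Continuing the illustration, here is a minimal Reduce sketch: after the shuffle/sort, every value that shares a key arrives at the same Reduce task, so a per-key aggregate can be computed in one pass and written back to HDFS. The class name and the summing logic are assumptions for illustration.

import java.io.IOException;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Reduce step sketch: every value emitted by the mappers for one account id
// is handed to this method together, so the group can be aggregated at once.
public class AccountSumReducer extends Reducer<Text, Text, Text, DoubleWritable> {
    @Override
    protected void reduce(Text accountId, Iterable<Text> amounts, Context context)
            throws IOException, InterruptedException {
        double total = 0.0;
        for (Text amount : amounts) {
            try {
                total += Double.parseDouble(amount.toString());
            } catch (NumberFormatException e) {
                // messy data: ignore values that still fail to parse
            }
        }
        // Written back out to HDFS in a simple, SQL-friendly delimited form
        context.write(accountId, new DoubleWritable(total));
    }
}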
How Does This Help BI?

Imagine you have data where some of the dimensions are well defined, but others change over time in non-trivial ways. Imagine that the data is in many different places, in many different formats, and you want to create a "holistic" view of it. Last, but not least, imagine that the size of the overall dataset is so large that it will swamp the capabilities of your BI tool. How do you solve this problem?

The Traditional Way

(Take a deep breath.) You can write translators for the different datasets, but if the dataset is large, those translators will take a long time to run. So you consider manually splitting these large datasets into smaller sets; but then, of course, you have to get the data onto all of the computers, get the scripts running, and make sure each computer has enough storage for its subset of the data. Once you clean the data, you have to rejoin the cleansed data from the multiple machines, and if you need to sort the data to help you aggregate it, you're going to have to find a sort solution that works on the huge volumes of data you're dealing with. Odds are, you'll have to sort smaller subsets of the overall dataset and then find ways to merge the subsets back together. Then you still haven't dealt with the fact that you need to aggregate it, so you have to write more scripts, divide the data into subsets again, and make sure all the records that belong in a group land on the same machine.

Every step in the above scenario is error-prone, complex, difficult to predict and, if you have to do this process on a regular basis, probably maddeningly tedious.

Hadoop to the Rescue

Instead, consider this option:

1. Load all the datasets into HDFS. Hadoop will take care of how to partition the data and where to put it, and it also handles redundancy. (In other words, you don't need to use RAID in your Hadoop solution.)
2. Write Map jobs that will take the data from each of the formats, clean it, and organize it into a general format.
3. Specify how to sort the data to properly group it.
4. Write Reduce jobs to aggregate the data, averaging and summing columns as needed, and then outputting the final aggregated data in a SQL-friendly format.
5. Run the Map/Sort/Reduce process on the data.
6. At the end of this process, pull the aggregated data out of HDFS and load it into your BI system.

You'll need to perform some quality checking on the output, but if the steps have been done properly, you have a repeatable, scalable process for generating aggregated data, with minimal manual intervention. A minimal sketch of a job driver that wires these steps together appears after the list below.

Some current uses of Hadoop:

- Product search index generation
- Data aggregation & rollup
- Data mining for ad targeting
- ETL
- Analyzing & storing logs
- Data analytics
- RDF indexing
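Here is the job-driver sketch referred to above, wiring the hypothetical AccountMapper and AccountSumReducer from the earlier examples into one Map/Sort/Reduce run (step 5 of the list). The input and output paths, class names and the Hadoop 2.x-style Job.getInstance call are assumptions for illustration.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Driver sketch: configure the job, point it at the raw data already loaded
// into HDFS, and let Hadoop handle the Map, Shuffle/Sort and Reduce phases.
public class AccountRollupJob {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "account rollup");
        job.setJarByClass(AccountRollupJob.class);

        job.setMapperClass(AccountMapper.class);
        job.setReducerClass(AccountSumReducer.class);

        // The mappers emit Text/Text; the reducers emit Text/DoubleWritable
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(Text.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(DoubleWritable.class);

        // Cleaned input lives in HDFS; the aggregated output goes back to HDFS,
        // ready to be exported into the BI system
        FileInputFormat.addInputPath(job, new Path("/data/raw"));
        FileOutputFormat.setOutputPath(job, new Path("/data/aggregated"));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}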
Where Hadoop Doesn't Fit

Hadoop is a tool like any other, and it is not applicable to every problem. Some areas where Hadoop is not the right solution:

Highly Interdependent Data

Hadoop is not well suited for data where each record is heavily dependent on many other records. For example, consider weather forecasting: predicting what's going to happen to a storm front over time requires a view of pretty much the entire dataset at once. This is the realm of supercomputers.

Ad Hoc, "Casual" BI

Hadoop provides a framework for data analysis, but it usually requires a fairly sophisticated user to create queries and aggregations. And, of course, there's little to no support for visualizations. Hadoop's SQL-related sub-projects, such as Hive and Pig, help mitigate some of this, but casual reporting is still somewhat difficult.

Real-time Processing

Hadoop is designed to trade off startup speed for scalability and parallelism. In other words, Hadoop is more like a locomotive than a sports car: it takes a fairly long time to get everything set up, but once it's moving, it does a lot of work.

Dependencies on Other Systems

If you set up a 10,000-node Hadoop cluster to process a huge dataset, and one of the steps involves a query to a database on an old computer in a dusty corner of the datacenter, that database server is going to be a bottleneck for the entire cluster. Hadoop jobs work best when they have few (or, best case, no) dependencies on external systems. There are various tricks and strategies that can mitigate this problem, but in general, remove as much external dependency as possible from your Hadoop jobs.

Conclusions

In terms of Business Intelligence, Hadoop is a tool that makes the ETL process easier and can bring the size and quality of data under control. It provides low-cost scalability, a reliable and easily expandable file system, and a framework for dramatically increasing the scope and robustness of your data mining and data analysis strategies. In situations where your data is too large, too messy or both, Hadoop can help you get it under control so you can focus on your business, instead of on one-off IT infrastructure data analysis projects.

Want to Learn More?

Key websites:
- Powered by Hadoop

Books:
- Hadoop: The Definitive Guide (O'Reilly)
- Hadoop in Action (Manning)
Case Study

MATRIX provided technical and architectural support for a large financial institution that was attempting to reconcile data between multiple bank accounts, representing seven (7) years' worth of account history.

The Scenario

The system had more than 50 servers, more than 80 cores, and contained over two (2) petabytes of accessible HDFS storage. Multiple Hadoop Map/Reduce jobs were run in series to manipulate the data:

- Cleanse and align the different account formats
- Add additional information to each record
- Reconcile duplicate accounts over time using a fuzzy-logic subsystem
- Aggregate data from the accounts on a month-by-month basis

The Results

Without any optimization, the system was capable of processing one (1) month of data in approximately 30 minutes. Including data loading and testing, the full run took approximately ten (10) days of computing time, over the course of several weeks.

About the Author

John Brothers is a veteran software architect/developer with 18 years of professional software development experience in multiple industries, including high-energy physics, telephony, Internet, transaction management, health care and data visualization. He has worked for large and small organizations in a number of roles, including developer, sales engineer, architect, director of development and CTO. He holds two patents in the area of visitor-based networking. John is an experienced "hands-on" agile coach and engineer, with a strong background in integrating agile tools for CI, automated testing and more. He is an expert in development with Java/J2EE, Ruby on Rails, Groovy, Grails and Flex.

About MATRIX

MATRIX is a leading full-service IT staffing and professional services firm, providing top-quality IT candidates to fill both contract consulting and permanent positions, as well as professional services engagements. Privately held, MATRIX had revenues of $165 million last year. Headquartered in Atlanta, we have offices nationwide with more than 200 internal employees and 1,400 staff contract consultants. In 2008, MATRIX was named one of the Best 50 Small and Medium Companies to Work for in America.
