September 10-13, 2012, Orlando, Florida
Another Buzz Word - Hadoop! Or Is That Something a Regular Person Can Use?
Learning Points
- What is Hadoop?
- What is MapReduce?
- Ideas on when to use it
Hadoop
- Hadoop and Big Data are buzz words, and as such are used by many people for many reasons
- Hadoop is actually three things in one:
  - A clustered file system with fault tolerance
  - A MapReduce execution engine
  - An infrastructure for parallel execution on clusters
- Apache Hadoop is the first system that makes massively parallel computing affordable for average companies
- What do you need a supercomputer for?
Atlas Experiment at the LHC
- The LHC ring has a perimeter of 26,659 m
- The speed of light is roughly 300,000,000 m/s, hence one particle bunch revolves about 11,000 times per second
- With 2,800 bunches simultaneously in the ring, this yields up to 40 million collisions per second
Atlas Experiment at the LHC
- The result is 40,000,000 measurements per second
- Atlas has 150,000,000 electric connectors
- 40 * 10^6 * 150 * 10^6 = 6 * 10^15 values per second (note: 1 TB is only 10^12 bytes)
- How to deal with these volumes? You map the sensor readouts to particle movement vectors: "2 particles moved through the following sensors" (MapReduce-like logic built into hardware for the level-1 trigger)
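As a sanity check, the slide's arithmetic can be reproduced in a few lines of plain Java. This is only a back-of-the-envelope sketch using the slide's rounded figures, not CERN's actual numbers:

```java
public class LhcBackOfEnvelope {
    // One bunch travels at (almost) the speed of light around the ring,
    // so revolutions per second = speed / ring perimeter
    static long revolutionsPerSecond() {
        long lightSpeed = 300_000_000L; // m/s, rounded as on the slide
        long ringPerimeter = 26_659L;   // m
        return lightSpeed / ringPerimeter; // about 11,000 revolutions per second
    }

    // 40 million collisions/s times 150 million connectors = 6 * 10^15 values/s
    static double valuesPerSecond() {
        double collisionsPerSecond = 40e6;
        double connectors = 150e6;
        return collisionsPerSecond * connectors;
    }
}
```

At 6 * 10^15 values per second, even one byte per value would be 6,000 TB per second, which is why the level-1 trigger reduces the data in hardware before anything is stored.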
Atlas Experiment at the LHC
- Do you have a machine park creating status data, and all it is used for is some green lights in the control room?
- You could store the data and try to find patterns
- In the past that would have been too expensive:
  - Too much data to store
  - Too much data to process
  - Too difficult to write software that finds these patterns
Another example: Web Logs
- Google: 1,470,000,000 visits a day
- Facebook: 952,000,000 visits a day
- Amazon: 153,000,000 visits a day
- One page view consists of many downloaded elements: images, scripts, style sheets
- The useful information is buried
- Raw data: main URL, date, user
Another example: Web Logs
- With that we can do statistics like: select main_url, count(distinct user) from web_log group by main_url
- So Hadoop is just another database then?
  - Yes, parallel processing in databases uses a similar approach
  - Yes, there is a simplified SQL add-on available for Hadoop called Hive
  - No, Hadoop is a programming environment
  - No, you can do much more than with SQL
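The SQL statement above is exactly the kind of logic one would hand-code inside a MapReduce job. As a minimal illustration, here is the same count(distinct user) per URL in plain Java; the two-column (url, user) record layout is a hypothetical simplification of the web log:

```java
import java.util.*;

public class WeblogStats {
    // Equivalent of: select main_url, count(distinct user)
    //                from web_log group by main_url
    static Map<String, Integer> distinctUsersPerUrl(List<String[]> webLog) {
        // Group: collect the set of users seen for each URL
        Map<String, Set<String>> usersByUrl = new HashMap<>();
        for (String[] row : webLog) { // row[0] = main_url, row[1] = user
            usersByUrl.computeIfAbsent(row[0], k -> new HashSet<>()).add(row[1]);
        }
        // Aggregate: the set size is the distinct user count
        Map<String, Integer> counts = new HashMap<>();
        usersByUrl.forEach((url, users) -> counts.put(url, users.size()));
        return counts;
    }
}
```

The point of the slide stands: SQL expresses this in one line, but the programming environment lets you go beyond what SQL can express.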
Another example: Web Logs
- The web log says: one person searched for "Nikon D700", the next page was the D700 product page, then the D800 product page, then the reviews, and finally the buy page
- The web log implicitly says:
  - There was a 5-second delay between the D700 page and the D800 page, so he did not know about the D800
  - On the D700 page he clicked on "There is a newer model of this item"
Another example: Web Logs
- Much more knowledge is hidden:
  - Should we promote the D800?
  - Should we send an email to all D700 buyers?
  - Was he an informed user, or did he read the reviews longer?
  - How good was the search result?
  - Did he consider a Canon camera?
  - Did the reviews lead to a buy decision?
- And turning the question upside down:
  - How often did a search on "D700" end in a D800 purchase?
  - Do we have so many negative reviews that sales are impacted, or is all fine?
Another example: Web Logs
- Do you have a web page for your company?
  - Analyze the web shop user patterns
  - Analyze the support activities
  - Analyze the usage patterns of the company presence web pages
MapReduce semantics
- Map takes a (key1, data1) pair and outputs zero to many new output pairs (key2, data2)
- Example: find all log lines of a given URL
  - Input: key1 = line number, data1 = web log line of text
  - Map logic: if data1 contains the URL, then output key2 = URL, data2 = constant 1; else output nothing
MapReduce semantics
- Reduce gets (key2, array of <data2>) and outputs (key3, data3)
- Example: count the URLs returned by the previous Map example (key2 = URL, data2 = constant 1)
  - Input: key2 = URL, array of data2 = <1,1,1,1,1,1>
  - Reduce logic: key3 = URL, data3 = 1+1+1+1+1+1 = 6
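The two steps above can be sketched as a small in-memory program, with no Hadoop involved. The tiny run() driver plays the role the Hadoop framework normally plays: call map on every line, group the pairs by key (the "shuffle"), then call reduce once per key. The class and method names are illustrative only:

```java
import java.util.*;

public class UrlCountSketch {

    // Map: (lineNo, logLine) -> (url, 1) if the line mentions the URL, else nothing
    static Optional<Map.Entry<String, Integer>> map(long lineNo, String logLine, String url) {
        if (logLine.contains(url)) {
            return Optional.of(new AbstractMap.SimpleEntry<>(url, 1));
        }
        return Optional.empty();
    }

    // Reduce: (url, <1,1,1,...>) -> (url, sum)
    static int reduce(String url, List<Integer> ones) {
        int sum = 0;
        for (int one : ones) sum += one;
        return sum;
    }

    // Driver standing in for the framework: map, shuffle (group by key), reduce
    static Map<String, Integer> run(List<String> logLines, String url) {
        Map<String, List<Integer>> grouped = new HashMap<>();
        long lineNo = 0;
        for (String line : logLines) {
            map(lineNo++, line, url).ifPresent(pair ->
                grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>())
                       .add(pair.getValue()));
        }
        Map<String, Integer> result = new HashMap<>();
        for (Map.Entry<String, List<Integer>> entry : grouped.entrySet()) {
            result.put(entry.getKey(), reduce(entry.getKey(), entry.getValue()));
        }
        return result;
    }
}
```

Because map sees one line at a time and reduce sees one key at a time, the framework is free to run many copies of each on different servers; that independence is what makes the pattern parallelizable.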
MapReduce semantics
- Three advantages of MapReduce:
  - It is simple to understand
  - It can be parallelized
  - Many data problems can be formalized as MapReduce
- Disadvantages:
  - It is not split-second
  - Building the logic is an IT task
  - It is not a general-purpose semantic
MapReduce semantics
- See http://highlyscalable.wordpress.com/2012/02/01/mapreducepatterns/ for examples:
  - Sort, join, aggregation, and other SQL-like operations
  - Mathematical topology problems
What is Hadoop?
- Hadoop is software that executes MapReduce tasks and provides the required supporting environment
- It distributes the task to as many servers as available
  - Hence each server needs access to the data: a clustered file system
- The more servers used, the more likely one will fail during execution
  - Fault tolerance of the file system
  - Fault tolerance of the MapReduce engine: distribute the MapReduce logic of the failed server to the others, but neither stop the job nor rerun it all
How to use Hadoop more easily?
- Currently it is all about simplifying the Hadoop user experience
- Pig Latin: a precompiler language

    input_lines = LOAD '/tmp/lots_of_text' AS (line:chararray);
    words = FOREACH input_lines GENERATE FLATTEN(TOKENIZE(line)) AS word;
    groups = GROUP words BY word;
    word_count = FOREACH groups GENERATE COUNT(words) AS count, group AS word;
    STORE word_count INTO '/tmp/words_used';

- HiveQL: a SQL-like add-on

    LOAD DATA LOCAL INPATH './weblog.csv' OVERWRITE INTO TABLE weblog;
    INSERT OVERWRITE TABLE urls SELECT a.url, count(*) FROM weblog a GROUP BY a.url;
How to use Hadoop more easily?
- SAP Data Services is an ETL tool
- We can read from and load into Hadoop (and many other databases)
- We can push down logic to all databases; in the case of Hadoop we can push down:
  - SQL-like operations by utilizing the Hive add-on
  - Pig scripts the customer has written
  - Our TextDataProcessing transform
How to use Hadoop more easily?
- Hadoop Hive is a SQL source in the BusinessObjects BI tools; see the session "SAP BusinessObjects BI 4.0 FP3 on Apache Hadoop Hive", Tuesday, September 11, 2012, 4:00 PM - 5:00 PM
What to use Hadoop for
- Hadoop as a huge disk array
- How would you implement a disk array with 100 disks?
  - Buy multiple NAS systems
  - Connect multiple large disk arrays to one computer
  - Or buy 20 PC-class computers with 6 disks each, at a fraction of the cost
    - Thanks to HDFS it will look like one large disk
    - If one computer dies or is not reachable, the system is not impacted
- Why?
  - Online archives you can query data from
  - Keep data you delete, or do not even collect, today
  - The database contains the structured data, Hadoop the unstructured
  - The database contains the measures, Hadoop the raw data
What to use Hadoop for
- Hadoop as a Data Warehouse database
- How do you build a DWH database today?
  - A classic RDBMS on a large server: higher maintenance costs, no fault tolerance
  - Or set up a Hadoop cluster with a few commodity PC-class computers, at a fraction of the cost
    - No transaction support, but that is not needed in a DWH either
    - Add-ons like Hive, Cassandra, and HBase make Hadoop look like a database
    - No ANSI SQL, so your BI tool needs to support each add-on individually
- Why?
  - Good for reporting kinds of applications
  - And to convert the raw data into measures loaded into the RDBMS
What to use Hadoop for
- Hadoop as a high-performance computing system
- Massively parallel systems are still specialized systems:
  - Weather simulation, crash simulations, image recognition
  - Data mining, neural networks
- Or you use Hadoop for a subset of the above; not everything can be expressed as MapReduce logic
  - When you have multiple independent sources: millions of customer reviews, perform text analysis on each
  - When many independent calculations are done on the same data: calculate different routes for the truck and pick the fastest
- Why?
  - In-depth statistics, data mining
  - Machine learning (see Mahout for automatic clustering etc.)
What to use Hadoop for
- Watch out for legal constraints
  - In Europe you are not allowed to store all the data you have access to, only data needed for the benefit of the customer
  - Even crawling external forums is questionable
  - In the US, racism and religious discrimination are sensitive topics
- Watch out for ethically questionable things
  - Many companies are in the news because of unethical actions
  - Not everything that is true needs to be made public
  - Not everything that helps the company make more money short term has a positive effect longer term
Enough talking, show me!
- The Mapper class gets a Long as key and Text as value, and returns Text as key and an Integer as value
- It maps one sentence to many (word, 1) tuples

    public static class MapClass extends MapReduceBase
            implements Mapper<LongWritable, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        public void map(LongWritable key, Text value,
                OutputCollector<Text, IntWritable> output, Reporter reporter)
                throws IOException {
            String line = value.toString();
            StringTokenizer itr = new StringTokenizer(line);
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                output.collect(word, one);
            }
        }
    }
Enough talking, show me!
- The Reducer gets Text as key and an array of 1s as values, and returns Text and a number
- For each word identified by the Map, it sums the 1s to get the overall count

    public static class Reduce extends MapReduceBase
            implements Reducer<Text, IntWritable, Text, IntWritable> {

        public void reduce(Text key, Iterator<IntWritable> values,
                OutputCollector<Text, IntWritable> output, Reporter reporter)
                throws IOException {
            int sum = 0;
            while (values.hasNext()) {
                sum += values.next().get();
            }
            output.collect(key, new IntWritable(sum));
        }
    }
Enough talking, show me!
- The main program defines the input/output formats, the classes, and the input and output files, and starts the job

    public void run(String inputPath, String outputPath) throws Exception {
        JobConf conf = new JobConf(WordCount.class);
        conf.setJobName("wordcount");

        // the keys are words (strings)
        conf.setOutputKeyClass(Text.class);
        // the values are counts (ints)
        conf.setOutputValueClass(IntWritable.class);

        conf.setMapperClass(MapClass.class);
        conf.setReducerClass(Reduce.class);

        FileInputFormat.addInputPath(conf, new Path(inputPath));
        FileOutputFormat.setOutputPath(conf, new Path(outputPath));

        JobClient.runJob(conf);
    }
Key Learnings
- Demystify Hadoop: it is not the answer to all questions
- Do not underestimate Hadoop: it enables you to have your own supercomputer
- Use cases range from simple storage of data to machine learning
- Formulate your query as MapReduce for the Hadoop engine
- What is Scott Adams' www.dilbert.com saying?
Thank you for participating. Please provide feedback on this session by completing a short survey via the event mobile application.
SESSION CODE: 0202
Learn more year-round at www.asug.com