September 10-13, 2012, Orlando, Florida. Another Buzz Word - Hadoop! Or Is That Something a Regular Person Can Use?


Learning Points
- What is Hadoop?
- What is Map Reduce?
- Ideas on when to use it

Hadoop
- Hadoop and Big Data are buzz words, and as such they are used by many people for many reasons.
- Hadoop is actually three things in one:
  - A clustered file system with fault tolerance
  - A Map-Reduce execution engine
  - An infrastructure that allows parallel execution in clusters
- Apache Hadoop is the first system that brings massively parallel computing to average companies at a low cost.
- What do you need a supercomputer for?

Atlas Experiment at the LHC
- The LHC ring has a perimeter of 26,659 m.
- The speed of light is 300,000,000 m/s, hence one particle packet revolves about 11,000 times per second.
- With 2,800 packets in the ring simultaneously, that yields up to 40 million collisions per second.

Atlas Experiment at LHC
- The result is 40,000,000 measurements per second.
- Atlas has 150,000,000 electric connectors.
- 40 * 10^6 * 150 * 10^6 = 6 * 10^15 values per second. (Note: 1 TB = 10^12 bytes.)
- How do you deal with these volumes? You map the sensor readouts to particle movement vectors, e.g. "two particles moved through the following sensors" (Map-Reduce logic built into hardware for the level-1 trigger).

Atlas Experiment at LHC
- Do you have a machine park creating status data that is only used for a few green lights in the control room?
- You could store the data and try to find patterns in it.
- In the past that would have been too expensive:
  - Too much data to store
  - Too much data to process
  - Too difficult to write software that finds these patterns

Another example: Web Logs
- Google: 1,470,000,000 visits a day
- Facebook: 952,000,000 visits per day
- Amazon: 153,000,000 visits per day
- One page view consists of many downloaded elements: images, scripts, style sheets.
- The useful information is buried. Raw data: main URL, date, user.

Another example: Web Logs
With that we can compute statistics like:

select main_url, count(distinct user) from web_log group by main_url

So Hadoop is just another database then?
- Yes, parallel processing in databases uses a similar approach.
- Yes, there is a simplified SQL add-on available for Hadoop, called Hive.
- No, Hadoop is a programming environment.
- No, you can do much more than with SQL.

Another example: Web Logs
The weblog says:
- One person searched for a Nikon D700.
- The next page was the D700 product page.
- Then the D800 product page.
- Then the reviews.
- Finally the buy page.
The weblog implicitly says:
- There was only a five-second delay between the D700 page and the D800 page, so he did not know about the D800 beforehand.
- On the D700 page he clicked on the "There is a newer model of this item" link.

Another example: Web Logs
Much more knowledge is hidden in there:
- Should we promote the D800? Should we send an email to all D700 buyers?
- Was he an informed user, or did he spend longer reading reviews?
- How good was the search result?
- Did he consider a Canon camera?
- Did the reviews lead to a buy decision?
And turning the question upside down:
- How often did a search on the D700 end in a D800 buy?
- Do we have so many negative reviews that sales are impacted, or is all fine?

Another example: Web Logs
Do you have a web page for your company?
- Analyze the web shop user patterns.
- Analyze the support activities.
- Analyze the usage patterns of the company presence web pages.

Map Reduce semantics
Map takes a (key1, data1) pair and outputs zero to many new output pairs (key2, data2).
Example: find all log lines for a given URL.
Input:
- Key1: line number
- Data1: web log line of text
Map logic:
- If Data1 contains the URL, then output Key2: URL, Data2: constant 1
- Else output nothing
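A minimal sketch of this Map step in Java, using the same old org.apache.hadoop.mapred API as the WordCount code later in this deck; the TARGET_URL constant and the log-line layout are assumptions made for illustration:

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class UrlFilterMapper extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, IntWritable> {

    // Hypothetical URL we are looking for in each log line
    private static final String TARGET_URL = "/products/nikon-d700";
    private static final IntWritable ONE = new IntWritable(1);

    public void map(LongWritable key, Text value,
            OutputCollector<Text, IntWritable> output, Reporter reporter)
            throws IOException {
        // Key1 is the line offset, Data1 the log line; emit (URL, 1)
        // for matching lines and nothing otherwise
        if (value.toString().contains(TARGET_URL)) {
            output.collect(new Text(TARGET_URL), ONE);
        }
    }
}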

Map Reduce semantics
Reduce gets (key2, array of <data2>) and outputs (key3, data3).
Example: count the URLs returned by the previous Map example (key2 = URL, data2 = constant 1).
Input:
- Key2: URL
- Array of data2: <1,1,1,1,1,1>
Reduce logic:
- Key3: URL
- Data3: 1+1+1+1+1+1 = 6
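The matching Reduce step, again as a minimal illustrative sketch in the old API rather than the deck's original code:

import java.io.IOException;
import java.util.Iterator;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

public class UrlCountReducer extends MapReduceBase
        implements Reducer<Text, IntWritable, Text, IntWritable> {

    public void reduce(Text key, Iterator<IntWritable> values,
            OutputCollector<Text, IntWritable> output, Reporter reporter)
            throws IOException {
        // Sum the constant 1s collected for this URL: <1,1,1,1,1,1> -> 6
        int sum = 0;
        while (values.hasNext()) {
            sum += values.next().get();
        }
        output.collect(key, new IntWritable(sum));
    }
}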

Map Reduce semantics
Three advantages of MapReduce:
- It is simple to understand.
- It can be parallelized.
- Many data problems can be formalized as MapReduce.
Disadvantages:
- It is not split-second (batch oriented rather than interactive).
- Building the logic is an IT task.
- It is not a general-purpose semantic.

Map Reduce semantics
See the following page for examples: http://highlyscalable.wordpress.com/2012/02/01/mapreducepatterns/
- Sort, join, aggregation, and other SQL-like operations
- Mathematical topology problems

What is Hadoop?
- Hadoop is software that executes MapReduce tasks and provides the required supporting environment.
- It distributes the task across as many servers as are available.
- Hence each server needs to have access to the data: a clustered file system.
- The more servers are used, the more likely one will fail during execution. Therefore:
  - Fault tolerance of the file system
  - Fault tolerance of the MapReduce engine: distribute the MapReduce work of the failed server to the others, but neither stop the job nor rerun it all

How to use Hadoop more easily?
Currently it is all about simplifying the Hadoop user experience.

Pig Latin, a precompiler language:

input_lines = LOAD '/tmp/lots_of_text' AS (line:chararray);
words = FOREACH input_lines GENERATE FLATTEN(TOKENIZE(line)) AS word;
groups = GROUP words BY word;
word_count = FOREACH groups GENERATE COUNT(words) AS count, group AS word;
STORE word_count INTO '/tmp/words_used';

HiveQL, a SQL-like add-on:

LOAD DATA LOCAL INPATH './weblog.csv' OVERWRITE INTO TABLE weblog;
INSERT OVERWRITE TABLE urls SELECT a.url, count(*) FROM weblog a GROUP BY a.url;
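For orientation, scripts like these would typically be run from the command line; the file names here are hypothetical:

pig wordcount.pig
hive -f count_urls.hql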

How to use Hadoop more easily?
- SAP Data Services is an ETL tool.
- It can read from and load into Hadoop (and many other databases).
- It can push down logic to all databases. In the case of Hadoop it can push down:
  - SQL-like operations, by utilizing the Hive add-on
  - Pig scripts the customer has written
  - Our TextDataProcessing transform

How to use Hadoop more easily?
Hadoop Hive is a SQL source in BusinessObjects BI tools; see the session "SAP BusinessObjects BI 4.0 FP3 on Apache Hadoop Hive", Tuesday, September 11, 2012, 4:00 PM - 5:00 PM.

What to use Hadoop for
Hadoop as a huge disk array. How would you implement a disk array with 100 disks?
- Buy multiple NAS systems, or connect multiple large disk arrays to one computer.
- Or you buy 20 PC-class computers with 6 disks each:
  - A fraction of the cost
  - Thanks to HDFS it will look like one large disk (see the sketch after this list)
  - If one computer dies or is not reachable, the system is not impacted
Why?
- Online archives you can query data from
- Keep data you delete, or do not even collect, today
- The database contains the structured data, Hadoop the unstructured.
- The database contains the measures, Hadoop the raw data.
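A minimal sketch of how HDFS presents the cluster's disks as one file system, using the standard org.apache.hadoop.fs API; the path is hypothetical and the cluster address is assumed to come from the usual Hadoop configuration files:

import java.io.BufferedReader;
import java.io.InputStreamReader;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsRead {
    public static void main(String[] args) throws Exception {
        // Picks up the cluster address from the Hadoop configuration on the classpath
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // One logical path; HDFS decides which machines and disks hold the blocks
        Path file = new Path("/archive/weblog-2012-09-10.csv");
        BufferedReader reader = new BufferedReader(new InputStreamReader(fs.open(file)));
        System.out.println(reader.readLine());
        reader.close();
    }
}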

What to use Hadoop for
Hadoop as a Data Warehouse database. How do you build a DWH database today?
- A classic RDBMS on a large server:
  - Higher maintenance costs
  - No fault tolerance
- Or set up a Hadoop cluster with a few commodity PC-class computers:
  - A fraction of the cost
  - No transaction support though, but that is not needed in a DWH either
  - Add-ons like Hive, Cassandra, and HBase make Hadoop look like a database
  - No ANSI SQL, so your BI tool needs to support each add-on
Why?
- Good for reporting kinds of applications
- And to convert the raw data into measures loaded into the RDBMS

What to use Hadoop for
Hadoop as a high-performance computing system.
- Massively parallel systems are still specialized systems:
  - Weather simulations, crash simulations, image recognition
  - Data mining, neural networks
- Or you use Hadoop for a subset of the above; not everything can be expressed as MapReduce logic:
  - When you have multiple independent sources: millions of customer reviews, perform text analysis on each
  - When many independent calculations are done on the same data: calculate different routes for the truck and pick the fastest
Why?
- In-depth statistics, data mining
- Machine learning (see Mahout for automatic clustering etc.)

What to use Hadoop for
Watch out for legal constraints:
- In Europe you are not allowed to store all data you have access to, only data needed for the benefit of the customer.
- Even crawling external forums is questionable.
- In the US, racism and religious discrimination are sensitive topics.
Watch out for ethically questionable things:
- So many companies are in the news because of unethical actions.
- Not everything that is true needs to be made public.
- Not everything that helps the company make more money short term has a positive effect longer term.

Enough talking, show me!
The Mapper class gets a Long as key and Text as value, and returns Text as key and an Integer as value. It maps one sentence to many (word, 1) tuples.

public static class MapClass extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, IntWritable> {

    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(LongWritable key, Text value,
            OutputCollector<Text, IntWritable> output, Reporter reporter)
            throws IOException {
        // Split the line into tokens and emit (word, 1) for each
        String line = value.toString();
        StringTokenizer itr = new StringTokenizer(line);
        while (itr.hasMoreTokens()) {
            word.set(itr.nextToken());
            output.collect(word, one);
        }
    }
}

Enough talking, show me!
The Reducer gets Text as key and an array of 1s as values, and returns Text and a number. For each word identified by the Map, it sums up the 1s to get the overall count.

public static class Reduce extends MapReduceBase
        implements Reducer<Text, IntWritable, Text, IntWritable> {

    public void reduce(Text key, Iterator<IntWritable> values,
            OutputCollector<Text, IntWritable> output, Reporter reporter)
            throws IOException {
        // Add up the 1s emitted by the Mapper for this word
        int sum = 0;
        while (values.hasNext()) {
            sum += values.next().get();
        }
        output.collect(key, new IntWritable(sum));
    }
}

Enough talking, show me!
The main program defines the input/output formats, the Mapper and Reducer classes, and the input and output files, and then starts the job.

public void run(String inputPath, String outputPath) throws Exception {
    JobConf conf = new JobConf(WordCount.class);
    conf.setJobName("wordcount");

    // the keys are words (strings)
    conf.setOutputKeyClass(Text.class);
    // the values are counts (ints)
    conf.setOutputValueClass(IntWritable.class);

    conf.setMapperClass(MapClass.class);
    conf.setReducerClass(Reduce.class);

    FileInputFormat.addInputPath(conf, new Path(inputPath));
    FileOutputFormat.setOutputPath(conf, new Path(outputPath));

    JobClient.runJob(conf);
}
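Assuming the class is packaged as wordcount.jar with a main() that calls run(), the job would be submitted to the cluster roughly like this (jar name and paths are hypothetical):

hadoop jar wordcount.jar WordCount /tmp/lots_of_text /tmp/words_used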

Key Learnings
- Demystify Hadoop: it is not the answer to all questions.
- Do not underestimate Hadoop: it enables you to have your own supercomputer.
- Use cases range from simple storage of data to machine learning.
- Formulate your query as MapReduce for the Hadoop engine.
What is Scott Adams' www.dilbert.com saying?

Thank you for participating. Please provide feedback on this session by completing a short survey via the event mobile application.
SESSION CODE: 0202
Learn more year-round at www.asug.com