Filtering Patterns



Similar documents
Extreme Computing. Hadoop MapReduce in more detail.

Getting to know Apache Hadoop

Big Data and Scripting map/reduce in Hadoop

Chulalongkorn University International School of Engineering Department of Computer Engineering Computer Programming Lab.

Word Count Code using MR2 Classes and API

Xiaoming Gao Hui Li Thilina Gunarathne

Introduction to MapReduce and Hadoop

COSC 6397 Big Data Analytics. Mahout and 3 rd homework assignment. Edgar Gabriel Spring Mahout

Parallel Programming Map-Reduce. Needless to Say, We Need Machine Learning for Big Data

Data Science in the Wild

Hadoop Configuration and First Examples

Mrs: MapReduce for Scientific Computing in Python

5 HDFS - Hadoop Distributed System

Processing of massive data: MapReduce. 2. Hadoop. New Trends In Distributed Systems MSc Software and Systems

Hadoop WordCount Explained! IT332 Distributed Systems

CSE 1223: Introduction to Computer Programming in Java Chapter 7 File I/O

COSC 6397 Big Data Analytics. Distributed File Systems (II) Edgar Gabriel Spring HDFS Basics

BIG DATA, MAPREDUCE & HADOOP

Internals of Hadoop Application Framework and Distributed File System

Creating a Simple, Multithreaded Chat System with Java

Zebra and MapReduce. Table of contents. 1 Overview Hadoop MapReduce APIs Zebra MapReduce APIs Zebra MapReduce Examples...

Istanbul Şehir University Big Data Camp 14. Hadoop Map Reduce. Aslan Bakirov Kevser Nur Çoğalmış

Lambda Architecture. CSCI 5828: Foundations of Software Engineering Lecture 29 12/09/2014

Hadoop MapReduce Tutorial - Reduce Comp variability in Data Stamps

Hadoop and Eclipse. Eclipse Hawaii User s Group May 26th, Seth Ladd

Running Hadoop on Windows CCNP Server

Project 5 Twitter Analyzer Due: Fri :59:59 pm

HPCHadoop: MapReduce on Cray X-series

map/reduce connected components

Hadoop Design and k-means Clustering

Hadoop Lab Notes. Nicola Tonellotto November 15, 2010

Working With Hadoop. Important Terminology. Important Terminology. Anatomy of MapReduce Job Run. Important Terminology

Map Reduce Workflows

Hadoop Basics with InfoSphere BigInsights

Creating.NET-based Mappers and Reducers for Hadoop with JNBridgePro

Big Data for the JVM developer. Costin Leau,

Big Data & Scripting Part II Streaming Algorithms

Hadoop/MapReduce. Object-oriented framework presentation CSCI 5448 Casey McTaggart

How To Write A Map In Java (Java) On A Microsoft Powerbook 2.5 (Ahem) On An Ipa (Aeso) Or Ipa 2.4 (Aseo) On Your Computer Or Your Computer

Step 4: Configure a new Hadoop server This perspective will add a new snap-in to your bottom pane (along with Problems and Tasks), like so:

Hadoop Framework. technology basics for data scientists. Spring Jordi Torres, UPC - BSC

13 File Output and Input

Hadoop Streaming coreservlets.com and Dima May coreservlets.com and Dima May

Map-Reduce and Hadoop

Tutorial- Counting Words in File(s) using MapReduce


Tutorial: Big Data Algorithms and Applications Under Hadoop KUNPENG ZHANG SIDDHARTHA BHATTACHARYYA

Big Data Management and NoSQL Databases

Scalable Computing with Hadoop

Big Data and Apache Hadoop s MapReduce

Elastic Map Reduce. Shadi Khalifa Database Systems Laboratory (DSL)

Introduction To Hadoop

Data Science Analytics & Research Centre

CS 2112 Spring Instructions. Assignment 3 Data Structures and Web Filtering. 0.1 Grading. 0.2 Partners. 0.3 Restrictions

Hadoop and ecosystem * The views in this document represent only the author's personal opinions * Some figures in this document come from the Internet. Information Management. IBM CDL Lab

COSC 6397 Big Data Analytics. 2 nd homework assignment Pig and Hive. Edgar Gabriel Spring 2015

OLH: Oracle Loader for Hadoop OSCH: Oracle SQL Connector for Hadoop Distributed File System (HDFS)

MULTIPLE CHOICE. Choose the one alternative that best completes the statement or answers the question.

Kafka & Redis for Big Data Solutions

Data Intensive Computing Handout 6 Hadoop

Report Vertiefung, Spring 2013 Constant Interval Extraction using Hadoop

Distributed Apriori in Hadoop MapReduce Framework

Data Intensive Computing Handout 5 Hadoop

MR-(Mapreduce Programming Language)

Data Structures for Big Data: Bloom Filter. Vinicius Vielmo Cogo Smalltalks, DI, FC/UL. October 16, 2014.

Copy the.jar file into the plugins/ subfolder of your Eclipse installation. (e.g., C:\Program Files\Eclipse\plugins)

Introduction to Hadoop HDFS and Ecosystems. Slides credits: Cloudera Academic Partners Program & Prof. De Liu, MSBA 6330 Harvesting Big Data

Hadoop Streaming. Table of contents

Parallel Databases. Parallel Architectures. Parallelism Terminology 1/4/2015. Increase performance by performing operations in parallel

Basic Programming and PC Skills: Basic Programming and PC Skills:

CS242 PROJECT. Presented by Moloud Shahbazi Spring 2015

Project Report BIG-DATA CONTENT RETRIEVAL, STORAGE AND ANALYSIS FOUNDATIONS OF DATA-INTENSIVE COMPUTING. Masters in Computer Science

Introduction to Hadoop

HADOOP MOCK TEST HADOOP MOCK TEST II

Explain the relationship between a class and an object. Which is general and which is specific?

AP Computer Science Java Subset

JDK 1.5 Updates for Introduction to Java Programming with SUN ONE Studio 4

BIG DATA APPLICATIONS

CSE 344 Introduction to Data Management. Section 9: AWS, Hadoop, Pig Latin TA: Yi-Shu Wei

Introduction to Hadoop

Introduction to Java

Big Data Technology Map-Reduce Motivation: Indexing in Search Engines

PaRFR : Parallel Random Forest Regression on Hadoop for Multivariate Quantitative Trait Loci Mapping. Version 1.0, Oct 2012

CS506 Web Design and Development Solved Online Quiz No. 01

and HDFS for Big Data Applications Serge Blazhievsky Nice Systems

Word count example Abdalrahman Alsaedi

Evaluation. Copy. Evaluation Copy. Chapter 7: Using JDBC with Spring. 1) A Simpler Approach ) The JdbcTemplate. Class...

Agenda. ! Strengths of PostgreSQL. ! Strengths of Hadoop. ! Hadoop Community. ! Use Cases

Big Data: What (you can get out of it), How (to approach it) and Where (this trend can be applied) natalia_ponomareva sobachka yahoo.

Handout 3 cs180 - Programming Fundamentals Spring 15 Page 1 of 6. Handout 3. Strings and String Class. Input/Output with JOptionPane.

Hadoop at Yahoo! Owen O Malley Yahoo!, Grid Team owen@yahoo-inc.com

Part I. Multiple Choice Questions (2 points each):

Transcription:

2/23/15 CS480 A2 Introduction to Big Data - Spring 2015
Computer Science, Colorado State University
http://www.cs.colostate.edu/~cs480

PART 1. MAPREDUCE AND THE NEW SOFTWARE STACK
5. FILTERING PATTERNS

This material is built based on: MapReduce Design Patterns: Building Effective Algorithms and Analytics for Hadoop and Other Systems, by Donald Miner and Adam Shook, November 2012.

Course outline:
PART 0. INTRODUCTION TO BIG DATA
PART 1. MAPREDUCE AND THE NEW SOFTWARE STACK
  1. Distributed file system
  2. MapReduce computing framework
  3. Finding similar items at scale
  4. Link analysis
  5. Filtering patterns

Patterns covered in this class
  1. Simple Random Sampling
  2. Bloom filter
  3. Top 10
  4. Distinct

Filtering pattern
- Provides an abstract of existing data.
- Many data filtering jobs do not require the reduce part of MapReduce, because filtering does not produce an aggregation.

Filtering Pattern 1. Simple Random Sampling
- Each record has an equal probability of being selected.
- Useful for sizing down a data set, e.g., for a representative analysis or a closer view of the data.
- Known uses: tracking a thread of events, distributed grep, data cleaning, removing low-scoring data.
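The sampling logic itself is independent of Hadoop. A minimal plain-Java sketch of the same idea (the class and method names here are my own, not from the slides): every record is kept independently with probability p, so no reduce step is needed.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

// Simple random (Bernoulli) sampling: each record is kept independently
// with probability p, giving every record an equal chance of selection.
public class SimpleRandomSample {
    public static List<String> sample(List<String> records, double p, long seed) {
        Random rands = new Random(seed);      // seeded for reproducibility
        List<String> kept = new ArrayList<>();
        for (String record : records) {
            if (rands.nextDouble() < p) {     // same test the mapper performs
                kept.add(record);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<String> data = new ArrayList<>();
        for (int i = 0; i < 10000; i++) data.add("record-" + i);
        // Roughly 1% of 10,000 records should survive the filter
        System.out.println(sample(data, 0.01, 42L).size());
    }
}
```

The MapReduce version below is the same loop distributed across mappers, with p delivered through the job Configuration.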

The structure of the simple filter pattern
- A map-only job: each mapper evaluates the filter condition and emits only the records that pass; no reducer is needed.

Writing a Simple Random Sampling filter

    public static class SRSMapper
        extends Mapper<Object, Text, NullWritable, Text> {
      private Random rands = new Random();
      private Double percentage;

      protected void setup(Context context)
          throws IOException, InterruptedException {
        // Retrieve the percentage that is passed in via the configuration,
        // like this: conf.set("filter_percentage", ".5"); // for .5%
        String strPercentage = context.getConfiguration()
            .get("filter_percentage");
        percentage = Double.parseDouble(strPercentage) / 100.0;
      }

      public void map(Object key, Text value, Context context)
          throws IOException, InterruptedException {
        if (rands.nextDouble() < percentage) {
          context.write(NullWritable.get(), value);
        }
      }
    }

Filtering Pattern 2. Bloom Filtering
- Checks the membership of a set.
- Known uses: removing most of the non-member values; prefiltering a data set before an expensive set-membership check.

What is a Bloom filter?
- Introduced by Burton Howard Bloom in 1970.
- A probabilistic data structure used to test whether a member is an element of a set.
- Strong space advantage.

Parameters:
  m: the number of bits in the filter
  n: the number of members in the set
  p: the desired false positive rate
  k: the number of different hash functions used to map some element to one of the m bits with a uniform random distribution

Example: m = 8 (bits in the filter), n = 3 (members in the set), T = {5, 10, 15}

  Initial bloom filter   0 0 0 0 0 0 0 0

m = 8, n = 3, target set T = {5, 10, 15}
(bit vectors are shown from bit 7 on the left down to bit 0 on the right)

Inserting 5: h1(5) = 7, h2(5) = 5, h3(5) = 5
  Initial bloom filter  0 0 0 0 0 0 0 0
  After h1(5) = 7       1 0 0 0 0 0 0 0
  After h2(5) = 5       1 0 1 0 0 0 0 0
  After h3(5) = 5       1 0 1 0 0 0 0 0

Inserting 10: h1(10) = 6, h2(10) = 7, h3(10) = 2
  After h1(10) = 6      1 1 1 0 0 0 0 0
  After h2(10) = 7      1 1 1 0 0 0 0 0
  After h3(10) = 2      1 1 1 0 0 1 0 0

Inserting 15: h1(15) = 5, h2(15) = 1, h3(15) = 7
  After h1(15) = 5      1 1 1 0 0 1 0 0
  After h2(15) = 1      1 1 1 0 0 1 1 0
  After h3(15) = 7      1 1 1 0 0 1 1 0

The filter now holds 1 1 1 0 0 1 1 0, used for all membership checks below.

Is 5 part of set T?
  Check h1(5) = 7       bit 7 is 1
  Check h2(5) = 5       bit 5 is 1
  Check h3(5) = 5       bit 5 is 1
  The h1(5), h2(5), h3(5)-th bits are all 1: 5 is probably a part of set T.

Is 8 part of set T?
  Check h1(8) = 0       bit 0 is 0
  Check h2(8) = 3       bit 3 is 0
  Check h3(8) = 0       bit 0 is 0
  8 is NOT a part of set T.

Is 9 part of set T?
  Check h1(9) = 3       bit 3 is 0
  Check h2(9) = 5       bit 5 is 1
  Check h3(9) = 1       bit 1 is 1
  h1(9) maps to a 0 bit: 9 is NOT a part of set T.
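The worked example above can be reproduced with a tiny bit-vector sketch. The three "hash functions" here are fixed lookup tables holding the toy values from the slides (illustrative, not real hashes), and the class name is my own:

```java
import java.util.BitSet;
import java.util.HashMap;
import java.util.Map;

// Toy 8-bit Bloom filter mirroring the slides: T = {5, 10, 15},
// k = 3 hash functions given as fixed lookup tables.
public class ToyBloomFilter {
    static final Map<Integer, int[]> HASHES = new HashMap<>();
    static {
        HASHES.put(5,  new int[]{7, 5, 5});
        HASHES.put(10, new int[]{6, 7, 2});
        HASHES.put(15, new int[]{5, 1, 7});
        HASHES.put(8,  new int[]{0, 3, 0});
        HASHES.put(9,  new int[]{3, 5, 1});
        HASHES.put(7,  new int[]{7, 1, 7});
    }

    private final BitSet bits = new BitSet(8);   // m = 8

    public void add(int element) {
        for (int b : HASHES.get(element)) bits.set(b);
    }

    // True means "probably in the set"; false means "definitely not".
    public boolean membershipTest(int element) {
        for (int b : HASHES.get(element)) {
            if (!bits.get(b)) return false;
        }
        return true;
    }

    // Convenience: build the filter for T = {5, 10, 15} and test one element.
    public static boolean probablyContains(int element) {
        ToyBloomFilter filter = new ToyBloomFilter();
        filter.add(5);
        filter.add(10);
        filter.add(15);
        return filter.membershipTest(element);
    }

    public static void main(String[] args) {
        System.out.println(probablyContains(5));  // true: probably in T
        System.out.println(probablyContains(8));  // false: definitely not in T
        System.out.println(probablyContains(7));  // true, yet 7 is not in T
    }
}
```

Note that 7 passes the test even though it was never inserted: that is exactly the false positive case discussed next.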

Is 7 part of set T?
  Check h1(7) = 7       bit 7 is 1
  Check h2(7) = 1       bit 1 is 1
  Check h3(7) = 7       bit 7 is 1
  The h1(7), h2(7), h3(7)-th bits are all 1: 7 is probably a part of set T,
  even though 7 was never inserted. This is a false positive.

9/22/14 CS535/581 Big Data - Fall 2014

Hash functions
- k hash functions, each with a uniform random distribution over [0..m).
- Cryptographic hash functions: MD5, SHA-1, SHA-256, Tiger, Whirlpool, ...
- Murmur hashes (non-cryptographic).

False positive rate (1/2)

  fpr = (1 - (1 - 1/m)^(kn))^k ≈ (1 - e^(-kn/m))^k

  m = number of bits in the filter
  n = number of elements
  k = number of hash functions

False positive rate (2/2)
- A Bloom filter with an optimal value for k and a 1% error rate only needs 9.6 bits per key.
- Add 4.8 bits/key and the error rate decreases by 10 times.
- 10,000 words with a 1% error rate and 7 hash functions: ~12 KB of memory.
- 10,000 words with a 0.1% error rate and 11 hash functions: ~18 KB of memory.

Use cases
- Representing a very large dataset compactly.
- Reducing queries to an external database.
- Google BigTable.
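The sizing figures above follow directly from the formulas: the optimal bits per key is m/n = -ln(p) / (ln 2)^2, and the optimal hash count is k = (m/n) ln 2. A small sketch that recomputes them (method names are my own):

```java
// Recomputes the slide's Bloom filter sizing figures from the formulas:
//   optimal bits per key:  m/n = -ln(p) / (ln 2)^2
//   optimal hash count:    k   = (m/n) * ln 2
public class BloomSizing {
    public static double bitsPerKey(double p) {
        return -Math.log(p) / (Math.log(2) * Math.log(2));
    }

    public static double optimalK(double p) {
        return bitsPerKey(p) * Math.log(2);
    }

    public static double kilobytesFor(int n, double p) {
        return n * bitsPerKey(p) / 8.0 / 1000.0;   // bits -> bytes -> KB
    }

    public static void main(String[] args) {
        // 1% error rate: ~9.6 bits/key, k ~ 7, 10,000 words -> ~12 KB
        System.out.printf("1%%:   %.1f bits/key, k=%.1f, %.1f KB%n",
                bitsPerKey(0.01), optimalK(0.01), kilobytesFor(10000, 0.01));
        // 0.1% error rate: ~4.8 more bits/key, 10,000 words -> ~18 KB
        System.out.printf("0.1%%: %.1f bits/key, k=%.1f, %.1f KB%n",
                bitsPerKey(0.001), optimalK(0.001), kilobytesFor(10000, 0.001));
    }
}
```

The computed optimal k for a 0.1% error rate comes out near 10; the slide's figure of 11 hash functions gives a filter of similar size.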

Downsides
- False positive rate.
- Hard to remove elements from a Bloom filter set: setting bits back to zero is unsafe, because often more than one element hashes to a particular bit.
- Alternative: a Counting Bloom filter, which stores a count of occurrences instead of a single bit, but requires more memory.

Calculating optimal m (with an optimal k)

  m = -n ln(p) / (ln 2)^2

    public static int getOptimalBloomSize(int numElements, float falsePosRate) {
      return (int) (-numElements * (float) Math.log(falsePosRate)
          / Math.pow(Math.log(2), 2));
    }

Calculating optimal k

  k = (m / n) ln(2)

    public static int getOptimalK(float numElements, float vectorSize) {
      return (int) Math.round(vectorSize * Math.log(2) / numElements);
    }

The structure of the Bloom filter pattern
- Two stages: first build (train) the Bloom filter from the membership file and load it; then each mapper runs the Bloom filter test on every record, keeping the "maybe" records and discarding the "no" records.

Writing a Bloom filter (1/2)

    public class BloomDriver {
      public static void main(String[] args) throws Exception {
        // Parse command line arguments
        Path inputFile = new Path(args[0]);
        int numMembers = Integer.parseInt(args[1]);
        float falsePosRate = Float.parseFloat(args[2]);
        Path bfFile = new Path(args[3]);

        // Calculate our vector size and optimal K value based on approximations
        int vectorSize = getOptimalBloomSize(numMembers, falsePosRate);
        int nbHash = getOptimalK(numMembers, vectorSize);

        // Create new Bloom filter
        BloomFilter filter = new BloomFilter(vectorSize, nbHash, Hash.MURMUR_HASH);

        // Open file for read
        String line = null;
        int numElements = 0;
        FileSystem fs = FileSystem.get(new Configuration());

Writing a Bloom filter (2/2)

        for (FileStatus status : fs.listStatus(inputFile)) {
          BufferedReader rdr = new BufferedReader(new InputStreamReader(
              new GZIPInputStream(fs.open(status.getPath()))));
          System.out.println("Reading " + status.getPath());
          while ((line = rdr.readLine()) != null) {
            filter.add(new Key(line.getBytes()));
            ++numElements;
          }
          rdr.close();
        }
        System.out.println("Trained Bloom filter with " + numElements + " entries.");
        System.out.println("Serializing Bloom filter to HDFS at " + bfFile);
        FSDataOutputStream strm = fs.create(bfFile);
        filter.write(strm);
        strm.flush();
        strm.close();
        System.exit(0);
      }
    }
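For concrete numbers, here is a self-contained sketch that re-declares the two sizing helpers (copied so it compiles on its own) and applies them to a hypothetical case of 10,000 members at a 1% false positive rate:

```java
// Self-contained copies of the sizing helpers, applied to one concrete case.
public class BloomSizingDemo {
    public static int getOptimalBloomSize(int numElements, float falsePosRate) {
        return (int) (-numElements * (float) Math.log(falsePosRate)
                / Math.pow(Math.log(2), 2));
    }

    public static int getOptimalK(float numElements, float vectorSize) {
        return (int) Math.round(vectorSize * Math.log(2) / numElements);
    }

    public static void main(String[] args) {
        // 10,000 members, 1% false positive rate:
        // vector size ~ 95,850 bits (~12 KB), 7 hash functions
        int vectorSize = getOptimalBloomSize(10000, 0.01f);
        int nbHash = getOptimalK(10000, vectorSize);
        System.out.println(vectorSize + " bits, " + nbHash + " hash functions");
    }
}
```

This matches the earlier slide's figure of ~12 KB and 7 hash functions for 10,000 words at a 1% error rate.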

Writing a Bloom filter mapper (1/2)

    public static class BloomFilteringMapper
        extends Mapper<Object, Text, Text, NullWritable> {
      private BloomFilter filter = new BloomFilter();

      protected void setup(Context context)
          throws IOException, InterruptedException {
        // Get file from the DistributedCache
        URI[] files = DistributedCache.getCacheFiles(context.getConfiguration());
        System.out.println("Reading Bloom filter from: " + files[0].getPath());
        // Open local file for read.
        DataInputStream strm = new DataInputStream(
            new FileInputStream(files[0].getPath()));
        // Read into our Bloom filter.
        filter.readFields(strm);
        strm.close();
      }

Writing a Bloom filter mapper (2/2)

      public void map(Object key, Text value, Context context)
          throws IOException, InterruptedException {
        Map<String, String> parsed = transformXmlToMap(value.toString());
        // Get the value for the comment
        String comment = parsed.get("Text");
        StringTokenizer tokenizer = new StringTokenizer(comment);
        // For each word in the comment
        while (tokenizer.hasMoreTokens()) {
          // If the word is in the filter, output the record and break
          String word = tokenizer.nextToken();
          if (filter.membershipTest(new Key(word.getBytes()))) {
            context.write(value, NullWritable.get());
            break;
          }
        }
      }
    }

Filtering Pattern 3. Top 10
- Retrieves a relatively small number (top K) of records, according to a ranking scheme in your dataset, no matter how large the data.
- Known uses: outlier analysis, selecting interesting data, catchy dashboards.

The structure of the Top 10 pattern
- Each TopTenMapper keeps a local top ten and emits it in cleanup(); a single reducer merges all the local top tens into the final TopTen output.

Mapper

    public static class TopTenMapper
        extends Mapper<Object, Text, NullWritable, Text> {
      // Stores a map of user reputation to the record
      private TreeMap<Integer, Text> repToRecordMap = new TreeMap<Integer, Text>();

      public void map(Object key, Text value, Context context)
          throws IOException, InterruptedException {
        Map<String, String> parsed = transformXmlToMap(value.toString());
        String userId = parsed.get("Id");
        String reputation = parsed.get("Reputation");
        // Add this record to our map with the reputation as the key
        repToRecordMap.put(Integer.parseInt(reputation), new Text(value));
        // If we have more than ten records, remove the one with the lowest rep.
        // As this tree map is sorted in ascending order, the user with
        // the lowest reputation is the first key.
        if (repToRecordMap.size() > 10) {
          repToRecordMap.remove(repToRecordMap.firstKey());
        }
      }

      protected void cleanup(Context context)
          throws IOException, InterruptedException {
        // Output our ten records to the reducers with a null key
        for (Text t : repToRecordMap.values()) {
          context.write(NullWritable.get(), t);
        }
      }
    }

Reducer

    public static class TopTenReducer
        extends Reducer<NullWritable, Text, NullWritable, Text> {
      // Stores a map of user reputation to the record
      private TreeMap<Integer, Text> repToRecordMap = new TreeMap<Integer, Text>();

      public void reduce(NullWritable key, Iterable<Text> values, Context context)
          throws IOException, InterruptedException {
        for (Text value : values) {
          Map<String, String> parsed = transformXmlToMap(value.toString());
          repToRecordMap.put(Integer.parseInt(parsed.get("Reputation")),
              new Text(value));
          // If we have more than ten records, remove the one with the lowest rep.
          // As this tree map is sorted in ascending order, the user with
          // the lowest reputation is the first key.
          if (repToRecordMap.size() > 10) {
            repToRecordMap.remove(repToRecordMap.firstKey());
          }
        }
        // Output our ten records to the file system with a null key,
        // highest reputation first
        for (Text t : repToRecordMap.descendingMap().values()) {
          context.write(NullWritable.get(), t);
        }
      }
    }

Filtering Pattern 4. Distinct
- You have data that contains similar records and you want to find a unique set of values.

Mapper code

    public static class DistinctUserMapper
        extends Mapper<Object, Text, Text, NullWritable> {
      private Text outUserId = new Text();

      public void map(Object key, Text value, Context context)
          throws IOException, InterruptedException {
        Map<String, String> parsed = transformXmlToMap(value.toString());
        // Get the value for the UserId attribute
        String userId = parsed.get("UserId");
        // Set our output key to the user's id
        outUserId.set(userId);
        // Write the user's id with a null value
        context.write(outUserId, NullWritable.get());
      }
    }

Reducer code

    public static class DistinctUserReducer
        extends Reducer<Text, NullWritable, Text, NullWritable> {
      public void reduce(Text key, Iterable<NullWritable> values, Context context)
          throws IOException, InterruptedException {
        // Write the user's id with a null value
        context.write(key, NullWritable.get());
      }
    }

Combiner
- How can you improve the performance using a Combiner?
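One possible answer, sketched outside Hadoop (class and data names here are illustrative): since the distinct reducer simply emits each key once, the same logic can run as a combiner on each mapper's local output, shrinking the data shuffled to the reducers without changing the final result, because deduplication is idempotent. In the driver this would typically be wired up with job.setCombinerClass(...).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;

// Why the distinct reducer also works as a combiner: deduplicating each
// mapper's local output first reduces the shuffled data, and the reduce
// phase still produces the same distinct set.
public class DistinctCombinerDemo {
    // Combiner/reducer logic: emit each key once, preserving first-seen order.
    public static List<String> distinct(List<String> keys) {
        return new ArrayList<>(new LinkedHashSet<>(keys));
    }

    public static void main(String[] args) {
        List<String> mapper1 = Arrays.asList("u1", "u2", "u1", "u3");
        List<String> mapper2 = Arrays.asList("u2", "u2", "u4");

        // Without a combiner, 7 records are shuffled; with the reducer
        // reused as a combiner, only 3 + 2 locally distinct records are.
        List<String> shuffled = new ArrayList<>();
        shuffled.addAll(distinct(mapper1));
        shuffled.addAll(distinct(mapper2));

        // The reduce phase yields the same distinct set either way.
        System.out.println(distinct(shuffled));  // prints [u1, u2, u3, u4]
    }
}
```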

2/23/15 CS480 A2 Introduction to Big Data - Spring 2015 37 2/23/15 CS480 A2 Introduction to Big Data - Spring 2015 38 ing Pattern 4. Distinct You have data that contains similar records and you want to find a unique set of values Code public static class DistinctUser extends < Object, Text, Text, NullWritable > { private Text outuserid = new Text(); public void map( Object key, Text value, Context context) throws IOException, InterruptedException { Map < String, String > parsed = transformxmltomap( value.tostring()); // Get the value for the UserId attribute String userid = parsed.get("userid"); // Set our output key to the user's id outuserid.set( userid); // Write the user's id with a null value context.write( outuserid, NullWritable.get()); 2/23/15 CS480 A2 Introduction to Big Data - Spring 2015 39 2/23/15 CS480 A2 Introduction to Big Data - Spring 2015 40 Reducer code public static class DistinctUserReducer extends Reducer < Text, NullWritable, Text, NullWritable > { public void reduce( Text key, Iterable < NullWritable > values, Context context) throws IOException, InterruptedException { // Write the user's id with a null value context.write(key, NullWritable.get()); Combiner How can you improve the performance using a Combiner? Part 1-5 7