Manual for BEAR: Big Data Ensemble of Adaptations for Regression, Version 1.0

Vahid Jalali, David Leake
August 9, 2015

Abstract

BEAR is a case-based regression learner tailored for big data processing. It works by applying an ensemble of adaptations to adjust instance-based estimations, where the adaptations are generated by comparing pairs of cases in the local neighborhood of the input query. Compared to other case-based regression systems, BEAR specifically speeds up source case retrieval and adaptation generation by using locality sensitive hashing for case retrieval, enabling it to be applied to case bases with several million cases on a relatively small cluster. This document reviews BEAR's foundation and algorithm and provides step-by-step instructions for building and using BEAR in practice.

© 2015 Indiana University, Bloomington, Indiana. This manual is licensed under the GNU General Public License version 3. More information about this license can be found at http://www.gnu.org/licenses/gpl-3.0-standalone.html

1 Introduction

This document provides a brief overview of BEAR (Big data Ensembles of Adaptations for Regression) [1], a method that applies big data techniques to scale up automatic knowledge generation with EAR4 [2], a lazy learning method that works by applying an ensemble of adaptations for regression. It also provides step-by-step instructions for building and using the source code of the BEAR method. BEAR builds on top of two other methods, EAR4 and locality sensitive hashing (LSH) [3]. BEAR's source code uses the implementation of EAR4's Weka plugin [4] and Apache DataFu [5] to support case-based regression and locality sensitive hashing, respectively. The class from Apache DataFu used in BEAR's implementation is L2PStableHash, with a 2-stable distribution and default parameter settings.

2 Background

2.1 EAR4

EAR4 is a case-based regression method that works by applying an ensemble of adaptations to adjust instance-based predictions. The generic process of case-based regression is illustrated in Fig. 1. Given a new problem, case-based regression generates a solution by retrieving a set of cases for similar problems, adjusting the solution values of the retrieved cases (which we will call source cases), and then combining the adjusted values to generate the final predicted value. For more information about case-based regression and EAR4, refer to EAR4's manual [4].

Figure 1: Illustration of the generic case-based regression process [6]. An adaptation A adapts each (case, query) pair to adjust the case's solution to the query Q; the final value is a combination of the adapted values.
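The retrieve/adapt/combine process of Fig. 1 can be sketched as follows. This is a minimal illustration, not EAR4's actual algorithm or BEAR's code: the tuple-based case representation, the nearest-difference rule selection, and the unweighted average are all simplifying assumptions.

```python
# Minimal sketch of case-based regression with adaptation rules built
# by comparing pairs of cases (the case difference heuristic). A case
# is (problem, solution), with the problem a tuple of numeric features.
# All names here are illustrative, not EAR4's implementation.
from math import dist

def generate_rules(neighbors):
    """Each ordered pair of cases yields a rule mapping a problem-space
    difference to a solution-space delta."""
    rules = []
    for pa, sa in neighbors:
        for pb, sb in neighbors:
            if pa != pb:
                diff = tuple(x - y for x, y in zip(pa, pb))
                rules.append((diff, sa - sb))
    return rules

def predict(query, case_base, k=3):
    # Retrieve: the k source cases whose problems are nearest the query.
    neighbors = sorted(case_base, key=lambda c: dist(c[0], query))[:k]
    rules = generate_rules(neighbors)
    adapted = []
    for problem, solution in neighbors:
        if not rules:                  # a single neighbor yields no pairs
            adapted.append(solution)
            continue
        # Adapt: apply the rule whose problem difference best matches
        # the difference between the query and this source case.
        need = tuple(q - p for q, p in zip(query, problem))
        _, delta = min(rules, key=lambda r: dist(r[0], need))
        adapted.append(solution + delta)
    # Combine: average the adapted values into the final estimate.
    return sum(adapted) / len(adapted)
```

For example, on a case base whose solutions are all equal, every rule delta is zero and the prediction equals that shared value.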

2.2 Locality Sensitive Hashing

Locality Sensitive Hashing (LSH) is a hashing method for decreasing the retrieval time of similar cases in d-dimensional Euclidean space. LSH achieves this goal by applying a set of hash functions for which the probability of collision is higher for similar cases. Therefore, in contrast to nearest neighbor search, LSH does not require comparing a case with every other case to find its nearest neighbors, which lowers the time complexity of the retrieval process. BEAR builds on an implementation of LSH for the MapReduce framework. MapReduce is a framework that enables parallel execution of tasks by distributing them among different nodes in a cluster. More specifically, the LSH method used here relies on Apache Hadoop, a popular open source implementation of MapReduce in both industrial and academic communities.

3 A Quick Sketch of BEAR's Approach

BEAR is an implementation of case-based regression for big data. It uses LSH to retrieve the nearest neighbors of the input query and adjusts the values of those neighbors by applying an ensemble of adaptation rules. Adaptation rules and nearest neighbors are both extracted from the local neighborhood of the input query (determined by LSH). Adaptation rules are generated by applying the case difference heuristic, which generates rules based on comparing pairs of cases. The overall process flow of BEAR is depicted in Fig. 2. The query is represented by a square and the cases in the case base by circles. As can be seen, the query and cases are distributed among different reducers by applying LSH. Next, based on the rules from the local neighborhood of the query, a set of adaptations is generated, which is used to build the final estimation. For full details about BEAR's underlying process, a discussion of related work, and comparative evaluations of BEAR's performance, we refer the reader to the research publications on BEAR [1].
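To make the hashing idea concrete, the following minimal Python sketch shows 2-stable LSH in the spirit of, but not taken from, DataFu's L2PStableHash: each hash function projects a point onto a random Gaussian direction and quantizes the projection, so nearby points fall into the same bucket with high probability. The bucket width w and the seeding scheme are assumptions for illustration.

```python
# Illustrative 2-stable LSH bucketing. Nearby points tend to collide
# in the same bucket, so a query need only be compared with the cases
# in its own bucket. This mimics the idea behind DataFu's
# L2PStableHash; it is not that class's actual code.
import random
from collections import defaultdict

def make_hash(dim, w, seed):
    rng = random.Random(seed)
    a = [rng.gauss(0.0, 1.0) for _ in range(dim)]  # 2-stable projection
    b = rng.uniform(0.0, w)                        # random offset
    def h(point):
        proj = sum(ai * xi for ai, xi in zip(a, point))
        return int((proj + b) // w)                # bucket index
    return h

def bucket_cases(cases, h):
    """Group (problem, solution) cases by hash value, analogous to how
    BEAR routes cases and queries to reducers."""
    buckets = defaultdict(list)
    for problem, solution in cases:
        buckets[h(problem)].append((problem, solution))
    return buckets
```

In practice several such functions are concatenated into a signature, trading collision probability against bucket size; the sketch above shows a single function only.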
4 How to Build and Use BEAR

4.1 Downloading Resources and Building the EAR4 UDF

BEAR's code currently works for domains with numeric input features and target values, and can be downloaded from the BEAR project's home page to your local machine. After uncompressing the archive, you should see the file structure shown in Fig. 3.

1. The data folder contains sample test (mpg2.data) and training (mpg1.data) data for the MPG domain.
2. The pigs folder contains the knn and BEAR implementations in the Pig language.
3. The shell folder contains the commands for compiling the Java UDF and running the pig scripts.

Figure 2: BEAR's process flow

4. The src folder contains the implementation of EAR4's Weka plugin as a Pig UDF in the Java language.
5. The jars folder contains the jar files required for running the pig script and compiling EAR4's source code, in case you want to compile the source code yourself. I have also included weka.jar, datafu-pig-1.3.0-SNAPSHOT.jar, and pig-0.15.0-SNAPSHOT-core-h1.jar for your convenience; alternatively, you can download these jar files from their respective project sites.

If you want to modify the implementation of EAR4 for your application, you can follow the steps below. Otherwise, you can use the existing myudfs.jar in the package and skip the remainder of this subsection. To build myudfs.jar after modifying EAR4.java in the src folder, compile the source by running:

Listing 1: Compiling EAR4
#!/bin/bash
javac -cp /path/to/weka.jar:/path/to/pig-0.15.0-SNAPSHOT-core-h1.jar:/path/to/hadoop-2.6.0/share/hadoop/common/* /path/to/myudfs/EAR4.java

Next, build the jar file by running the following command:

Figure 3: BEAR's package file structure

Listing 2: Building the myudfs jar
#!/bin/bash
jar cf myudfs.jar /path/to/myudfs

4.2 Running BEAR

BEAR's implementation is adapted from the knn implementation in DataFu, written in the Pig language. The pig script uses EAR4 as a UDF to apply an ensemble of adaptations for adjusting the nearest neighbors' values. If you only want to apply LSH to knn, you can refer to the L2PStableHash class source code (L2PStableHash.java). For your convenience, I have included a version of knn based on L2PStableHash's source code in BEAR's package. To run BEAR in local mode, follow these steps:

Listing 3: Running BEAR in local mode
#!/bin/bash
export PIG_CLASSPATH=$PIG_CLASSPATH:/path/to/weka.jar
pig -x local -f /path/to/bear.pig -param train=/path/to/mpg1.data -param test=/path/to/mpg2.data

If you want to run the script on a cluster, remove -x local and use HDFS paths for the train and test data.

References

[1] Jalali, V., Leake, D.: CBR meets big data: A case study of large-scale adaptation rule generation.
[2] Jalali, V., Leake, D.: Extending case adaptation with automatically-generated ensembles of adaptation rules. In: Case-Based Reasoning Research and Development, ICCBR 2013, Berlin, Springer (2013) 188-202
[3] Indyk, P., Motwani, R.: Approximate nearest neighbors: Towards removing the curse of dimensionality. In: Proceedings of the Thirtieth Annual ACM Symposium on Theory of Computing, STOC '98, New York, NY, USA, ACM (1998) 604-613
[4] Jalali, V., Leake, D.: Manual for EAR4: Case-based regression with ensembles of adaptations, version 1.0 (2015)
[5] Hayes, M., Shah, S.: Hourglass: A library for incremental processing on Hadoop. In: Big Data, 2013 IEEE International Conference on (Oct 2013) 742-752
[6] Jalali, V., Leake, D.: Enhancing case-based regression with automatically-generated ensembles of adaptations. Journal of Intelligent Information Systems (2015). In press.