Data Mining Algorithms Parallelizing in Functional Programming Language for Execution in Cluster




Data Mining Algorithms Parallelizing in Functional Programming Language for Execution in Cluster Ivan Holod, PhD, Assoc. Prof., iiholod@mail.ru Aleksey Malov, PhD, alexeimal-2@yandex.ru Sergey Rodionov, PhD, sv-rodion@mail.ru

Motivation

Imperative programming languages (Java, C/C++, Fortran, and others) are not well suited to parallel execution. Existing extensions of imperative languages for parallel execution (Ada, High Performance Fortran, High Performance C++, and others) allow parallelizing only individual constructs such as loops. An alternative to imperative languages is functional languages (Lisp, Haskell, and others). We propose using a functional programming language to reduce the effort of parallelizing data mining algorithms.

Related works

Existing implementations of data mining algorithms in functional programming languages do not support parallel execution. Parallel data mining algorithms have been implemented in imperative programming languages:

Universal algorithm parallelizing: NIMBLE system, multicore map-reduce framework

Individual algorithm parallelizing:
Decision trees: MWK, SUBTREE, SLIQ, SPRINT, DP, PDT
Association rules: CCPD, PCCD, APM, HPA, SPA, PE, PME, PC, PMC
Clustering: DIB, Collaborative, Pkmeans, MAFIA, P-CLUSTER

Conception

An algorithm can be built from separate blocks if these blocks have the following properties: they are interchangeable; they have a unified interface; they can be executed in arbitrary order; they have the properties of a pure function. The properties of a pure function: it always produces the same result given the same argument(s), and computing the result causes no semantically observable side effects.
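These two properties can be illustrated with a small sketch (Python here for brevity; the talk itself uses Lisp, and the names below are purely illustrative):

```python
def pure_count(vectors, attr):
    """Pure: the result depends only on the arguments, and nothing
    outside the function is modified."""
    return sum(1 for v in vectors if attr in v)

_calls = 0

def impure_count(vectors, attr):
    """Impure: mutates module-level state, so repeated calls with the
    same arguments give different results."""
    global _calls
    _calls += 1
    return sum(1 for v in vectors if attr in v) + _calls
```

Only blocks of the first kind are safe to reorder or run in parallel.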

Church-Rosser theorem

When applying reduction rules to terms in the λ-calculus, the order in which the reductions are chosen makes no difference to the final result. Consequently, pure functions can be executed in any order, including in parallel. A functional block is a part of an algorithm that is a pure function with a unified interface.
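The practical consequence can be sketched as follows (a Python stand-in; the data and the thread pool are illustrative assumptions): because each block is pure, evaluating the sub-results left to right, right to left, or concurrently yields the same final result.

```python
from concurrent.futures import ThreadPoolExecutor

def block(x):
    """A pure functional block: no shared state, no side effects."""
    return x * x

data = [1, 2, 3, 4, 5]

sequential = sum(block(x) for x in data)            # left to right
reversed_  = sum(block(x) for x in reversed(data))  # right to left
with ThreadPoolExecutor(4) as ex:
    parallel = sum(ex.map(block, data))             # arbitrary interleaving
```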

Data mining algorithm as functional expression

(defun <algorithm_function_name> (d)
  (fb_n d (fb_n-1 d ( ... (fb_1 d nil) ... ))))

Each functional block has the signature:

(defun <function_name> (d m) ... )

where:

(defstruct d attr_list vectors_list)   ; the data set
(defstruct m state rules_list)         ; the mining model
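The nested call fb_n(d, fb_n-1(d, ..., fb_1(d, nil))) is a fold of the block list over the mining model m. A Python rendering of that composition (the toy blocks are placeholders, not blocks from the talk):

```python
from functools import reduce

def run_algorithm(blocks, d):
    """Thread the mining model m through the blocks fb_1 .. fb_n,
    starting from nil (None), as in the nested Lisp expression."""
    return reduce(lambda m, fb: fb(d, m), blocks, None)

# Two toy blocks: the first initialises the model, the second extends it.
fb1 = lambda d, m: {"n": len(d)}
fb2 = lambda d, m: {**m, "total": sum(d)}
```

For example, `run_algorithm([fb1, fb2], [1, 2, 3])` yields `{"n": 3, "total": 6}`.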

Conditional function

(defun condition_function (d m cf fb_t fb_f)
  (cond ((cf d m) (fb_t d m))
        (T (fb_f d m))))

Example:

(defun is_curr_attr_target (d m)
  (condition_function d m
    (eql (nth m-state-curr_attr d-attr_list) m-state-target_attr) ; cf function
    m                                                             ; fb_t function
    for_all_vectors_cycle))                                       ; fb_f function
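In Python terms the combinator is a one-liner (predicate and branch names below are toy examples, not from the talk):

```python
def condition_function(d, m, cf, fb_t, fb_f):
    """Dispatch to fb_t when the predicate cf holds on (d, m),
    otherwise to fb_f; both branches keep the (d, m) interface."""
    return fb_t(d, m) if cf(d, m) else fb_f(d, m)

# Toy predicate and branches (illustrative names):
is_empty = lambda d, m: not d
skip     = lambda d, m: "skip"
process  = lambda d, m: "process"
```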

Cycle function

(defun cycle_function (d m cf fb_init fb_pre fb_iter)
  (cycle d (fb_init d m) cf fb_pre fb_iter))

(defun cycle (d m cf fb_pre fb_iter)
  (cond ((cf d m) (cycle d (fb_iter d (fb_pre d m)) cf fb_pre fb_iter))
        (T (fb_iter d (fb_pre d m)))))

Example:

(defun for_all_attributes_cycle (d m)
  (cycle_function d m
    (eql nil d-attr_list)    ; cf function
    m                        ; fb_init function
    cdr d-attr_list          ; fb_pre function
    inc_count_attribute))    ; fb_iter function
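A Python rendering of the cycle combinator (iteration replaces the slide's tail recursion, the termination test is simplified to "stop when cf holds", and the example state, a (remaining_items, count) pair, is an assumption for illustration):

```python
def cycle_function(d, m, cf, fb_init, fb_pre, fb_iter):
    """Initialise the model, then repeatedly advance the state (fb_pre)
    and do one unit of work (fb_iter) until the predicate cf holds."""
    m = fb_init(d, m)
    while not cf(d, m):
        m = fb_iter(d, fb_pre(d, m))
    return m

# Counting the elements of d, with m = (remaining_items, count):
done    = lambda d, m: not m[0]           # cf: remaining list exhausted
init    = lambda d, m: (d, 0)             # fb_init: start of list, count 0
advance = lambda d, m: (m[0][1:], m[1])   # fb_pre: Lisp's cdr
count   = lambda d, m: (m[0], m[1] + 1)   # fb_iter: increment the count
```

For example, `cycle_function([3, 1, 4], None, done, init, advance, count)` ends with count 3.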

Function for parallelization of algorithm

(defun mpi_parallel_function (d m merge split fb)
  (mpi:mpi-init)                  ; initialize MPI
  (merge d (split d m fb))
  (mpi:mpi-finalize))

Example:

(defun mpi_vector_parall (d m)
  (mpi_parallel_function d m
    mpi_merge_vector_parall       ; the merge function
    mpi_split_vector_parall       ; the split function
    for_all_classes_cycle))       ; the fb function
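The same split/compute/merge shape can be sketched in Python, with a thread pool standing in for the MPI ranks (this is an illustrative sketch, not the authors' implementation; the strided split and the counting block are assumptions):

```python
from concurrent.futures import ThreadPoolExecutor
from functools import reduce

def parallel_function(d, m, merge, split, fb, workers=4):
    """Split the data, run the functional block fb on every part with
    the same initial model m, then merge the partial models."""
    parts = split(d, workers)
    with ThreadPoolExecutor(workers) as ex:
        partials = list(ex.map(lambda p: fb(p, m), parts))
    return reduce(merge, partials)

split = lambda d, n: [d[i::n] for i in range(n)]   # strided partition
fb    = lambda part, m: m + len(part)              # count vectors in one part
merge = lambda a, b: a + b                         # combine partial counts
```

For example, `parallel_function(list(range(10)), 0, merge, split, fb)` counts all 10 vectors across the partitions.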

Naive Bayes algorithm

for all attributes a
  if a is not the target attribute
    for all vectors w
      increment count of vectors
    end for all vectors
  end if
end for all attributes
for all classes c
  for all vectors w
    increment count of vectors for class of vector w
  end for all vectors
end for all classes

(defun NBAlgorithm (d)
  (for_all_classes_cycle d (for_all_attributes_cycle d nil)))

(defun for_all_classes_cycle (d m)
  (cycle_function d m (eql nil d-cls_list) m cdr d-cls_list for_all_vectors_cycle))

(defun for_all_attributes_cycle (d m)
  (cycle_function d m (eql nil d-attr_list) m cdr d-attr_list for_all_vectors_cycle))

(defun for_all_vectors_cycle (d m)
  (cycle_function d m (eql nil d-vectors_list) m cdr d-vectors_list inc_count_vec))
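The two counting passes of the slide can be condensed into one Python sketch over dict-shaped vectors (the data layout and function name are assumptions for illustration):

```python
from collections import Counter

def nb_counts(vectors, target):
    """Collect the statistics Naive Bayes needs: how often each class
    occurs, and how often each (class, attribute, value) triple occurs."""
    class_counts = Counter(v[target] for v in vectors)
    attr_counts = Counter(
        (v[target], a, val)
        for v in vectors
        for a, val in v.items()
        if a != target                 # skip the target attribute itself
    )
    return class_counts, attr_counts
```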

Performing parallel processing

Parallel processing of vectors:

(defun NBAlgorithmVectorsParallel (d)
  (mpi_vector_parall d (for_all_classes_cycle d (for_all_attributes_cycle d nil))))

(defun mpi_vector_parall (d m)
  (mpi_parallel_function d m
    mpi_merge_vector_parall
    mpi_split_vector_parall
    for_all_classes_cycle))

Parallel processing of attributes:

(defun NBAlgorithmAttrsParallel (d)
  (for_all_classes_cycle d (mpi_attr_parall d (for_all_attributes_cycle d nil))))

(defun mpi_attr_parall (d m)
  (mpi_parallel_function d m
    mpi_merge_attr_parall
    mpi_split_attr_parall
    for_all_attributes_cycle))

Experimental results (sec)

Data sets:

Parameter               W1      W5      W10      A1    A5    A10    C1     C5     C10
Number of vectors       10 000  50 000  100 000  100   100   100    1 000  1 000  1 000
Number of attributes    10      10      10       100   500   1 000  100    100    100
Avg. number of classes  5       5       5        100   100   100    100    500    1 000

Running times:

Algorithm                   Cores  W1    W5    W10   A1    A5     A10    C1    C5    C10
NBAlgorithm                 1      0.09  0.45  0.87  0.51  10.44  40.00  0.55  2.61  5.20
NBAlgorithmVectorsParallel  2      0.05  0.24  0.45  0.28  5.33   20.51  0.28  1.33  2.67
NBAlgorithmVectorsParallel  4      0.03  0.13  0.28  0.19  3.39   12.30  0.17  0.74  1.50
NBAlgorithmAttrsParallel    2      0.05  0.25  0.48  0.28  5.27   20.34  0.27  1.32  2.65
NBAlgorithmAttrsParallel    4      0.04  0.19  0.41  0.16  2.96   11.47  0.17  0.74  1.49
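From these timings the 4-core speedup over the single-core run can be computed directly (figures copied from the largest data sets, vectors-parallel version):

```python
# Single-core times and 4-core times for the W10 / A10 / C10 data sets,
# taken from the slide's table.
t1 = {"W10": 0.87, "A10": 40.00, "C10": 5.20}
t4 = {"W10": 0.28, "A10": 12.30, "C10": 1.50}

speedup = {k: round(t1[k] / t4[k], 2) for k in t1}
```

That is, on the largest data sets the 4-core versions run roughly 3.1x to 3.5x faster than the sequential algorithm.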

Conclusion

The proposed approach uses a functional programming language to implement data mining algorithms as functional expressions built from functional blocks. It makes it possible to create new algorithms, including parallel forms, from existing blocks with less effort, or to modify existing algorithms by replacing individual blocks. We implemented the Naïve Bayes algorithm and two of its parallel forms: parallelized by vectors and parallelized by attributes. An experimental comparison of these versions showed their effectiveness on data sets with different parameters.

Thank you for your attention Ivan Holod, iiholod@mail.ru Aleksey Malov, alexeimal-2@yandex.ru Sergey Rodionov, sv-rodion@mail.ru