MapReduce Approach to Collective Classification for Networks




Wojciech Indyk, Tomasz Kajdanowicz, Przemyslaw Kazienko, and Slawomir Plamowski

Wroclaw University of Technology, Wroclaw, Poland
Faculty of Computer Science and Management
{wojciech.indyk,tomasz.kajdanowicz,kazienko,slawomir.plamowski}@pwr.wroc.pl

Abstract. The paper considers the collective classification problem for big data sets using the MapReduce programming model. We introduce a proposal for an implementation of the label propagation algorithm in a network. The method was examined on a real dataset from the telecommunication domain. The results indicate that it can be used to classify nodes in order to propose new offerings or tariffs to customers.

Keywords: MapReduce, collective classification, classification in networks, label propagation

1 Introduction

Relations between objects in many different systems are commonly modelled by networks: hyperlinks connecting web pages, paper citations, conversations via e-mail, or social interactions in social portals. Network models are, in turn, a base for different types of processing and analyses. One of them is node classification (labelling of the nodes in the network). Node classification has a deep theoretical background; however, due to new phenomena appearing in artificial environments such as social networks on the Internet, the problem of node classification is recently being re-invented and re-implemented.

Nodes in networks may be classified either by inference based on the known profiles of these nodes (the regular concept of classification) or based on relational information derived from the network. The second approach utilizes information about connections between nodes (the structure of the network) and can be very useful in assigning labels to the nodes being classified. For example, it is very likely that a given web page x is related to sport (label sport) if x is linked by many other web pages about sport.
Hence, a form of collective classification should be provided, with simultaneous decision making about every node's label rather than classifying each node separately. Such an approach allows taking into account correlations between connected nodes, which deliver usually undervalued knowledge.

1 This is not the final version of this paper. You can find the final version on the publisher's web page.

Moreover, the arising trend of data explosion in transactional systems requires more sophisticated methods in order to analyse enormous amounts of data. There is a huge need to process big data in parallel, especially in complex analyses such as collective classification. A MapReduce approach to collective classification, able to perform processing on huge data, is proposed and examined in the paper. Section 2 covers related work, while Section 3 presents the proposal of a MapReduce approach to label propagation in the network. Section 4 contains a description of the experimental setup and the obtained results. The paper is concluded in Section 5.

2 Related Work

2.1 Collective classification

Collective classification problems may be solved using two main approaches: within-network and across-network inference. In within-network classification, training entities are connected directly to the entities whose labels are to be classified; in contrast, in across-network classification, models learnt from one network are applied to another, similar network [8]. Overall, networked data have several unique characteristics that simultaneously complicate and provide leverage to learning and classification. Among others, statistical relational learning (SRL) techniques were introduced, including probabilistic relational models, relational Markov networks, and probabilistic entity-relationship models [9, 10].

Two distinct types of classification in networks may be distinguished: based on a collection of local conditional classifiers, and based on the classification stated as one global objective function. The best known implementations of the first approach are iterative classification (ICA) and the Gibbs sampling algorithm (GS), whereas examples of the latter are loopy belief propagation (LBP) and mean-field relaxation labeling (MF) [11].

2.2 MapReduce programming model

MapReduce is a programming model for data processing derived from functional languages [3].
MapReduce breaks the processing into two consecutive phases: the map phase and the reduce phase. Usually, big data processing requires parallel execution, and MapReduce provides and manages such functionality. It starts by splitting the data into separate chunks. Each data chunk must meet the requirement of the <key, value> format, according to the input file configuration. Each data chunk is then processed by a Map function. Map takes an input pair and produces a set of <key, value> pairs. All values associated with the same key are grouped together and propagated to the Reduce phase. The Reduce function accepts a key and the set of values for that key, performs some processing of the entered values, and returns a new <key, value> pair to be saved as an output of the processing. Usually a reducer results in one <key, value> pair. Both the Map and the Reduce phase need to be specified and implemented by the user [1, 2]. The aforementioned process is presented in Fig. 1.
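The map-shuffle-reduce flow described above can be sketched in a few lines of single-process Python; this is only an illustration of the programming model, not of a distributed runtime such as Hadoop:

```python
from collections import defaultdict

def run_mapreduce(records, map_fn, reduce_fn):
    # Map phase: every input (key, value) record yields new (key, value) pairs.
    grouped = defaultdict(list)
    for key, value in records:
        for out_key, out_value in map_fn(key, value):
            # Shuffle: all values associated with the same key are grouped together.
            grouped[out_key].append(out_value)
    # Reduce phase: each key receives the full list of its values.
    return {k: reduce_fn(k, vs) for k, vs in grouped.items()}

# The canonical word-count example of the model.
def word_map(_, line):
    for word in line.split():
        yield word, 1

def word_reduce(word, counts):
    return sum(counts)

counts = run_mapreduce([(0, "a b a"), (1, "b c")], word_map, word_reduce)
print(counts)  # {'a': 2, 'b': 2, 'c': 1}
```

In a real framework the grouping step is performed by the distributed shuffle; here a dictionary stands in for it.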

Fig. 1. The MapReduce programming model

The MapReduce model is able to process very large datasets thanks to the initial split of data into small chunks. The most common open-source implementation of the MapReduce model is the Apache Hadoop library [4]. Apache Hadoop is a framework that allows distributed processing of large data sets across clusters of computers, offering local computation and storage. The architectural properties of Hadoop deliver high availability not through hardware but through failure handling in the application layer. A single MapReduce phase in Hadoop is named a Job. A Job consists of a map method, a reduce method, input data files, and a configuration.

3 Collective Classification by Means of Label Propagation Using MapReduce

The most common way to utilize the information of labelled and unlabelled data is to construct a graph from the data and perform a Markov random walk on it. The idea of the Markov random walk has been used multiple times [5-7] and involves defining a probability distribution over the labels for each node in the graph. In the case of labelled nodes the distribution reflects the true labels. The aim then is to recover this distribution for the unlabelled nodes. Such a Label Propagation approach allows performing classification based on relational data.

Let G(V, E, W) denote a graph with vertices V, edges E and an n × n edge weight matrix W. According to [6], in a weighted graph G(V, E, W) with n = |V| vertices, label propagation may be solved by the linear equations (1) and (2):

∀ i, j ∈ V:  Σ_{(i,j) ∈ E} w_ij F_i = Σ_{(i,j) ∈ E} w_ij F_j    (1)

∀ i ∈ V:  Σ_{c ∈ classes(i)} F_i(c) = 1    (2)

where F_i denotes the probability distribution of classes for node i. Assume the set of nodes V is partitioned into labelled and unlabelled vertices, V = V_L ∪ V_U. Let F_u denote the probability distribution over the labels associated

with vertex u ∈ V. For each node v ∈ V_L, for which F_v is known, a dummy node v′ is inserted such that w_vv′ = 1 and F_v′ = F_v. This operation is equivalent to the clamping discussed in [6]. Let V_D be the set of dummy nodes. Equations (1) and (2) can then be solved by the Iterative Label Propagation algorithm (Algorithm 1).

Algorithm 1 The pseudo-code of the Iterative Label Propagation algorithm.
1: repeat
2:   for all v ∈ (V ∪ V_D) do
3:     F_v = Σ_{(u,v) ∈ E} w_uv F_u / Σ_{(u,v) ∈ E} w_uv
4:   end for
5: until convergence

As can be observed, each iteration of Iterative Label Propagation performs certain operations on every node. These operations are calculated based on local information only, namely the node's neighbourhood. This fact can be exploited in a parallel version of the algorithm (Algorithm 2).

Algorithm 2 The pseudo-code of the MapReduce approach to the Iterative Label Propagation algorithm.
1: map <node; adjacencyList>
2:   for all n ∈ adjacencyList do
3:     propagate <n; (node.label, n.weight)>
4:   end for
1: reduce <n, list(node.label, weight)>
2:   propagate <n, Σ node.label · weight / Σ weight>

The MapReduce version of the Iterative Label Propagation algorithm consists of two phases. The Map phase takes all labelled and dummy nodes and propagates their labels to all nodes in their adjacency lists, taking into account the edge weights between nodes. The Reduce phase calculates a new label for each node with at least one labelled neighbour, based on the list of labelled neighbours and the relation strength (weight) between nodes. The final result, namely a new label for a particular node, is computed as the weighted sum of the label probabilities from the neighbourhood.

4 Experiments and Results

For the purpose of the experimental setup, a telecommunication network was built over a 3-month history of phone calls from a leading European telecommunication
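A minimal in-memory sketch of Algorithm 2 may clarify the two phases; the data layout (label distributions as dictionaries, adjacency lists as weighted pairs) and all names are our illustrative assumptions, not the paper's Hadoop implementation:

```python
from collections import defaultdict

def lp_map(node, record):
    # record = (label distribution {class: prob}, adjacency list [(neighbour, weight)]).
    labels, adjacency = record
    for neighbour, weight in adjacency:
        # Propagate this node's labels to every neighbour, tagged with the edge weight.
        yield neighbour, (labels, weight)

def lp_reduce(node, contributions):
    # Weighted average of the neighbours' label distributions (Algorithm 1, line 3).
    total_w = sum(w for _, w in contributions)
    new_dist = defaultdict(float)
    for labels, w in contributions:
        for cls, prob in labels.items():
            new_dist[cls] += w * prob / total_w
    return dict(new_dist)

# One propagation step on a toy graph: two labelled nodes point at node "c".
labelled = {"a": ({"sport": 1.0}, [("c", 1.0)]),
            "b": ({"news": 1.0}, [("c", 3.0)])}
messages = defaultdict(list)
for node, record in labelled.items():
    for target, payload in lp_map(node, record):
        messages[target].append(payload)
result = {n: lp_reduce(n, vs) for n, vs in messages.items()}
print(result)  # {'c': {'sport': 0.25, 'news': 0.75}}
```

Node c receives label "news" over a three-times stronger edge than "sport", so its new distribution leans accordingly; in the real system the grouping of messages is done by the MapReduce shuffle.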

company. The original dataset consisted of about 500,000,000 phone calls and more than 16 million unique users. All communication facts (phone calls) were performed using one of 38 tariffs, belonging to 4 types. In order to limit the amount of data and simplify the task to meet the limitations of the hardware environment, only phone calls of two types were extracted and utilized in the experiments. Users were labelled with the class conditional probability of tariffs: the sum of the durations of outgoing phone calls in a particular tariff was divided by the summarized duration of all outgoing calls. Eventually, the final dataset consisted of 38,585 users. Afterwards, the users' network was calculated, where the connection strength between particular users was computed according to equation (3):

e_ij = 2 d_ij / (d_i + d_j)    (3)

where d_ij denotes the summarized duration of calls between users i and j, d_i the summarized duration of user i's outgoing calls, and d_j the summarized duration of user j's incoming calls. The obtained network was composed of 55,297 weighted edges between the aforementioned users.

The goal of the experiment was to predict the class conditional probability of tariff for unlabelled users. The initial amount of labelled nodes (training set) for collective prediction was established at 37% of users, chosen randomly according to a uniform distribution. The rest of the nodes would potentially belong to the test set; however, due to the properties of the examined algorithm, some nodes were unreachable and thus could not have a label assigned. This means that some nodes did not possess incoming edges, so the algorithm was not able to propagate the probability of labels to them. Eventually, the final test set was composed of only 2% of users, distributed over the whole network. Nevertheless, the remaining nodes were utilized to keep the structure of the network and the propagation of labels; please see Fig. 2. The collective classification algorithm was implemented in the MapReduce programming model.
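The connection strength of equation (3) is a harmonic-mean-style normalization of the pairwise call duration; a small sketch (the function name is ours):

```python
def edge_strength(d_ij, d_i, d_j):
    # Equation (3): e_ij = 2 * d_ij / (d_i + d_j), where d_ij is the summarized
    # call duration between users i and j, d_i the total outgoing duration of i,
    # and d_j the total incoming duration of j.
    return 2.0 * d_ij / (d_i + d_j)

# 30 min of mutual calls against 100 min outgoing and 20 min incoming totals.
print(edge_strength(30.0, 100.0, 20.0))  # 0.5
```

The weight grows when the pair's mutual calls dominate both users' overall call volume, so heavily connected pairs get edges close to 1.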
The implemented algorithm consists of six Jobs, each accomplishing the map and reduce phases. A detailed description of the Jobs is presented in Table 1. The convergence criterion of the algorithm was controlled by the change ɛ of the conditional probability for each node: the algorithm kept iterating as long as this change was greater than ɛ. The experiment was organised in order to examine the computational time devoted to each map-reduce step as well as the number of iterations of the algorithm. The time was measured for three distinct values ɛ ∈ {0.01, 0.001, 0.0001}. The final assessment of the implemented algorithm was measured using the mean square error between the predicted label probability and the known (true) label probability. The Mean Square Error (MSE) equals 0.1522 for all three ɛ values; we therefore did not observe significant changes in the performance of the algorithm for different values of the convergence criterion ɛ. However, as presented in Table 2 and Fig. 3, the value of the convergence criterion ɛ has an impact on the number of executions of the implemented jobs: the less restrictive it is, the fewer job executions are performed.
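The ɛ-based stopping rule can be sketched as follows, mirroring what the comparison jobs compute (the function names and data layout are our assumptions, not the paper's Job code):

```python
def max_label_change(prev, curr):
    # Largest absolute change of any class probability across all nodes:
    # the role played by the per-node comparison followed by the global maximum.
    return max(abs(curr[node][cls] - prev[node][cls])
               for node in curr for cls in curr[node])

def keep_iterating(prev, curr, eps):
    # The algorithm iterates as long as the maximal change exceeds eps.
    return max_label_change(prev, curr) > eps

prev = {"u": {"tariffA": 0.50, "tariffB": 0.50}}
curr = {"u": {"tariffA": 0.55, "tariffB": 0.45}}
print(keep_iterating(prev, curr, 0.01))  # True
print(keep_iterating(prev, curr, 0.1))   # False
```

A smaller ɛ demands smaller per-iteration changes before stopping, which is why the stricter settings in the experiment trigger more job executions.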

Fig. 2. Types of nodes that were utilized in the experiments: labelled and unlabelled, training and testing ones, used only for label propagation, and omitted

Fig. 3. Execution time in [s] of the map-reduce jobs for distinct values of the convergence criterion ɛ

Table 1. MapReduce jobs implemented in the algorithm.

adjacencylist           - takes an edge list as input and returns an adjacency list for all nodes
dummyadjlistandlabels   - creates a list of dummy nodes with labels according to the algorithm [9] and updates the adjacency list with the newly created edges from dummy nodes
mergeadjlistandlabel    - merges the list of node labels with the adjacency list, resulting in the collective classification input
collectiveclassification - processes the collective classification data according to the algorithm and results in a new label list
singlelabelscomparison  - outputs the absolute difference of the class conditional probabilities of labels between the current and the previous iteration
alllabelcomparison      - returns the maximal difference of the input list (absolute difference of class conditional probability)

Table 2. Number of executions and execution time in [s] of the map-reduce jobs for distinct values of the convergence criterion ɛ

Job name                 | ɛ = 0.01        | ɛ = 0.001       | ɛ = 0.0001
                         | No. exec.  Time | No. exec.  Time | No. exec.  Time
adjacencylist            | 1          19   | 1          19   | 1          19
dummyadjlistandlabels    | 1          17   | 1          17   | 1          17
mergeadjlistandlabel     | 7          134  | 10         192  | 13         245
collectiveclassification | 7          117  | 10         164  | 13         216
singlelabelscomparison   | 6          101  | 9          152  | 12         223
alllabelcomparison       | 6          96   | 9          145  | 12         210

The results obtained during the experiments (MSE, execution time) indicate that the proposed MapReduce implementation of the Iterative Label Propagation algorithm correctly performs parallel computation and yields satisfactory prediction results. Moreover, it is able to accomplish prediction on a big dataset, which would be impossible to achieve in reasonable time with a single-threaded version of the algorithm.

5 Conclusions

The problem of collective classification using the MapReduce programming model was considered in the paper. We introduced a proposal for an implementation of the Iterative Label Propagation algorithm in the network. Thanks to that, the method can perform complicated calculations on big data sets.

The proposed method was examined on a real dataset in the telecommunication domain. The results indicated that it can be used to classify nodes in order to propose new offerings or tariffs to customers. Further experimentation will consider a comparison of the presented method with other approaches. Moreover, further studies with much bigger data will be conducted.

Acknowledgement

This work was supported by The Polish National Center of Science (research projects 2011-2012 and 2011-2014) and a Fellowship co-financed by The European Union within The European Social Fund.

References

1. Ekanayake, J., Pallickara, S., Fox, G.: MapReduce for Data Intensive Scientific Analyses. In: Proceedings of the 2008 Fourth IEEE International Conference on eScience, 2008.
2. Dean, J., Ghemawat, S.: MapReduce: simplified data processing on large clusters. In: Proceedings of the 6th Symposium on Operating Systems Design & Implementation, Berkeley, USA, USENIX Association, 2004.
3. White, T.: Hadoop: The Definitive Guide. O'Reilly, 2009.
4. Hadoop official web site: hadoop.apache.org, accessed 05.11.2011.
5. Szummer, M., Jaakkola, T.: Clustering and efficient use of unlabeled examples. In: Proceedings of Neural Information Processing Systems (NIPS), 2001.
6. Zhu, X., Ghahramani, Z., Lafferty, J.: Semi-supervised learning using Gaussian fields and harmonic functions. In: Proceedings of the International Conference on Machine Learning (ICML), 2003.
7. Azran, A.: The rendezvous algorithm: Multiclass semi-supervised learning with Markov random walks. In: Proceedings of the International Conference on Machine Learning (ICML), 2007.
8. Jensen, D., Neville, J., Gallagher, B.: Why collective inference improves relational classification. In: Proceedings of the 10th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 593-598, 2004.
9. Desrosiers, C., Karypis, G.: Within-network classification using local structure similarity.
Lecture Notes in Computer Science 5781, pp. 260-275, 2009.
10. Knobbe, A., de Haas, M., Siebes, A.: Propositionalisation and aggregates. In: Proceedings of the Fifth European Conference on Principles of Data Mining and Knowledge Discovery, pp. 277-288, 2001.
11. Kramer, S., Lavrac, N., Flach, P.: Propositionalization approaches to relational data mining. In: Dzeroski, S. (ed.) Relational Data Mining, Springer-Verlag, pp. 262-286, 2001.