MapReduce Approach to Collective Classification for Networks

Wojciech Indyk, Tomasz Kajdanowicz, Przemyslaw Kazienko, and Slawomir Plamowski

Wroclaw University of Technology, Wroclaw, Poland
Faculty of Computer Science and Management
{wojciech.indyk,tomasz.kajdanowicz,kazienko,slawomir.plamowski}@pwr.wroc.pl

Abstract. The collective classification problem for big data sets using the MapReduce programming model is considered in this paper. We introduce a proposal for the implementation of a label propagation algorithm in the network. The method was examined on a real dataset from the telecommunication domain. The results indicate that it can be used to classify nodes in order to propose new offerings or tariffs to customers.¹

Keywords: MapReduce, collective classification, classification in networks, label propagation

1 Introduction

Relations between objects in many various systems are commonly modelled by networks: for instance, hyperlinks connecting web pages, paper citations, conversations via email, or social interactions in social portals. Network models further serve as a basis for different types of processing and analyses. One of them is node classification (labelling of the nodes in the network). Node classification has a deep theoretical background; however, due to new phenomena appearing in artificial environments such as social networks on the Internet, the problem of node classification has recently been re-invented and re-implemented.

Nodes may be classified in networks either by inference based on the known profiles of these nodes (the regular concept of classification) or based on relational information derived from the network. The second approach utilizes information about connections between nodes (the structure of the network) and can be very useful in assigning labels to the nodes being classified. For example, it is very likely that a given web page x is related to sport (label sport) if x is linked by many other web pages about sport. Hence, a form of collective classification should be provided, with simultaneous decision making on every node's label rather than classifying each node separately. Such an approach allows taking into account correlations between connected nodes, which deliver usually undervalued knowledge.

¹ This is not the final version of this paper. You can find the final version on the publisher's web page.
Moreover, the growing trend of data explosion in transactional systems requires more sophisticated methods to analyse enormous amounts of data. There is a huge need to process big data in parallel, especially in complex analyses like collective classification. A MapReduce approach to collective classification, able to perform processing on huge data, is proposed and examined in this paper.

Section 2 covers related work, while Section 3 presents a proposal of a MapReduce approach to label propagation in the network. Section 4 contains a description of the experimental setup and the obtained results. The paper is concluded in Section 5.

2 Related Work

2.1 Collective classification

Collective classification problems may be solved using two main approaches: within-network and across-network inference. Within-network classification, for which training entities are connected directly to the entities whose labels are to be classified, stays in contrast to across-network classification, where models learnt from one network are applied to another, similar network [8]. Overall, networked data have several unique characteristics that simultaneously complicate and provide leverage to learning and classification. Among others, statistical relational learning (SRL) techniques were introduced, including probabilistic relational models, relational Markov networks, and probabilistic entity-relationship models [9, 10].

Two distinct types of classification in networks may be distinguished: those based on a collection of local conditional classifiers and those based on classification stated as one global objective function. The best known implementations of the first approach are iterative classification (ICA) and the Gibbs sampling algorithm (GS), whereas examples of the latter are loopy belief propagation (LBP) and mean-field relaxation labeling (MF) [11].

2.2 MapReduce programming model

MapReduce is a programming model for data processing derived from functional languages [3]. MapReduce breaks the processing into two consecutive phases: the map phase and the reduce phase. Usually, big data processing requires parallel execution, and MapReduce provides and manages such functions. It starts with splitting the data into separate chunks. Each data chunk must conform to the <key, value> format, according to the input file configuration. Then each data chunk is processed by a Map function. Map takes an input pair and produces a set of <key, value> pairs. All values associated with the same key are grouped together and propagated to the Reduce phase. The Reduce function accepts a key and a set of values for that key, performs some processing on these values, and returns a new <key, value> pair to be saved as the output of the processing. Usually a reducer emits a single <key, value> pair. Both the Map and Reduce phases need to be specified and implemented by the user [1, 2]. The aforementioned process is presented in Fig. 1 and sketched in the example below.
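To make the data flow concrete, the following minimal word-count sketch simulates the split, map, shuffle, and reduce steps sequentially in Python. It illustrates the programming model only; it is not tied to Hadoop or to the implementation used later in this paper, and all names in it are illustrative.

from collections import defaultdict

# Map: take an input pair and emit a set of <key, value> pairs.
def map_phase(doc_id, text):
    for word in text.split():
        yield (word, 1)

# Reduce: receive a key and all values grouped under that key,
# and emit a new <key, value> pair.
def reduce_phase(word, counts):
    yield (word, sum(counts))

# A sequential simulation of the split / map / shuffle / reduce pipeline.
def run_mapreduce(inputs, mapper, reducer):
    groups = defaultdict(list)
    for key, value in inputs:
        for k, v in mapper(key, value):   # map phase
            groups[k].append(v)           # shuffle: group values by key
    output = {}
    for k, vs in groups.items():          # reduce phase
        for out_k, out_v in reducer(k, vs):
            output[out_k] = out_v
    return output

print(run_mapreduce([(1, "a b a"), (2, "b c")], map_phase, reduce_phase))
# {'a': 2, 'b': 2, 'c': 1}

In a real framework, the groups produced in the shuffle step would be partitioned across machines, so that mappers and reducers run in parallel on separate chunks of data.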
Fig. 1. The MapReduce programming model

MapReduce is able to process very large datasets thanks to the initial split of data into small chunks. The most common open-source implementation of the MapReduce model is the Apache Hadoop library [4]. Apache Hadoop is a framework that allows distributed processing of large data sets across clusters of computers, offering local computation and storage. The architectural properties of Hadoop deliver high availability not through hardware but through failure handling at the application layer. A single MapReduce phase in Hadoop is named a Job. A Job consists of a map method, a reduce method, input data files, and a configuration.

3 Collective Classification by Means of Label Propagation Using MapReduce

The most common way to utilize the information of labelled and unlabelled data is to construct a graph from the data and perform a Markov random walk on it. The idea of a Markov random walk has been used multiple times [5-7] and involves defining a probability distribution over the labels for each node in the graph. In the case of labelled nodes, the distribution reflects the true labels. The aim then is to recover this distribution for the unlabelled nodes. Such a label propagation approach allows performing classification based on relational data.

Let G(V, E, W) denote a graph with vertices V, edges E, and an n × n edge weight matrix W. According to [6], in a weighted graph G(V, E, W) with n = |V| vertices, label propagation may be solved by the linear equations (1) and (2):

\forall i \in V: \quad \sum_{(i,j) \in E} w_{ij} F_i = \sum_{(i,j) \in E} w_{ij} F_j \qquad (1)

\forall i \in V: \quad \sum_{c \in classes(i)} F_i(c) = 1 \qquad (2)

where F_i denotes the probability density of classes for node i. Let us assume the set of nodes V is partitioned into labelled V_L and unlabelled V_U vertices, V = V_L ∪ V_U. Let F_u denote the probability distribution over the labels associated
with vertex u ∈ V. For each node v ∈ V_L, for which F_v is known, a dummy node v′ is inserted such that w_{vv′} = 1 and F_{v′} = F_v. This operation is equivalent to the clamping discussed in [6]. Let V_D be the set of dummy nodes. Then the solution of equations (1) and (2) can be obtained with the Iterative Label Propagation algorithm (Algorithm 1).

Algorithm 1 The pseudo code of the Iterative Label Propagation algorithm.
1: repeat
2:   for all v ∈ (V ∖ V_D) do
3:     F_v = \frac{\sum_{(u,v) \in E} w_{uv} F_u}{\sum_{(u,v) \in E} w_{uv}}
4:   end for
5: until convergence

As can be observed, at each iteration of Iterative Label Propagation certain operations are performed on each of the nodes. These operations are calculated based on local information only, namely the node's neighbourhood. This fact can be exploited in a parallel version of the algorithm, see Algorithm 2.

Algorithm 2 The pseudo code of the MapReduce approach to the Iterative Label Propagation algorithm.
1: map <node; adjacencylist>
2:   for all n ∈ adjacencylist do
3:     propagate <n; (node.label, n.weight)>
4:   end for

1: reduce <n; list(node.label, weight)>
2:   propagate <n; \frac{\sum node.label \cdot weight}{\sum weight}>

The MapReduce version of the Iterative Label Propagation algorithm consists of two phases. The Map phase gets all labelled and dummy nodes and propagates their labels to all nodes in their adjacency lists, taking into account the edge weights between nodes. The Reduce phase calculates a new label for each node with at least one labelled neighbour: the reducers compute the new label for a node based on the list of its labelled neighbours and the relation strength between the nodes (weight). The final result, namely a new label for a particular node, is computed as a weighted sum of label probabilities from the neighbourhood.
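A compact way to see how Algorithm 2 maps onto code is the following single-iteration Python sketch. It is a minimal sequential simulation of the map, shuffle, and reduce steps, not the authors' Hadoop implementation; for brevity a label is represented by a single class probability rather than a full distribution, and the identifiers (map_phase, reduce_phase, one_iteration) are illustrative.

from collections import defaultdict

# graph: node -> list of (neighbour, weight) pairs.
# labels: node -> class probability, known only for labelled
# (dummy-backed) nodes.

def map_phase(node, adjacency, labels):
    # Propagate this node's label to every neighbour, with the edge weight.
    if node in labels:
        for neighbour, weight in adjacency:
            yield (neighbour, (labels[node], weight))

def reduce_phase(node, contributions):
    # New label = weighted average over the labelled neighbours,
    # i.e. sum(label * weight) / sum(weight) as in Algorithm 2.
    total_weight = sum(w for _, w in contributions)
    new_label = sum(lbl * w for lbl, w in contributions) / total_weight
    return (node, new_label)

def one_iteration(graph, labels):
    grouped = defaultdict(list)
    for node, adjacency in graph.items():
        for key, value in map_phase(node, adjacency, labels):
            grouped[key].append(value)    # shuffle: group by target node
    return dict(reduce_phase(n, c) for n, c in grouped.items())

graph = {"a": [("b", 1.0)], "b": [("a", 1.0), ("c", 0.5)], "c": [("b", 0.5)]}
labels = {"a": 1.0}                       # "a" is a labelled node
print(one_iteration(graph, labels))       # {'b': 1.0}

Iterating this step until the labels stop changing (the convergence test of Algorithm 1) yields the collective classification; in the distributed setting the shuffle is what routes each node's contributions to the reducer responsible for it.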
4 Experiments and Results

For the purpose of the experimental setup, a telecommunication network was built over a 3-month history of phone calls from a leading European telecommunication company. The original dataset consisted of about 500,000,000 phone calls and more than 16 million unique users. All communication facts (phone calls) were performed using one of 38 tariffs, grouped into 4 types. In order to limit the amount of data and simplify the task to meet the limitations of the hardware environment, only two types of phone calls were extracted and utilized in the experiments. Users were labelled with the class conditional probability of tariffs, namely the sum of outgoing phone call durations in a particular tariff divided by the summarized duration of all outgoing calls. Eventually, the final dataset consisted of 38,585 users. Afterwards, the users' network was calculated, where the connection strength between particular users was computed according to equation (3):

e_{ij} = \frac{2 d_{ij}}{d_i + d_j} \qquad (3)

where d_{ij} denotes the summarized duration of calls between users i and j, d_i the summarized duration of user i's outgoing calls, and d_j the summarized duration of user j's incoming calls. The obtained network was composed of 55,297 weighted edges between the aforementioned users.

The goal of the experiment was to predict the class conditional probability of the tariff for unlabelled users. The initial amount of labelled nodes (the training set) for collective prediction was set to 37% of the users, chosen randomly according to a uniform distribution. The rest of the nodes could potentially belong to the test set; however, due to a property of the examined algorithm, some of the nodes could not be reached and thus could not have a label assigned. This means that some nodes did not possess incoming edges, so the algorithm was not able to propagate the label probabilities to them. Eventually, the final test set was composed of only 2% of the users, distributed over the whole network. Nevertheless, the remaining nodes were utilized to keep the structure of the network and the propagation of labels, see Fig. 2.

The collective classification algorithm was implemented in the MapReduce programming model. It consists of six Jobs, each accomplishing map and reduce phases. A detailed description of the Jobs is presented in Table 1. The convergence criterion of the algorithm was controlled by the change ε of the conditional probability for each node: the algorithm iterated as long as this change was greater than ε. The experiment was organised in order to examine the computational time devoted to each of the map-reduce steps as well as the number of iterations of the algorithm. The time was measured for three distinct values of ε = {0.01, 0.001, 0.0001}.

The final assessment of the implemented algorithm was measured using the mean square error between the predicted label probability and the known (true) label probability. The Mean Square Error (MSE) equals 0.1522 for all three ε values. Therefore, we did not observe significant changes in the performance of the algorithm while examining different values of the convergence criterion ε. However, as presented in Table 2 and Fig. 3, the value of the convergence criterion ε has an impact on the number of executions of the implemented jobs: the less restrictive it is, the fewer job executions have to be performed.
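The following short Python sketch restates equation (3) and the ε-based stopping criterion described above. The variable names are illustrative and the snippet is a reading aid under those naming assumptions, not the experimental code.

# Edge weight from equation (3): e_ij = 2 * d_ij / (d_i + d_j), where
# d_ij is the total duration of calls between users i and j, d_i the
# total duration of i's outgoing calls, and d_j the total duration of
# j's incoming calls.
def edge_weight(d_ij, d_i, d_j):
    return 2.0 * d_ij / (d_i + d_j)

# Stopping criterion: the algorithm iterates as long as the change of
# the class conditional probability of some node exceeds eps.
def converged(old_labels, new_labels, eps):
    return max(abs(new_labels[n] - old_labels[n]) for n in new_labels) <= eps

print(edge_weight(d_ij=30.0, d_i=100.0, d_j=50.0))    # 0.4
print(converged({"a": 0.50}, {"a": 0.505}, eps=0.01))  # True

In the MapReduce implementation this check is what the singlelabelscomparison and alllabelcomparison jobs of Table 1 compute in a distributed fashion: per-node differences first, then their maximum.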
Fig. 2. Types of nodes utilized in the experiments: labelled and unlabelled, training and testing ones, used only for label propagation, and omitted

Fig. 3. Execution time in [s] of map-reduce jobs for distinct values of the convergence criterion ε
Table 1. MapReduce jobs implemented in the algorithm.

Job name                  Job description
adjacencylist             takes an edge list as input and returns an adjacency list for all nodes
dummyadjlistandlabels     creates a list of dummy nodes with labels according to the algorithm [9] and updates the adjacency list with the newly created edges from the dummy nodes
mergeadjlistandlabel      merges the list of node labels with the adjacency list, resulting in the collective classification input
collectiveclassification  processes the collective classification data according to the algorithm and produces a new label list
singlelabelscomparison    outputs the absolute difference of the class conditional probability of labels between the current and the previous iteration
alllabelcomparison        returns the maximal difference of the input list (absolute differences of class conditional probabilities)

Table 2. Execution time in [s] and number of executions of map-reduce jobs for distinct values of the convergence criterion ε

                          ε = 0.01          ε = 0.001         ε = 0.0001
Job name                  Time  No. exec.   Time  No. exec.   Time  No. exec.
adjacencylist             19    1           19    1           19    1
dummyadjlistandlabels     17    1           17    1           17    1
mergeadjlistandlabel      134   7           192   10          245   13
collectiveclassification  117   7           164   10          216   13
singlelabelscomparison    101   6           152   9           223   12
alllabelcomparison        96    6           145   9           210   12

The results obtained during the experiments (MSE, execution time) indicate that the proposed MapReduce implementation of the Iterative Label Propagation algorithm correctly performs parallel computation and yields satisfactory prediction results. Moreover, it is able to accomplish prediction on big datasets, impossible to achieve in reasonable time with a single-threaded version of the algorithm.

5 Conclusions

The problem of collective classification using the MapReduce programming model was considered in this paper. We introduced a proposal for the implementation of the Iterative Label Propagation algorithm in the network. Thanks to that, the method can perform complicated calculations on big data sets.
The proposed method was examined on a real dataset from the telecommunication domain. The results indicated that it can be used to classify nodes in order to propose new offerings or tariffs to customers.

Further experimentation will consider a comparison of the presented method with other approaches. Moreover, further studies with much bigger data will be conducted.

Acknowledgement. This work was supported by The Polish National Center of Science research projects for 2011-2012 and 2011-2014, and by a Fellowship co-financed by The European Union within The European Social Fund.

References

1. Ekanayake, J., Pallickara, S., Fox, G.: MapReduce for data intensive scientific analyses. In: Proceedings of the 2008 Fourth IEEE International Conference on eScience, 2008.
2. Dean, J., Ghemawat, S.: MapReduce: simplified data processing on large clusters. In: Proceedings of the 6th Conference on Symposium on Operating Systems Design & Implementation, Berkeley, USA, USENIX Association, 2004.
3. White, T.: Hadoop: The Definitive Guide. O'Reilly, 2009.
4. Hadoop official web site: hadoop.apache.org, accessed 05.11.2011.
5. Szummer, M., Jaakkola, T.: Clustering and efficient use of unlabeled examples. In: Proceedings of Neural Information Processing Systems (NIPS), 2001.
6. Zhu, X., Ghahramani, Z., Lafferty, J.: Semi-supervised learning using Gaussian fields and harmonic functions. In: Proceedings of the International Conference on Machine Learning (ICML), 2003.
7. Azran, A.: The rendezvous algorithm: multiclass semi-supervised learning with Markov random walks. In: Proceedings of the International Conference on Machine Learning (ICML), 2007.
8. Jensen, D., Neville, J., Gallagher, B.: Why collective inference improves relational classification. In: Proceedings of the 10th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 593-598, 2004.
9. Desrosiers, C., Karypis, G.: Within-network classification using local structure similarity. Lecture Notes in Computer Science 5781, pp. 260-275, 2009.
10. Knobbe, A., de Haas, M., Siebes, A.: Propositionalisation and aggregates. In: Proceedings of the Fifth European Conference on Principles of Data Mining and Knowledge Discovery, pp. 277-288, 2001.
11. Kramer, S., Lavrac, N., Flach, P.: Propositionalization approaches to relational data mining. In: Dzeroski, S. (ed.) Relational Data Mining, Springer-Verlag, pp. 262-286, 2001.