MapReduce Approach to Collective Classification for Networks


Wojciech Indyk, Tomasz Kajdanowicz, Przemyslaw Kazienko, and Slawomir Plamowski
Wroclaw University of Technology, Wroclaw, Poland
Faculty of Computer Science and Management

Abstract. The paper considers the collective classification problem for big data sets using the MapReduce programming model. We introduce a proposal for an implementation of a label propagation algorithm in the network. The method was examined on a real dataset from the telecommunication domain. The results indicate that it can be used to classify nodes in order to propose new offerings or tariffs to customers.¹

Keywords: MapReduce, collective classification, classification in networks, label propagation

1 Introduction

Relations between objects in many different systems are commonly modelled by networks. Examples include hyperlinks connecting web pages, paper citations, conversations, or social interactions in social portals. Network models are in turn a basis for different types of processing and analysis. One of them is node classification (labelling of the nodes in the network). Node classification has a deep theoretical background; however, due to new phenomena appearing in artificial environments such as social networks on the Internet, the problem of node classification has recently been reinvented and reimplemented.

Nodes may be classified in networks either by inference based on known profiles of these nodes (the regular concept of classification) or based on relational information derived from the network. The second approach utilizes information about connections between nodes (the structure of the network) and can be very useful in assigning labels to the nodes being classified. For example, it is very likely that a given web page x is related to sport (label sport) if x is linked to by many other web pages about sport.
Hence, a form of collective classification should be provided, with simultaneous decision making on every node's label rather than classifying each node separately. Such an approach allows taking into account correlations between connected nodes, which usually deliver undervalued knowledge.

¹ This is not the final version of this paper. You can find the final version on the publisher web page.
Moreover, the growing trend of data explosion in transactional systems requires more sophisticated methods in order to analyse enormous amounts of data. There is a strong need to process big data in parallel, especially in complex analyses such as collective classification. A MapReduce approach to collective classification that is able to process huge data is proposed and examined in the paper.

Section 2 covers related work, while Section 3 presents a proposal for a MapReduce approach to label propagation in the network. Section 4 contains a description of the experimental setup and the obtained results. The paper is concluded in Section 5.

2 Related Work

2.1 Collective classification

Collective classification problems may be solved using two main approaches: within-network and across-network inference. Within-network classification, in which training entities are connected directly to the entities whose labels are to be classified, stands in contrast to across-network classification, where models learnt from one network are applied to another similar network [8]. Overall, networked data have several unique characteristics that simultaneously complicate and provide leverage to learning and classification. Among others, statistical relational learning (SRL) techniques were introduced, including probabilistic relational models, relational Markov networks, and probabilistic entity-relationship models [9, 10].

Two distinct types of classification in networks may be distinguished: one based on a collection of local conditional classifiers and one based on classification stated as a single global objective function. The best-known implementations of the first approach are iterative classification (ICA) and the Gibbs sampling algorithm (GS), whereas examples of the latter are loopy belief propagation (LBP) and mean-field relaxation labeling (MF) [11].

2.2 MapReduce programming model

MapReduce is a programming model for data processing derived from functional languages [3].
MapReduce breaks the processing into two consecutive phases: the map phase and the reduce phase. Big data processing usually requires parallel execution, and MapReduce provides and manages it. Processing starts with splitting the data into separate chunks. Each data chunk must be in <key, value> format, according to the input file configuration. Each data chunk is then processed by a Map function. Map takes an input pair and produces a set of <key, value> pairs. All values associated with the same key are grouped together and passed to the Reduce phase. The Reduce function accepts a key and a set of values for that key. It processes the received values and returns a new <key, value> pair to be saved as the output of processing. Usually a reducer emits one <key, value> pair. Both the Map and Reduce phases need to be specified and implemented by the user [1, 2]. The process is presented in Fig. 1.
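The flow described above (map, grouping by key, reduce) can be illustrated with a minimal single-process sketch. It is not Hadoop code and not the authors' implementation: the helper `run_mapreduce`, the functions `map_call` and `reduce_durations`, and the toy call records are all hypothetical names and data chosen for illustration.

```python
from collections import defaultdict


def run_mapreduce(records, map_fn, reduce_fn):
    """Run one map -> group-by-key -> reduce pass in a single process.

    A real framework distributes these steps over many machines; this
    sketch only mimics the data flow.
    """
    grouped = defaultdict(list)
    # Map phase: every input <key, value> pair may emit several pairs.
    for key, value in records:
        for out_key, out_value in map_fn(key, value):
            grouped[out_key].append(out_value)
    # Reduce phase: each key is processed together with all of its values.
    output = []
    for key in sorted(grouped):
        output.extend(reduce_fn(key, grouped[key]))
    return output


def map_call(call_id, call):
    """Emit <caller, duration> for one call record (hypothetical schema)."""
    caller, _callee, duration = call
    yield caller, duration


def reduce_durations(caller, durations):
    """Sum all outgoing call durations of one caller."""
    yield caller, sum(durations)


# Toy input: <call id, (caller, callee, duration in seconds)>
calls = [
    (0, ("a", "b", 60)),
    (1, ("a", "c", 30)),
    (2, ("b", "a", 45)),
]
totals = dict(run_mapreduce(calls, map_call, reduce_durations))
print(totals)
```

Here each reducer emits exactly one pair per key, which matches the usual case mentioned in the text.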
Fig. 1. The MapReduce programming model

MapReduce is able to process very large datasets thanks to the initial split of the data into small chunks. The most common open-source implementation of the MapReduce model is the Apache Hadoop library [4]. Apache Hadoop is a framework that allows distributed processing of large data sets across clusters of computers, and offers local computation and storage. Hadoop delivers high availability not through hardware but through failure handling at the application layer. A single MapReduce phase in Hadoop is named a Job. A Job consists of a map method, a reduce method, input data files and configuration.

3 Collective Classification by Means of Label Propagation Using MapReduce

The most common way to utilize the information of labelled and unlabelled data is to construct a graph from the data and perform a Markov random walk on it. The idea of a Markov random walk has been used multiple times [5-7] and involves defining a probability distribution over the labels for each node in the graph. For labelled nodes the distribution reflects the true labels. The aim is then to recover this distribution for the unlabelled nodes. Using such a Label Propagation approach allows performing classification based on relational data.

Let $G(V, E, W)$ denote a graph with vertices $V$, edges $E$ and an $n \times n$ edge weight matrix $W$. According to [6], in a weighted graph $G(V, E, W)$ with $n = |V|$ vertices, label propagation may be solved by linear equations 1 and 2.

$$\forall i, j \in V: \quad \sum_{(i,j) \in E} w_{ij} F_i = \sum_{(i,j) \in E} w_{ij} F_j \qquad (1)$$

$$\forall i \in V: \quad \sum_{c \in classes(i)} F_i = 1 \qquad (2)$$

where $F_i$ denotes the probability density of classes for node $i$.

Let us assume the set of nodes $V$ is partitioned into labelled $V_L$ and unlabelled $V_U$ vertices, $V = V_L \cup V_U$. Let $F_u$ denote the probability distribution over the labels associated with vertex $u \in V$. For each node $v \in V_L$, for which $F_v$ is known, a dummy node $v'$ is inserted such that $w_{v'v} = 1$ and $F_{v'} = F_v$. This operation is equivalent to the clamping discussed in [6]. Let $V_D$ be the set of dummy nodes. Then equations 1 and 2 can be solved with the Iterative Label Propagation algorithm, see Algorithm 1.

Algorithm 1. The pseudo code of the Iterative Label Propagation algorithm.

    repeat
        for all $v \in (V \cup V_D)$ do
            $F_v = \frac{\sum_{(u,v) \in E} w_{uv} F_u}{\sum_{(u,v) \in E} w_{uv}}$
        end for
    until convergence

As can be observed, at each iteration of Iterative Label Propagation certain operations are performed on each node. These operations are calculated based on local information only, namely the node's neighbourhood. This fact can be utilized in a parallel version of the algorithm, see Algorithm 2.

Algorithm 2. The pseudo code of the MapReduce approach to the Iterative Label Propagation algorithm.

    map <node; adjacencyList>
        for all n in adjacencyList do
            propagate <n; node.label, n.weight>
        end for

    reduce <n, list(node.label, weight)>
        propagate <n, sum(node.label * weight) / sum(weight)>

The MapReduce version of the Iterative Label Propagation algorithm consists of two phases. The Map phase takes all labelled and dummy nodes and propagates their labels to all nodes in their adjacency lists, taking into account the edge weights between nodes. The Reduce phase calculates a new label for each node with at least one labelled neighbour. Reducers calculate the new label of a node based on the list of its labelled neighbours and the relation strength between the nodes (weight). The final result, namely the new label for a particular node, is computed as the weighted sum of label probabilities from the neighbourhood, normalized by the sum of weights.

4 Experiments and Results

For the purpose of the experimental setup, a telecommunication network was built from a 3-month history of phone calls from a leading European telecommunication
company. The original dataset consisted of about phone calls and more than 16 million unique users. All communication facts (phone calls) were performed using one of 38 tariffs, of 4 types each. In order to limit the amount of data and simplify the task to meet the limitations of the hardware environment, only two types of phone calls were extracted and used in the experiments. Users were labelled with the class conditional probability of tariffs: the sum of the durations of outgoing phone calls in a particular tariff was divided by the total duration of all outgoing calls. Eventually, the final dataset consisted of users. Afterwards, the user network was built, with the connection strength between particular users calculated according to equation 3.

$$e_{ij} = \frac{2 d_{ij}}{d_i + d_j} \qquad (3)$$

where $d_{ij}$ denotes the total duration of calls between users $i$ and $j$, $d_i$ the total duration of the outgoing calls of user $i$, and $d_j$ the total duration of the incoming calls of user $j$. The obtained network was composed of weighted edges between the aforementioned users.

The goal of the experiment was to predict the class conditional probability of tariffs for unlabelled users. The initial set of labelled nodes (training set) for collective prediction was set to 37% of users, chosen randomly according to a uniform distribution. The rest of the nodes should potentially belong to the test set; however, due to a property of the examined algorithm, some nodes could not be reached and thus could not have a label assigned. This means that some nodes did not possess incoming edges, so the algorithm was not able to propagate label probabilities to them. Eventually, the final test set was composed of only 2% of users, distributed over the whole network. Nevertheless, the rest of the nodes were used to keep the structure of the network and the propagation of labels, see Fig. 2.

The collective classification algorithm was implemented in the MapReduce programming model. It consists of six Jobs, each accomplishing mapreduce phases.
A detailed description of the Jobs is presented in Table 1. The convergence criterion of the algorithm was controlled by ε, the change of conditional probability for each node. The algorithm kept iterating as long as this change was greater than ε. The experiment was organised to examine the computational time spent in each of the mapreduce steps as well as the number of iterations of the algorithm. The time was measured for three distinct values of ε = {0.01, 0.001, }. The final assessment of the implemented algorithm was made using the mean square error between the predicted label probability and the known (true) label probability. The Mean Square Error (MSE) equals for all three ε values. Therefore we did not observe significant changes in the performance of the algorithm while examining different values of the convergence criterion ε. However, as presented in Table 2 and Fig. 3, the value of the convergence criterion ε has an impact on the number of executions of the implemented jobs. The less restrictive it is, the fewer job executions have to be performed.
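The map and reduce steps of Algorithm 2, the dummy-node clamping, and the ε stopping rule described above can be sketched in plain Python as follows. This is a single-process illustration under stated assumptions, not the six-Job Hadoop implementation from the paper: the function names, the edge-list input format, the `max_iterations` guard, and the choice to average the squared error over (node, class) pairs are all assumptions made for this sketch.

```python
from collections import defaultdict


def map_node(label, adjacency):
    """Map: send a node's label distribution along each outgoing edge."""
    for neighbour, weight in adjacency:
        yield neighbour, (label, weight)


def reduce_node(messages):
    """Reduce: weighted average of the distributions received by one node."""
    total_weight = sum(weight for _, weight in messages)
    new_label = defaultdict(float)
    for label, weight in messages:
        for cls, prob in label.items():
            new_label[cls] += weight * prob / total_weight
    return dict(new_label)


def max_change(old, new):
    """Largest absolute change of any class probability of any node."""
    changes = [
        abs(new[node].get(cls, 0.0) - old.get(node, {}).get(cls, 0.0))
        for node in new
        for cls in set(new[node]) | set(old.get(node, {}))
    ]
    return max(changes, default=0.0)


def propagate_labels(edges, known_labels, epsilon=0.01, max_iterations=100):
    """Iterate map/reduce rounds until the largest change is at most epsilon."""
    adjacency = defaultdict(list)
    for source, target, weight in edges:
        adjacency[source].append((target, weight))
    labels, dummies = {}, set()
    for node, label in known_labels.items():
        dummy = ("dummy", node)  # clamps the known label with weight 1
        dummies.add(dummy)
        adjacency[dummy].append((node, 1.0))
        labels[dummy] = dict(label)
        labels[node] = dict(label)
    iterations = 0
    while iterations < max_iterations:
        iterations += 1
        inbox = defaultdict(list)
        for node, label in labels.items():
            for target, message in map_node(label, adjacency[node]):
                inbox[target].append(message)
        updated = dict(labels)  # nodes without incoming messages keep labels
        for node, messages in inbox.items():
            updated[node] = reduce_node(messages)
        change = max_change(labels, updated)
        labels = updated
        if change <= epsilon:
            break
    result = {n: l for n, l in labels.items() if n not in dummies}
    return result, iterations


def mean_square_error(predicted, truth):
    """MSE over all (node, class) pairs of nodes present in both inputs."""
    errors = [
        (predicted[node].get(cls, 0.0) - prob) ** 2
        for node, label in truth.items() if node in predicted
        for cls, prob in label.items()
    ]
    return sum(errors) / len(errors)


# Toy network: labelled users a and b both point to the unlabelled user c.
edges = [("a", "c", 1.0), ("b", "c", 3.0), ("d", "a", 1.0)]
known = {"a": {"t1": 1.0, "t2": 0.0}, "b": {"t1": 0.0, "t2": 1.0}}
predicted, rounds = propagate_labels(edges, known, epsilon=0.01)
print(predicted["c"], rounds)
```

In this toy network the user d has no incoming edges and no known label, so it never receives a label, which mirrors the unreachable nodes mentioned in the experimental setup.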
Fig. 2. Types of nodes that have been utilized in the experiments: labelled and unlabelled, training and testing ones, used only for label propagation and omitted

Fig. 3. Execution time in [s] of mapreduce jobs for distinct convergence criterion ε
Table 1. MapReduce jobs implemented in the algorithm.

Job name | Job description
adjacencylist | the job takes an edge list as input and returns an adjacency list for all nodes
dummyadjlistandlabels | the job creates a list of dummy nodes with labels according to algorithm [9] and updates the adjacency list with the newly created edges from the dummy nodes
mergeadjlistandlabel | the job merges the list of node labels with the adjacency list, resulting in the collective classification input
collectiveclassification | the job processes the collective classification data according to algorithm and results in a new label list
singlelabelscomparison | the job returns the absolute difference of the class conditional probability of labels between the current iteration and the previous iteration
alllabelcomparison | the job returns the maximal difference of the input list (absolute difference of class conditional probability)

Table 2. Execution time in [s] and number of executions of mapreduce jobs for distinct convergence criterion ε

Job name | ε = 0.01: No. exec., Time | ε = : No. exec., Time | ε = : No. exec., Time
adjacencylist |
dummyadjlistandlabels |
mergeadjlistandlabel |
collectiveclassification |
singlelabelscomparison |
alllabelcomparison |

The results obtained during the experiments (MSE, execution time) indicate that the proposed MapReduce approach to implementing the Iterative Label Propagation algorithm correctly performs parallel computation and yields satisfactory prediction results. Moreover, it is able to accomplish prediction on a big dataset, which is impossible to achieve in reasonable time with a single-thread version of the algorithm.

5 Conclusions

The problem of collective classification using the MapReduce programming model was considered in the paper. We introduced a proposal for an implementation of the Iterative Label Propagation algorithm in the network. Thanks to that, the method can perform complicated calculations on big data sets.
The proposed method was examined on a real dataset from the telecommunication domain. The results indicate that it can be used to classify nodes in order to propose new offerings or tariffs to customers. Further experimentation will consider a comparison of the presented method with other approaches. Moreover, further studies with much bigger data will be conducted.

Acknowledgement

This work was supported by The Polish National Center of Science the research project , and Fellowship cofinanced by The European Union within The European Social Fund.

References

1. Ekanayake, J., Pallickara, S., Fox, G., MapReduce for Data Intensive Scientific Analyses. Proceedings of the 2008 Fourth IEEE International Conference on eScience.
2. Dean, J., Ghemawat, S., MapReduce: simplified data processing on large clusters. Proceedings of the 6th conference on Symposium on Operating Systems Design & Implementation, Berkeley, USA, 2004, USENIX Association, pp. 1024.
3. White, T., Hadoop: The Definitive Guide, O'Reilly.
4. Hadoop official web site: hadoop.apache.org.
5. Szummer, M., Jaakkola, T., Clustering and efficient use of unlabeled examples. In Proceedings of Neural Information Processing Systems (NIPS).
6. Zhu, X., Ghahramani, Z., Lafferty, J., Semi-supervised learning using Gaussian fields and harmonic functions. In Proceedings of the International Conference on Machine Learning (ICML).
7. Azran, A., The rendezvous algorithm: Multiclass semi-supervised learning with Markov random walks. In Proceedings of the International Conference on Machine Learning (ICML).
8. Jensen, D., Neville, J., Gallagher, B., Why collective inference improves relational classification. In Proceedings of the 10th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining.
9. Desrosiers, C., Karypis, G., Within-network classification using local structure similarity. Lecture Notes in Computer Science 5781.
10. Knobbe, A., de Haas, M., Siebes, A., Propositionalisation and aggregates. In Proceedings of the Fifth European Conference on Principles of Data Mining and Knowledge Discovery.
11. Kramer, S., Lavrac, N., Flach, P., Propositionalization approaches to relational data mining. In: Dzeroski, S. (ed.) Relational Data Mining, Springer-Verlag, 2001.