Exploring Different Aspects of Social Network Analysis Using Web Mining Techniques




Hilal Ahmad Khanday¹, Dr. Rana Hashmy²
¹,² Department of Computer Sciences, University of Kashmir

Abstract
A social network is a set of people connected by a set of social relations. Thanks to the increasing availability of real-world social network data, social networks are receiving growing attention from the scientific community. The purpose of this paper is to explore multiple aspects of social network analysis using data mining techniques applied to the World Wide Web, referred to as Web mining. In particular, we inquire whether there is any matrix representation of social networks other than the adjacency matrix. We shed light on the properties of social networks and give a guided tour of the different techniques of Web mining. The main goal is to provide a roadmap for researchers who are trying to use data mining techniques to discover trends in social network data.

Keywords: Social Network, Social Network Analysis, Web Mining

1. INTRODUCTION
With the advent of Web 2.0, we have moved from closed, individual publishing, one-way communication, passive involvement, read-only content, and personal websites to collaborative group participation, two-way communication, active involvement, user-generated content, and blogging. In practice, we have moved from DoubleClick to Google AdSense, from Britannica Online to Wikipedia, from publishing to participation, from personal websites to blogging, from page views to cost per click, and so on.

Social network analysis has acquired huge popularity and represents one of the most important social and computer science phenomena of recent years. Several factors explain this, including the popularity of online social networks (OSNs), the availability of large volumes of OSN log data, the representation and analysis of social networks as graphs, and the market interest in social networks.
Social networking sites have skyrocketed in popularity in a very short span of time, as depicted in Fig. 1 [1]. Facebook, Twitter, LinkedIn, Wikipedia, and YouTube have all made it into the top 15 global websites.

Fig. 1: Global rank of some major social networking sites. The image was generated dynamically at www.alexa.com/comparison.

2. REPRESENTATION
Social network analysts use two kinds of tools from mathematics to represent information about patterns of ties among social actors: graphs and matrices [2].

2.1 Using Graphs
Problems in almost every conceivable discipline can be solved using graph models [3]. Network analysis uses (primarily) one kind of graphic display that consists of points (or nodes) to represent actors and lines (or edges) to represent ties or relations. Formally, a graph $G = (V, E)$ consists of $V$, a nonempty set of vertices (or nodes), and $E$, a set of edges. Each edge has either one or two vertices associated with it, called its endpoints. An edge is said to connect its endpoints. Statistically, a graph can be characterized by derived values such as the average degree of the nodes and the average path length between nodes. Additional characteristics include the graph's diameter, the number of triangles, the number of isomorphisms, and the clustering coefficient, among others.

We can use graph models to represent various relationships between people. A simple graph can represent whether two people know each other: each person in a particular group is represented by a vertex, and an undirected edge connects two people when they know each other. No multiple edges and usually no loops are used. (If we want to include self-knowledge, we would include loops.) An undirected graph suits social networking sites such as Facebook, whereas directed edges are appropriate for sites such as Twitter, where there is a concept of following.
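To make the distinction between undirected and directed ties concrete, here is a minimal Python sketch. It assumes a hypothetical four-person group; the names and ties are invented for illustration and do not come from any dataset used in this paper. A friendship network is stored as a set of unordered pairs, a following network as a set of ordered pairs, and the degree statistics mentioned above are derived from them.

```python
# Hypothetical example data: names and ties are illustrative only.
people = {"Asha", "Bilal", "Chen", "Dana"}

# Undirected ties (mutual acquaintance): each edge is an unordered pair.
friendships = {
    frozenset({"Asha", "Bilal"}),
    frozenset({"Bilal", "Chen"}),
    frozenset({"Chen", "Dana"}),
    frozenset({"Asha", "Chen"}),
}

# Directed ties ("u follows v"): each edge is an ordered pair (u, v).
follows = {
    ("Asha", "Bilal"),
    ("Bilal", "Asha"),
    ("Chen", "Asha"),
    ("Dana", "Asha"),
}


def degree(vertex, edges):
    """Number of undirected edges that have `vertex` as an endpoint."""
    return sum(1 for edge in edges if vertex in edge)


def in_degree(vertex, arcs):
    """Number of directed edges pointing at `vertex` (its followers)."""
    return sum(1 for _, target in arcs if target == vertex)


def out_degree(vertex, arcs):
    """Number of directed edges leaving `vertex` (the accounts it follows)."""
    return sum(1 for source, _ in arcs if source == vertex)


def average_degree(vertices, edges):
    """Average degree of an undirected graph; equals 2|E| / |V|."""
    return sum(degree(v, edges) for v in vertices) / len(vertices)
```

In the directed case a single vertex has two separate degrees, which is why a following relationship cannot be captured by one undirected edge without losing information.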
Volume 4, Issue 2, March April 2015 Page 121

The acquaintanceship graph of all the people in the world has seven billion vertices and probably more than one trillion edges. Many social scientists have conjectured that almost every pair of people in the world is linked by a small chain of people, perhaps containing just six or fewer people. Counting both endpoints, a chain of six people corresponds to a path of five edges, so this would mean that almost every pair of vertices in the acquaintanceship graph containing all the people in the world is linked by a path of length not exceeding five. The play Six Degrees of Separation is based on this notion.

2.2 Using Matrices
The most common form of data representation in social network analysis is the matrix form, as it is convenient and suitable for many mathematical operations. We can convert graphs into matrices and vice versa. The information contained in a graph $G = (V, E)$ can be stored in several ways, and the most common is the adjacency matrix. The adjacency matrix, denoted by $A$, contains entries that indicate whether two vertices are adjacent or not. An adjacency matrix is also called a sociomatrix.

The matrix $A$ of size $n \times n$ can describe an unweighted undirected graph $G = (V, E)$ containing $n$ vertices. Rows and columns of the sociomatrix both represent the index of each vertex in the graph and are labelled $1, 2, \ldots, n$. Each entry $a_{ij}$ in the sociomatrix indicates whether the pair of vertices $v_i$ and $v_j$ is adjacent. Usually, there is a 1 in the $(i, j)$th cell if there is an edge connecting $v_i$ and $v_j$ in the graph, and a 0 otherwise. Thus, if vertices $v_i$ and $v_j$ are adjacent, $a_{ij} = 1$; otherwise $a_{ij} = 0$. Because the graph is undirected, the matrix is symmetric with respect to its diagonal: $a_{ij} = a_{ji}, \ \forall\, i \neq j$.
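The conversion between a graph and its sociomatrix can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: the four-vertex edge list is hypothetical, vertices are numbered $1, \ldots, n$ as in the text, and plain nested lists stand in for a matrix type.

```python
def adjacency_matrix(n, edges):
    """Build the n x n sociomatrix of an undirected simple graph.

    Vertices are numbered 1..n; `edges` is an iterable of pairs (i, j).
    """
    matrix = [[0] * n for _ in range(n)]
    for i, j in edges:
        matrix[i - 1][j - 1] = 1
        matrix[j - 1][i - 1] = 1  # undirected: mirror the entry
    return matrix


def is_symmetric(matrix):
    """Check that a_ij == a_ji for every pair of indices."""
    n = len(matrix)
    return all(matrix[i][j] == matrix[j][i] for i in range(n) for j in range(n))


def edges_from_matrix(matrix):
    """Recover the edge list by reading only the upper triangle."""
    n = len(matrix)
    return [
        (i + 1, j + 1)
        for i in range(n)
        for j in range(i + 1, n)
        if matrix[i][j] == 1
    ]


# Hypothetical example graph with 4 vertices and 4 edges.
example_edges = [(1, 2), (2, 3), (3, 4), (1, 3)]
A = adjacency_matrix(4, example_edges)
```

Reading only the upper triangle in `edges_from_matrix` relies on the symmetry property stated above; for a directed graph every cell would have to be inspected.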
Sociomatrices are widely adopted for storing undirected network structures because of some particular properties. For example, social networks give rise to sparse sociomatrices, so it is convenient to adopt compact matrix decomposition techniques to store the data efficiently [4].

Another possible matrix representation of an undirected graph $G = (V, E)$ is the incidence matrix, usually denoted by $I$. It stores which edges are incident with which vertices, indexing the edges on the columns and the vertices on the rows, so the dimension of the matrix $I$ is $|V| \times |E|$. A matrix entry $I_{ij}$ contains 1 if the vertex $v_i$ is incident with the edge $e_j$, and 0 otherwise.

Both the incidence matrix and the adjacency matrix contain all the information required to describe the represented graph. Unlike the adjacency matrix, which is always square, the incidence matrix need not be square, which provides additional flexibility. Besides, the adjacency matrix of a simple graph is always symmetric, which may not be the case for incidence matrices.

In addition, when a simple graph consists of relatively few edges, i.e. when it is sparse, it is preferable to use adjacency lists rather than an adjacency matrix. For example, if a graph has $n$ vertices and each vertex has degree at most $c$, where $c$ is a constant much smaller than $n$, then each adjacency list contains at most $c$ vertices. Hence there are no more than $cn$ items in the adjacency lists, whereas the adjacency matrix for the same graph consists of $n^2$ entries, which is larger than $cn$ and therefore uses more memory.

Keeping all the above points in mind, considerable effort is needed to analyze adjacency lists and incidence matrices as candidate data structures for representing graphs of social networks.

3. PROPERTIES OF SOCIAL NETWORKS
Some properties of social networks are particularly important.
The first three properties are drawn from [5].

3.1 Diameter
There are short chains of friends that connect a large fraction of pairs of people in a social network. It is well known that most real-world graphs exhibit a relatively small diameter. A graph has diameter $D$ if every pair of nodes can be connected by a path of at most $D$ edges. Unfortunately, the literature on social networks tends to use the word diameter ambiguously, in reference to at least four different quantities:
1) The longest shortest-path length, which is the true graph-theoretic diameter but which is infinite in disconnected networks.
2) The longest shortest-path length between connected nodes, which is always finite but which cannot distinguish the complete graph from a graph with a solitary edge.
3) The average shortest-path length.
4) The average shortest-path length between connected nodes.

3.2 Navigability
Social networks exhibit the small-world phenomenon. In the small-world model defined by Watts, Dodds, and Newman [6], navigation is also possible. Their model is based on multiple hierarchies (geography, occupation, hobbies, etc.) into which people fall, and on a greedy algorithm that attempts to get closer to the target in any dimension at every step. Simulations have shown that this algorithm and model allow navigation, but no theoretical results have been established. Social networks are navigable small worlds: not only do short paths exist between most pairs of people, but using only local information and some knowledge of global structure, the people in the network are able to construct short paths to the target.

3.3 Clustering Coefficient
Informally, the clustering coefficient of a network is a measure of the probability that two people who have a common friend will themselves be friends. The clustering coefficient for a node $u \in V$ of a graph is the fraction of edges that exist within the neighbourhood of $u$, i.e., between two nodes adjacent to $u$.
The clustering coefficient of the entire network is the average clustering coefficient taken over all nodes in the graph.
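The definitions in Sections 3.1 and 3.3 can be sketched in Python as follows. This is a minimal illustration, not a scalable implementation: the four-node graph is hypothetical, nodes of degree below two are assigned a coefficient of 0 (an assumed convention, since the definition is otherwise undefined for them), and shortest paths are found by plain breadth-first search over connected pairs only.

```python
from collections import deque
from itertools import combinations

# Hypothetical undirected graph as adjacency sets (illustrative only).
graph = {
    1: {2, 3},
    2: {1, 3},
    3: {1, 2, 4},
    4: {3},
}


def local_clustering(graph, u):
    """Fraction of possible edges among the neighbours of u that exist."""
    neighbours = graph[u]
    k = len(neighbours)
    if k < 2:
        return 0.0  # assumed convention for nodes with fewer than 2 neighbours
    links = sum(1 for v, w in combinations(neighbours, 2) if w in graph[v])
    return links / (k * (k - 1) / 2)


def average_clustering(graph):
    """Clustering coefficient of the network: mean over all nodes."""
    return sum(local_clustering(graph, u) for u in graph) / len(graph)


def bfs_distances(graph, source):
    """Shortest-path length from `source` to every node reachable from it."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in graph[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist


def connected_pair_distances(graph):
    """Shortest-path lengths over all unordered pairs of connected nodes."""
    lengths = []
    for u in graph:
        for v, d in bfs_distances(graph, u).items():
            if u < v:  # count each unordered pair once
                lengths.append(d)
    return lengths


lengths = connected_pair_distances(graph)
longest_shortest_path = max(lengths)                  # quantity 2) in Section 3.1
average_shortest_path = sum(lengths) / len(lengths)   # quantity 4) in Section 3.1
```

Because unreachable pairs never appear in the breadth-first search result, the two values computed here correspond to the "between connected nodes" variants of the diameter and stay finite on disconnected networks.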

3.4 Centrality and Power
All sociologists would agree that power is a fundamental property of social structures. There is much less agreement about what power is, and how we can describe and analyse its causes and consequences. Below we summarise some of the main approaches that social network analysis has developed to study power and the closely related concept of centrality.
A) Degree: The number of ties for an actor, where a tie connects two nodes in a graph. Ties can be directed or undirected. Many human behaviours, such as advice seeking and information sharing, are directed ties, while co-memberships are examples of undirected ties [7].
B) Closeness: The degree to which an individual is near all other individuals in a network (directly or indirectly). It reflects the ability to access information through the grapevine of network members. Closeness is the inverse of the sum of the shortest distances between an individual and every other person in the network.
C) Betweenness: The extent to which a node lies between other nodes in a network. This measure takes into account the connectivity of the node's neighbours, giving a higher value to nodes that bridge clusters.
D) Density: A measure of how closely connected a network is. Given a number of nodes, the more links between them, the larger the density. If the number of nodes in a network is $n$ and the number of links is $l$, then its density is $\frac{l}{n(n-1)}$ for directed graphs and $\frac{2l}{n(n-1)}$ for undirected graphs.

4. WEB MINING
As large and dynamic information sources that are structurally complex and ever growing, social networks are fertile ground for applying data mining principles, that is, Web mining. The Web mining field encompasses a wide array of issues, primarily aimed at deriving actionable knowledge from the Web, and includes researchers from information retrieval, database technologies, and artificial intelligence [8].
Since Oren Etzioni [9], among others, formally introduced the term, authors have used Web mining to mean slightly different things. For example, Jaideep Srivastava and colleagues [10] define it as "the application of data-mining techniques to extract knowledge from Web data, in which at least one of structure or usage (Web log) data is used in the mining process (with or without other types of Web data)." Web mining consists of three main categories, according to the Web data used as input: Web content mining, Web structure mining, and Web usage mining.

4.1 Web Content Mining
Web content mining is the application of data mining techniques to content published on the Internet, usually as HTML (semi-structured), plain text (unstructured), or XML (structured) documents. In other words, it may be defined as the procedure of retrieving information from the Web into more structured forms and indexing it so that it can be retrieved quickly [11]. Table 1 summarizes the concept of Web content mining.

Table 1: Web Content Mining

4.2 Web Structure Mining
Web structure mining operates on the Web's hyperlink structure. This graph structure can provide information about a page's ranking [12] or authoritativeness [13] and enhance search results through filtering. In other words, Web structure mining may be defined as the process by which we discover the model of the link structure of Web pages. Its aim is to generate a structured abstract of the website and its pages. Table 2 summarises Web structure mining.

Table 2: Web Structure Mining
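As an illustration of how hyperlink structure alone can yield a page ranking, the following Python sketch runs a simplified power iteration in the spirit of PageRank [12]. It is a sketch under stated assumptions rather than the algorithm as published: the damping factor of 0.85, the fixed number of iterations, the even redistribution of rank from pages without outgoing links, and the four-page link graph are all choices made here for illustration.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Simplified PageRank by power iteration.

    `links` maps each page to the list of pages it links to; every
    linked page must itself be a key of `links`.
    """
    pages = list(links)
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}  # start from a uniform distribution
    for _ in range(iterations):
        # Every page receives a small baseline share (the "random jump").
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page, outgoing in links.items():
            if outgoing:
                # Split this page's damped rank evenly among its link targets.
                share = damping * rank[page] / len(outgoing)
                for target in outgoing:
                    new_rank[target] += share
            else:
                # Assumed handling of pages with no outgoing links:
                # spread their rank evenly over all pages.
                for target in pages:
                    new_rank[target] += damping * rank[page] / n
        rank = new_rank
    return rank


# Hypothetical site structure (page names are illustrative only).
site_links = {
    "home": ["about", "blog"],
    "about": ["home"],
    "blog": ["home", "about"],
    "contact": ["home"],
}
ranks = pagerank(site_links)
best_page = max(ranks, key=ranks.get)
```

In this toy graph the page that receives links from every other page ends up with the highest score, and a page that nobody links to keeps only the baseline share, which matches the intuition that incoming links signal importance.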

4.3 Web Usage Mining
Web usage mining analyses the results of user interactions with a Web server, including Web logs, clickstreams, and database transactions at a website or a group of related sites. It is used to identify browsing patterns by analysing the navigational behaviour of users. Web usage mining tries to make sense of the data generated by Web surfers' sessions or behaviours, whereas Web content mining and Web structure mining use the real or primary data on the Web. Web usage mining introduces privacy concerns and is currently the topic of extensive debate.

Table 3: Web Usage Mining

5. WEB MINING TECHNIQUES FOR SOCIAL NETWORK ANALYSIS

5.1 Association Rules
Association rules are if/then statements that help uncover relationships between seemingly unrelated data in a transactional database, relational database, or other information repository. Association rule mining was first introduced in [14]. It is a data mining technique whose goal is to find interesting relationships among a large set of data items. It aims to extract interesting correlations, frequent patterns, associations, or casual structures among sets of items in transaction databases or other data repositories. A survey of association rule mining is presented in [15].

Simple marginal and conditional probabilities are insufficient to tell us about causal relationships; more sophisticated techniques are required. Association rule mining is easy to use and implement. The patterns discovered with this technique can be represented in the form of association rules. Rule support and confidence are two measures of rule interestingness. Typically, association rules are considered interesting if they satisfy both a minimum support threshold and a minimum confidence threshold. Such thresholds can be set by users or domain experts.
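The two interestingness measures can be made concrete with a short Python sketch. It uses the usual definitions, which the text does not spell out and which are therefore stated here as assumptions: the support of an itemset is the fraction of transactions containing it, and the confidence of a rule X -> Y is support(X and Y) divided by support(X). The transactions (interest groups joined by five hypothetical users) and the threshold values are invented for illustration.

```python
# Hypothetical transactions: interest groups joined by five users.
transactions = [
    {"photography", "travel", "food"},
    {"photography", "travel"},
    {"travel", "food"},
    {"photography", "travel", "music"},
    {"music", "food"},
]


def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(1 for t in transactions if itemset <= t) / len(transactions)


def confidence(antecedent, consequent, transactions):
    """Estimated P(consequent | antecedent) for the rule antecedent -> consequent."""
    base = support(antecedent, transactions)
    if base == 0:
        return 0.0  # rule never applies
    both = support(set(antecedent) | set(consequent), transactions)
    return both / base


def is_interesting(antecedent, consequent, transactions,
                   min_support, min_confidence):
    """A rule is kept only if it clears both user-chosen thresholds."""
    rule_support = support(set(antecedent) | set(consequent), transactions)
    rule_confidence = confidence(antecedent, consequent, transactions)
    return rule_support >= min_support and rule_confidence >= min_confidence
```

With a minimum support of 0.5 and a minimum confidence of 0.8, the rule {photography} -> {travel} is kept while the reverse rule {travel} -> {photography} is rejected, even though both rules share the same support. This asymmetry is why confidence is checked in addition to support.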
In social network analysis, association rule mining can help discover the hidden relationships between the different nodes of a network.

5.2 Classification
Classification automatically builds a model that can classify a class of objects so as to predict the class or missing attribute value of future objects whose class may not be known [16]. It is a two-step process. In the first step, a model is constructed from a collection of training data to describe the characteristics of a set of data classes or concepts. Since the data classes or concepts are predefined, and the class of each training sample is provided, this step is also known as supervised learning. In the second step, the model is used to predict the classes of future objects or data.

There are many techniques for classification. Classification by decision tree has been well researched, and many algorithms have been developed. A complete survey of classification using decision trees is given in [17]. Bayesian classification is another technique, described in [18]. Nearest-neighbour methods are also discussed in many statistical texts on classification, such as [19]. Many other machine learning and neural network techniques are used for constructing classification models.

5.3 Clustering
Clustering is similar to classification, but it is an unsupervised learning process, whereas classification is a supervised one. Clustering is the process of grouping a set of physical or abstract objects into classes of similar objects, so that objects within the same cluster are similar to some extent and dissimilar to the objects in other clusters [16]. In classification, the class to which each record belongs is predefined, whereas in clustering there are no predefined classes. In clustering, objects are grouped together based on their similarities.
Similarity between objects is defined by similarity functions; usually, similarities are quantitatively specified as distances or other measures by the corresponding domain experts. A survey of clustering techniques and algorithms can be found in [20].

In social network analysis, discovering the closest people in the network is usually the main mission, and in a small social network it is generally achieved by using a visualization technique. Clustering may therefore emerge as a potential technique for identifying more clusters and groups in large social networks. Besides, it can offer more detailed information than visualisation [21], including the closeness of a group, detailed information about the members of a group, and the relationships between groups in a social network.

6. CONCLUSION AND FUTURE RESEARCH
As an approach to social network research, social network analysis displays four features: structural intuition, systematic relational data, graphic images, and mathematical or computational models [22]. Here we tried to present a holistic approach by considering all of the

above features. This paper studies the formal methods used to represent social networks and the various properties of these networks. In particular, the representation of social networks using incidence matrices was considered from a mathematical perspective. To reach a formal conclusion, a lot of research still needs to be done comparing adjacency matrices and adjacency lists with incidence matrices, using graphs obtained from online social networks as the input data; this will be part of our future work. Besides that, the computational cost of association rule mining is a disadvantage, and future work will focus on reducing it. Finally, we hope that this study helps readers understand social networks and enriches the study of applications of Web mining techniques to these networks.

REFERENCES
[1] Alexa, "Alexa the Web Information Company," 2014. [Online; accessed Dec. 25, 2014]. Available: www.alexa.com/comparison/
[2] R. A. Hanneman and M. Riddle, Introduction to Social Network Methods, 2005. [Online]. Available: http://www.faculty.ucr.edu/~hanneman/nettext/
[3] Kenneth H. Rosen, Discrete Mathematics & its Applications with Combinatorics and Graph Theory, TMH, 2012.
[4] V. Snasel, Z. Horak, J. Kocibova, A. Abraham, "Reducing social network dimensions using matrix factorization methods," International Conference on Advances in Social Network Analysis and Mining, pp. 348-351, IEEE, 2009.
[5] D. Liben-Nowell, An Algorithmic Approach to Social Networks, Ph.D. Thesis, Massachusetts Institute of Technology, June 2005.
[6] D. J. Watts, P. Sheridan Dodds, and M. E. J. Newman, "Identity and Search in Social Networks," Science, 296:1302-1305, 17 May 2002.
[7] G. Plickert, R. Cote, B. Wellman, "It's Not Who You Know, It's How You Know Them: Who Exchanges What with Whom?", Social Networks, vol. 29, no. 3, pp. 405-429, 2007.
[8] P. Kolari, A. Joshi, "Web Mining: Research and Practice," Computing in Science & Engineering (IEEE CS), July-August 2004.
[9] O. Etzioni, "The World Wide Web: Quagmire or Gold Mine?", Comm. ACM, vol.
39, no. 11, pp. 65-68, 1996.
[10] J. Srivastava, P. Desikan, and V. Kumar, "Web Mining: Accomplishments and Future Directions," Proc. US National Science Foundation Workshop on Next-Generation Data Mining (NGDM), National Science Foundation, 2002.
[11] Z. S. Zubi, "Ranking Web Pages Using Web Structure Mining Concepts," Recent Advances in Telecommunications, Signals and Systems, 2010.
[12] L. Page et al., "The PageRank Citation Ranking: Bringing Order to the Web," Tech. report, Stanford Digital Library Technologies, Jan. 1998.
[13] J. Kleinberg, "Authoritative Sources in a Hyperlinked Environment," Proc. 9th Ann. ACM-SIAM Symp. Discrete Algorithms, ACM Press, pp. 668-677, 1998.
[14] R. Agrawal, T. Imielinski, and A. N. Swami, "Mining Association Rules between Sets of Items in Large Databases," in Proceedings of the ACM SIGMOD International Conference on Management of Data, Washington, D.C., pp. 207-216, 1993.
[15] Q. Zhao, S. Bhowmick, "Association Rule Mining: A Survey," Technical Report No. 2003116, CAIS, Nanyang Technological University, Singapore, 2003.
[16] J. Han, M. Kamber, Data Mining: Concepts and Techniques, Morgan Kaufmann, 2000.
[17] S. Murthy, "Automatic construction of decision trees from data: A multi-disciplinary survey," Data Mining and Knowledge Discovery, vol. 2, no. 4, pp. 345-389, 1998.
[18] R. Duda, P. Hart, Pattern Classification and Scene Analysis, Wiley & Sons, Inc., 1973.
[19] M. James, Classification Algorithms, Wiley & Sons, Inc., 1985.
[20] P. Berkhin, "Survey of clustering data mining techniques," Technical Report, Accrue Software, San Jose, CA, 2002.
[21] B. Tatemura, Y. Wu, "Tomographic Clustering to Visualize Blog Communities as Mountain Views," in Proc. of WWW Conference, Japan, 10-14 May 2005.
[22] B. Furht (ed.), Handbook of Social Network Technologies and Applications, Springer, 2010.

AUTHOR
Hilal Ahmad Khanday received his Bachelor's and Master's degrees in Computers from the University of Kashmir and IUST, respectively.
He has qualified in many national examinations conducted by UGC and MHRD and is currently a research scholar and faculty member at the University of Kashmir.

Dr. Rana Hashmy is Scientist C at the University of Kashmir. She received a Ph.D. (Computer Science) from Jawaharlal Nehru University, Delhi (India). Her areas of research interest include software engineering, data warehousing, and data mining. She has many research publications related to her field of interest.