Exploring Different Aspects of Social Network Analysis Using Web Mining Techniques

Hilal Ahmad Khanday¹, Dr. Rana Hashmy²
¹,² Department of Computer Sciences, University of Kashmir

Abstract

A social network is a set of people connected by a set of social relations. Thanks to the growing availability of real-world social network data, social networks are receiving increasing attention from the scientific community. The purpose of this paper is to explore multiple aspects of social network analysis through the application of data mining techniques to the World Wide Web, referred to as Web mining. In particular, we ask whether social networks can be represented by a matrix other than the adjacency matrix. We also review the properties of social networks and give a guided tour of the different techniques of Web mining. The main goal is to provide a roadmap for researchers who want to use data mining techniques to discover trends in social network data.

Keywords: Social Network, Social Network Analysis, Web Mining

1. INTRODUCTION

With the advent of Web 2.0, we have moved from closed, individual publishing, one-way communication, passive involvement, read-only content and personal websites to collaborative group participation, two-way communication, active involvement, user-generated content and blogging. In practice, we have moved from DoubleClick to Google AdSense, from Britannica Online to Wikipedia, from publishing to participation, from personal websites to blogging, and from page views to cost per click. Social network analysis has become hugely popular and represents one of the most important social and computer science phenomena of recent years. Several factors have contributed to this, including the popularity of online social networks (OSNs), the availability of large volumes of OSN log data, the representation and analysis of social networks as graphs, and the market interest in social networks.
Social networking sites have skyrocketed in popularity in a very short span of time, as depicted in Fig 1 [1]. Facebook, Twitter, LinkedIn, Wikipedia and YouTube have all made it into the top 15 global websites.

Fig 1 - Global rank of some major social networking sites. Image was generated dynamically at www.alexa.com/comparison

2. REPRESENTATION

Social network analysts use two kinds of tools from mathematics to represent information about patterns of ties among social actors: graphs and matrices [2].

2.1 Using Graphs

Problems in almost every conceivable discipline can be solved using graph models [3]. Network analysis uses (primarily) one kind of graphic display that consists of points (or nodes) to represent actors and lines (or edges) to represent ties or relations. Formally, a graph G = (V, E) consists of V, a nonempty set of vertices (or nodes), and E, a set of edges. Each edge has either one or two vertices associated with it, called its endpoints. An edge is said to connect its endpoints. Statistically, a graph can be characterized by derived values such as the average degree of the nodes and the average path length between nodes. Additional characteristics include the graph's diameter, the number of triangles, the number of isomorphisms and the clustering coefficient, among others.

We can use graph models to represent various relationships between people. For example, a simple graph can represent whether two people know each other. Each person in a particular group of people is represented by a vertex, and an undirected edge connects two people when they know each other. No multiple edges and usually no loops are used. (If we want to include self-knowledge, we would include loops.) An undirected graph suits social networking sites such as Facebook, where friendship is mutual, whereas directed edges are needed for sites such as Twitter, which are built around the concept of following.
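As a minimal sketch of the two cases just described, the following Python snippet stores a small friendship network as a set of undirected edges and a small follower network as a set of directed edges, and then computes the average degree mentioned above. The people, the ties and the function names are invented for illustration and do not come from any real data set or library.

```python
def undirected_degrees(vertices, edges):
    """Degree of every vertex when each edge (u, v) is a mutual tie."""
    degree = {v: 0 for v in vertices}
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
    return degree


def directed_degrees(vertices, arcs):
    """In-degree and out-degree of every vertex when (u, v) means u follows v."""
    in_degree = {v: 0 for v in vertices}
    out_degree = {v: 0 for v in vertices}
    for u, v in arcs:
        out_degree[u] += 1
        in_degree[v] += 1
    return in_degree, out_degree


# Hypothetical example data.
people = ["ann", "bob", "cat", "dan"]
friendships = [("ann", "bob"), ("ann", "cat"), ("bob", "cat"), ("cat", "dan")]
follows = [("ann", "bob"), ("bob", "ann"), ("cat", "ann"), ("dan", "ann")]

friend_degree = undirected_degrees(people, friendships)
average_degree = sum(friend_degree.values()) / len(people)
followers, following = directed_degrees(people, follows)

print(friend_degree)
print(average_degree)
print(followers, following)
```

In the undirected case a single degree per person is enough, while the directed case needs to distinguish how many people follow someone from how many people that person follows.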
Volume 4, Issue 2, March April 2015 Page 121

The acquaintanceship graph of all the people in the world has about seven billion vertices and probably more than one trillion edges. Many social scientists have conjectured that almost every pair of people in the world is linked by a short chain of acquaintances, perhaps no more than six links long. This would mean that almost every pair of vertices in the acquaintanceship graph containing all the people in the world is linked by a path of length not exceeding six. The play Six Degrees of Separation is based on this notion.

2.2 Using Matrices

The most common form of data representation in social network analysis is the matrix, as it is convenient and suitable for many mathematical operations. Graphs can be converted into matrices and vice versa. The most common way to store a graph G = (V, E) is the adjacency matrix. The adjacency matrix, denoted by A, contains entries that indicate whether two vertices are adjacent or not. An adjacency matrix is also called a sociomatrix. A matrix A of size $n \times n$ can describe an unweighted undirected graph G = (V, E) containing n vertices. The rows and columns of the sociomatrix both index the vertices of the graph and are labelled 1, 2, ..., n. Each entry $a_{ij}$ indicates whether the pair of vertices $v_i$ and $v_j$ is adjacent: there is a 1 in the (i, j)th cell if an edge connects $v_i$ and $v_j$, and a 0 otherwise. Because the graph is undirected, the matrix is symmetric with respect to its diagonal, so $a_{ij} = a_{ji}$ for all $i \neq j$.
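The construction of a sociomatrix can be sketched in a few lines of Python. The example below assumes vertices labelled 1 to n as in the text; the four-person network and the function names are hypothetical and serve only to illustrate the symmetry property.

```python
def adjacency_matrix(n, edges):
    """Build the n x n sociomatrix of an undirected graph.

    Vertices are labelled 1..n; `edges` is a list of (i, j) pairs.
    """
    A = [[0] * n for _ in range(n)]
    for i, j in edges:
        A[i - 1][j - 1] = 1
        A[j - 1][i - 1] = 1  # undirected: mirror the entry across the diagonal
    return A


def is_symmetric(A):
    """Check that a_ij equals a_ji for every pair of indices."""
    n = len(A)
    return all(A[i][j] == A[j][i] for i in range(n) for j in range(n))


# Hypothetical four-person network.
edges = [(1, 2), (1, 3), (2, 3), (3, 4)]
A = adjacency_matrix(4, edges)

for row in A:
    print(row)
print(is_symmetric(A))
```

Storing the matrix as nested lists keeps the sketch dependency-free; for large networks a sparse structure would normally be preferred.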
Sociomatrices are widely adopted for storing undirected network structures because of some particular properties; for example, social networks give rise to sparse sociomatrices, so compact matrix decomposition techniques can be used to store the data efficiently [4].

Another possible matrix representation of an undirected graph G = (V, E) is the incidence matrix, usually denoted by I. It records which edges are incident with which vertices, indexing the edges on the columns and the vertices on the rows, so the dimension of I is $|V| \times |E|$. A matrix entry $I_{ij}$ contains 1 if the vertex $v_i$ is incident with the edge $e_j$, and 0 otherwise. Both the incidence matrix and the adjacency matrix contain all the information required to describe the represented graph. Unlike the adjacency matrix, which is always square, the incidence matrix does not have to be square, which provides additional flexibility. Moreover, the adjacency matrix of a simple undirected graph is always symmetric, which need not be the case for an incidence matrix.

In addition, when a simple graph contains relatively few edges, i.e. when it is sparse, it is preferable to use adjacency lists rather than an adjacency matrix. For example, if a graph has n vertices and each vertex has degree at most c, where c is a constant much smaller than n, then each adjacency list contains at most c vertices. Hence there are no more than cn items in the adjacency lists, whereas the adjacency matrix for the same graph has $n^2$ entries, which is much larger than cn and therefore uses more memory. With these points in mind, considerable effort should be devoted to analysing adjacency lists and incidence matrices as candidate data structures for representing graphs of social networks.

3. PROPERTIES OF SOCIAL NETWORKS

Some properties of social networks are particularly important.
The first three properties are drawn from [5].

3.1 Diameter

Short chains of friends connect a large fraction of pairs of people in a social network, and it is well known that most real-world graphs exhibit a relatively small diameter. A graph has diameter D if every pair of nodes can be connected by a path of at most D edges. Unfortunately, the literature on social networks tends to use the word diameter ambiguously, in reference to at least four different quantities:

1) The longest shortest-path length, which is the true graph-theoretic diameter but which is infinite in disconnected networks.
2) The longest shortest-path length between connected nodes, which is always finite but which cannot distinguish the complete graph from a graph with a solitary edge.
3) The average shortest-path length.
4) The average shortest-path length between connected nodes.

3.2 Navigability

Social networks exhibit the small-world phenomenon. In the small-world model defined by Watts, Dodds, and Newman [6], navigation is also possible. Their model is based on multiple hierarchies (geography, occupation, hobbies, etc.) into which people fall, and a greedy algorithm that attempts to get closer to the target in any dimension at every step. Simulations have shown that this algorithm and model allow navigation, but no theoretical results have been established. Social networks are navigable small worlds: not only do short paths exist between most pairs of people, but using only local information and some knowledge of global structure, the people in the network are able to construct short paths to the target.

3.3 Clustering Coefficient

Informally, the clustering coefficient of a network is a measure of the probability that two people who have a common friend will themselves be friends. The clustering coefficient of a node $u \in V$ of a graph is the fraction of possible edges that exist within the neighbourhood of u, i.e., between two nodes adjacent to u.
The clustering coefficient of the entire network is the average clustering coefficient taken over all nodes in the graph.
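The diameter quantities of Section 3.1 and the clustering coefficient of Section 3.3 can be illustrated with a short Python sketch. It assumes an undirected graph stored as a dictionary of neighbour sets; the four-person network and all function names are hypothetical. The code computes the longest and the average shortest-path length over connected pairs (quantities 2 and 4 in the list of Section 3.1), which coincide with quantities 1 and 3 whenever the graph is connected. Assigning a coefficient of 0 to nodes with fewer than two neighbours is a convention chosen here, not something the text prescribes.

```python
from collections import deque
from itertools import combinations


def bfs_distances(adj, source):
    """Shortest-path lengths (in edges) from source to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist


def diameter_variants(adj):
    """Longest and average shortest-path length over connected pairs,
    plus a flag telling whether the whole graph is connected."""
    lengths = []
    connected = True
    for u in adj:
        dist = bfs_distances(adj, u)
        if len(dist) < len(adj):
            connected = False
        lengths.extend(d for v, d in dist.items() if v != u)
    if not lengths:  # graph without any edge
        return 0, 0.0, connected
    return max(lengths), sum(lengths) / len(lengths), connected


def clustering_coefficient(adj, u):
    """Fraction of possible edges that exist among the neighbours of u."""
    neighbours = adj[u]
    k = len(neighbours)
    if k < 2:
        return 0.0  # convention assumed for this sketch
    links = sum(1 for a, b in combinations(neighbours, 2) if b in adj[a])
    return links / (k * (k - 1) / 2)


def average_clustering(adj):
    """Clustering coefficient of the network: mean over all nodes."""
    return sum(clustering_coefficient(adj, u) for u in adj) / len(adj)


# Hypothetical network: a triangle (ann, bob, cat) with dan attached to cat.
adj = {
    "ann": {"bob", "cat"},
    "bob": {"ann", "cat"},
    "cat": {"ann", "bob", "dan"},
    "dan": {"cat"},
}

longest, average, connected = diameter_variants(adj)
local = {u: clustering_coefficient(adj, u) for u in adj}
network_cc = average_clustering(adj)
print(longest, average, connected)
print(local, network_cc)
```

Running one breadth-first search per node is adequate for small graphs; on large online social networks these quantities are usually estimated by sampling.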

3.4 Centrality and Power

All sociologists would agree that power is a fundamental property of social structures. There is much less agreement about what power is, and how we can describe and analyse its causes and consequences. Below we summarise some of the main approaches that social network analysis has developed to study power, and the closely related concept of centrality.

A) Degree: Degree is the number of ties an actor has, where a tie connects two nodes in a graph. Ties can be directed or undirected. Many human behaviours, such as advice seeking and information sharing, are directed ties, while co-memberships are examples of undirected ties [7].
B) Closeness: The degree to which an individual is near all other individuals in a network (directly or indirectly). It reflects the ability to access information through the grapevine of network members. Thus closeness is the inverse of the sum of the shortest distances between each individual and every other person in the network.
C) Betweenness: The extent to which a node lies between other nodes in a network. This measure takes into account the connectivity of the node's neighbours, giving a higher value to nodes that bridge clusters.
D) Density: A measure of how closely knit a network is. For a given number of nodes, the more links between them, the greater the density. If the number of nodes in a network is n and the number of links is l, then its density is given by

$$\text{density} = \frac{l}{n(n-1)}$$

for directed graphs and

$$\text{density} = \frac{2l}{n(n-1)}$$

for undirected graphs.

4. WEB MINING

As a large and dynamic information source that is structurally complex and ever growing, social networks are a fertile ground for data mining principles, or Web mining. The Web mining field encompasses a wide array of issues, primarily aimed at deriving actionable knowledge from the Web, and includes researchers from information retrieval, database technologies, and artificial intelligence [8].
Since Oren Etzioni [9], among others, formally introduced the term, authors have used Web mining to mean slightly different things. For example, Jaideep Srivastava and colleagues [10] define it as: the application of data-mining techniques to extract knowledge from Web data, in which at least one of structure or usage (Web log) data is used in the mining process (with or without other types of Web data). Web mining consists of three main categories, according to the Web data used as input: Web content mining, Web structure mining and Web usage mining.

4.1 Web Content Mining

Web content mining is the application of data mining techniques to content published on the Internet, usually as HTML (semi-structured), plain text (unstructured), or XML (structured) documents. In other words, it may be defined as the procedure of retrieving information from the Web into more structured forms and indexing that information so that it can be retrieved quickly [11]. Table 1 summarizes the concept of Web content mining.

Table 1 Web Content Mining

4.2 Web Structure Mining

Web structure mining operates on the Web's hyperlink structure. This graph structure can provide information about a page's ranking [12] or authoritativeness [13] and enhance search results through filtering. In other words, Web structure mining may be defined as the process by which we discover the model of the link structure of Web pages. Its aim is to generate a structured summary of a website and its pages. Table 2 summarises Web structure mining.

Table 2 Web Structure Mining
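To make the idea of link-based ranking more concrete, the following Python sketch runs a simplified PageRank-style power iteration in the spirit of [12]. The four-page link graph, the damping factor of 0.85, the fixed number of iterations and the function name are all assumptions made for illustration; this is a teaching version under those assumptions, not the algorithm as deployed by any search engine.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Simplified power iteration over a dict {page: [pages it links to]}.

    Every link target is assumed to appear as a key of `links`.
    """
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Every page receives a small baseline share of rank.
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for p, outgoing in links.items():
            if outgoing:
                share = damping * rank[p] / len(outgoing)
                for q in outgoing:
                    new_rank[q] += share
            else:
                # Page without outgoing links: spread its rank over all pages.
                for q in pages:
                    new_rank[q] += damping * rank[p] / n
        rank = new_rank
    return rank


# Hypothetical four-page site.
links = {
    "home": ["about", "blog"],
    "about": ["home"],
    "blog": ["home", "about"],
    "orphan": ["home"],
}

rank = pagerank(links)
for page, score in sorted(rank.items(), key=lambda item: -item[1]):
    print(page, round(score, 4))
```

Pages that many other pages link to accumulate rank, while a page with no incoming links keeps only the baseline share.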

4.3 Web Usage Mining

Web usage mining analyses the results of user interactions with a Web server, including Web logs, clickstreams, and database transactions at a Web site or a group of related sites. It is used to identify browsing patterns by analysing the navigational behaviour of users. Web usage mining tries to make sense of the data generated by Web surfers' sessions or behaviours, whereas Web content mining and Web structure mining utilize the real or primary data on the Web. Web usage mining raises privacy concerns and is currently the topic of extensive debate.

Table 3 Web Usage Mining

5. WEB MINING TECHNIQUES FOR SOCIAL NETWORK ANALYSIS

5.1 Association Rules

Association rules are if/then statements that help uncover relationships between seemingly unrelated data in a transactional database, relational database or other information repository. Association rule mining was first introduced in [14]. It is a data mining technique whose goal is to find interesting relationships among a large set of data items: correlations, frequent patterns, associations or casual structures among sets of items in transaction databases or other data repositories. A survey of association rule mining is presented in [15]. Simple marginal and conditional probabilities are insufficient to tell us about causal relationships; more sophisticated techniques are required. Association rule mining is easy to use and implement. The patterns discovered with this technique can be represented in the form of association rules. Rule support and confidence are two measures of rule interestingness. Typically, association rules are considered interesting if they satisfy both a minimum support threshold and a minimum confidence threshold. Such thresholds can be set by users or domain experts.
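The two measures can be sketched as follows, using the usual definitions: the support of a rule X -> Y is the fraction of transactions that contain every item of X and Y, and its confidence is the support of X and Y together divided by the support of X alone. In this Python example each transaction is the set of online communities one user belongs to; the data, the thresholds and the function names are hypothetical.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item of `itemset`."""
    itemset = set(itemset)
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)


def confidence(transactions, antecedent, consequent):
    """Support of antecedent and consequent together, relative to the antecedent."""
    both = support(transactions, set(antecedent) | set(consequent))
    base = support(transactions, antecedent)
    return both / base if base else 0.0


def is_interesting(transactions, antecedent, consequent,
                   min_support, min_confidence):
    """A rule is kept only if it passes both user-chosen thresholds."""
    s = support(transactions, set(antecedent) | set(consequent))
    c = confidence(transactions, antecedent, consequent)
    return s >= min_support and c >= min_confidence


# Hypothetical community memberships of five users.
memberships = [
    {"chess", "python"},
    {"chess", "python", "hiking"},
    {"chess", "hiking"},
    {"python"},
    {"chess", "python"},
]

# Rule: "if a user is in the chess group, then the user is in the python group".
s = support(memberships, {"chess", "python"})       # 3 of the 5 users
c = confidence(memberships, {"chess"}, {"python"})  # 3 of the 4 chess members
print(s, c)
print(is_interesting(memberships, {"chess"}, {"python"}, 0.5, 0.7))
```

Raising either threshold prunes more rules, which is how users or domain experts control the number of rules reported.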
In social network analysis, association rule mining can help discover hidden relationships between the different nodes of a network.

5.2 Classification

Classification builds (automatically) a model that can classify a set of objects so as to predict the class or missing attribute value of future objects whose class may not be known [16]. It is a two-step process. In the first step, a model is constructed from a collection of training data to describe the characteristics of a set of data classes or concepts. Since the data classes or concepts are predefined (i.e., the class to which each training sample belongs is provided), this step is also known as supervised learning. In the second step, the model is used to predict the classes of future objects or data.

There are many techniques for classification. Classification by decision tree has been well researched and many algorithms have been developed. A complete survey of classification using decision trees is given in [17]. Bayesian classification is another technique, described in [18]. Nearest neighbour methods are also discussed in many statistical texts on classification, such as [19]. Many other machine learning and neural network techniques are used for constructing classification models.

5.3 Clustering

Clustering is similar to classification, but it is an unsupervised learning process whereas classification is a supervised one. Clustering is the process of grouping a set of physical or abstract objects into classes of similar objects, so that objects within the same cluster are similar to some extent and dissimilar to objects in other clusters [16]. In classification, the class to which each record belongs is predefined, whereas in clustering there are no predefined classes; objects are grouped together based on their similarities.
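Grouping without predefined classes can be sketched with a very small example. The Python code below assumes that each user is described by a set of friends, measures similarity with the Jaccard coefficient, and places two users in the same cluster whenever a chain of sufficiently similar users links them (single-linkage grouping at a fixed threshold). The choice of similarity function, the threshold of 0.3, the data and the function names are illustrative assumptions, not a method proposed in the cited literature.

```python
def jaccard(a, b):
    """Similarity of two sets: size of the intersection over size of the union."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def threshold_clusters(profiles, threshold):
    """Group the keys of `profiles` that are linked by chains of
    pairwise similarities of at least `threshold`."""
    unvisited = set(profiles)
    clusters = []
    while unvisited:
        seed = unvisited.pop()
        cluster, frontier = {seed}, [seed]
        while frontier:
            u = frontier.pop()
            near = {v for v in unvisited
                    if jaccard(profiles[u], profiles[v]) >= threshold}
            unvisited -= near
            cluster |= near
            frontier.extend(near)
        clusters.append(cluster)
    return clusters


# Hypothetical friend lists.
friends = {
    "ann": {"bob", "cat", "dan"},
    "bob": {"ann", "cat", "dan"},
    "eve": {"fay", "gus"},
    "fay": {"eve", "gus"},
    "hal": {"zoe"},
}

clusters = threshold_clusters(friends, threshold=0.3)
print(sorted(sorted(c) for c in clusters))
```

No class labels are supplied anywhere: the groups emerge only from the similarity values, which is what distinguishes this from the supervised setting of Section 5.2.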
Similarity between objects is defined by similarity functions; usually, similarities are specified quantitatively as distances or other measures by domain experts. A survey of clustering techniques and algorithms can be found in [20]. In social network analysis, discovering the closest people in the network is usually the main task, and in a small social network it is generally achieved using a visualization technique. Clustering may therefore emerge as a potential technique for identifying more clusters and groups in large social networks. Moreover, it can offer more detailed information than visualisation [21], including the closeness of a group, detailed information about the members of a group and the relationships between groups in a social network.

6. CONCLUSION AND FUTURE RESEARCH

As an approach to social network research, social network analysis displays four features: structural intuition, systematic relational data, graphic images, and mathematical or computational models [22]. Here we have tried to present a holistic approach by considering all of the above features. This paper studies the formal methods of representing social networks and the various properties of these networks. In particular, the representation of social networks using incidence matrices was considered from a mathematical perspective. To reach a formal conclusion, much research remains to be done comparing adjacency matrices and adjacency lists with incidence matrices, using graphs obtained from online social networks as input data; this will be part of our future work. In addition, the computational cost of association rule mining is a disadvantage, and future work will focus on reducing it. Finally, we hope that this study helps readers understand social networks and enriches the study of applications of Web mining techniques to these networks.

REFERENCES

[1] Alexa, Alexa the Web Information Company, 2014. [Online; accessed Dec. 25, 2014]. Available: www.alexa.com/comparison/
[2] R. A. Hanneman and M. Riddle, Introduction to Social Network Methods, 2005. [Online]. Available: http://www.faculty.ucr.edu/~hanneman/nettext/
[3] K. H. Rosen, Discrete Mathematics & its Applications with Combinatorics and Graph Theory, TMH, 2012.
[4] V. Snasel, Z. Horak, J. Kocibova, A. Abraham, Reducing social network dimensions using matrix factorization methods, International Conference on Advances in Social Network Analysis and Mining, pp. 348-351, IEEE, 2009.
[5] D. Liben-Nowell, An Algorithmic Approach to Social Networks, Ph.D. Thesis, Massachusetts Institute of Technology, June 2005.
[6] D. J. Watts, P. Sheridan Dodds, and M. E. J. Newman, Identity and Search in Social Networks, Science, 296:1302-1305, 17 May 2002.
[7] G. Plickert, R. Cote, B. Wellman, It's Not Who You Know, It's How You Know Them: Who Exchanges What with Whom?, Social Networks, Vol. 29, No. 3, pp. 405-429, 2007.
[8] P. Kolari, A. Joshi, Web Mining: Research and Practice, IEEE CS, July-August 2004.
[9] O. Etzioni, The World Wide Web: Quagmire or Gold Mine?, Comm. ACM, vol. 39, no. 11, pp. 65-68, 1996.
[10] J. Srivastava, P. Desikan, and V. Kumar, Web Mining: Accomplishments and Future Directions, Proc. US National Science Foundation Workshop on Next-Generation Data Mining (NGDM), National Science Foundation, 2002.
[11] Z. S. Zubi, Ranking Web Pages Using Web Structure Mining Concepts, Recent Advances in Telecommunications, Signals and Systems, 2010.
[12] L. Page et al., The PageRank Citation Ranking: Bringing Order to the Web, Tech. report, Stanford Digital Library Technologies, Jan. 1998.
[13] J. Kleinberg, Authoritative Sources in a Hyperlinked Environment, Proc. 9th Ann. ACM-SIAM Symp. Discrete Algorithms, ACM Press, pp. 668-677, 1998.
[14] R. Agrawal, T. Imielinski, and A. N. Swami, Mining Association Rules between Sets of Items in Large Databases, Proceedings of the ACM SIGMOD International Conference on Management of Data, Washington, D.C., pp. 207-216, 1993.
[15] Q. Zhao, S. Bhowmick, Association Rule Mining: A Survey, Technical Report No. 2003116, CAIS, Nanyang Technological University, Singapore, 2003.
[16] J. Han, M. Kamber, Data Mining: Concepts and Techniques, Morgan Kaufmann, 2000.
[17] S. Murthy, Automatic construction of decision trees from data: A multi-disciplinary survey, Data Mining and Knowledge Discovery, vol. 2, no. 4, pp. 345-389, 1998.
[18] R. Duda, P. Hart, Pattern Classification and Scene Analysis, Wiley & Sons, Inc., 1973.
[19] M. James, Classification Algorithms, Wiley & Sons, Inc., 1985.
[20] P. Berkhin, Survey of clustering data mining techniques, Technical Report, Accrue Software, San Jose, CA, 2002.
[21] B. Tatemura, Y. Wu, Tomographic Clustering to Visualize Blog Communities as Mountain Views, Proc. of WWW Conference, Japan, 10-14 May 2005.
[22] B. Furht (ed.), Handbook of Social Network Technologies and Applications, Springer, 2010.

AUTHOR

Hilal Ahmad Khanday received his Bachelors and Masters in Computers from University of Kashmir and IUST respectively.
He has qualified in several national examinations conducted by UGC and MHRD, and is currently a research scholar and faculty member at the University of Kashmir.

Dr. Rana Hashmy is Scientist C at University of Kashmir. She received a Ph.D. (Computer Science) from Jawaharlal Nehru University, Delhi (INDIA). Her areas of research interest include Software Engineering, Data Warehousing, and Data Mining. She has many research publications related to her fields of interest.