Mining Social Media with Social Theories: A Survey

Jiliang Tang, Computer Science & Eng, Arizona State University, Tempe, AZ, USA, [email protected]
Yi Chang, Yahoo!Labs, Yahoo!Inc, Sunnyvale, CA, USA, [email protected]
Huan Liu, Computer Science & Eng, Arizona State University, Tempe, AZ, USA, [email protected]

ABSTRACT
The increasing popularity of social media encourages more and more users to participate in various online activities and produces data at an unprecedented rate. Social media data is big, linked, noisy, highly unstructured and incomplete, and differs from data in traditional data mining, which cultivates a new research field: social media mining. Social theories from social sciences are helpful to explain social phenomena. The scale and properties of social media data are very different from those of the data social sciences use to develop social theories. As a new type of social data, social media data raises a fundamental question: can we apply social theories to social media data? Recent advances in computer science provide the necessary computational tools and techniques for us to verify social theories on large-scale social media data. Social theories have been applied to mining social media. In this article, we review some key social theories in mining social media, their verification approaches, interesting findings, and state-of-the-art algorithms. We also discuss some future directions in this active area of mining social media with social theories.

1. INTRODUCTION
Social media greatly enables people to participate in online activities and shatters the barrier for online users to share and consume information in any place at any time. Social media users can be both passive content consumers and active content producers, and generate data at an unprecedented rate. The nature of social media determines that its data significantly differs from the data in traditional data mining.
Social relations are pervasively available in social media data, and play important roles in social media such as mitigating the information overload problem [38; 51] and promoting the information propagation process [4; 67]. Social media data is big, noisy, incomplete, highly unstructured and linked with social relations. These unique properties of social media data suggest that naively applying existing techniques may fail or lead to inappropriate understandings about the data. For example, social media data is linked via social relations, which contradicts the underlying independent and identically distributed (IID) assumption of the vast majority of existing techniques [23; 57]. This new type of data calls for novel data mining techniques for a better understanding from the computational perspective. The study and development of these new techniques are under the purview of social media mining, which is the process of representing, analyzing, and extracting actionable patterns from social media data [70].

[Figure 1: Social Theories in Social Media Mining. Recoverable labels: Social Media Data (Social Networks, User Generated Content); Social Theories (Social Correlation, Balance Theory, Status Theory); Social Media Mining Tasks (User-based: Community Detection, User Classification, Spammer Detection; Relation-based: Link Prediction, Tie Prediction, Tie Strength Prediction; Content-based: Recommendation, Feature Selection, Sentiment Analysis).]

There are many social theories developed from social sciences to explain various types of social phenomena. For example, the homophily theory [40] suggests how individuals connect to each other, while balance theory suggests that users in a social network tend to form a balanced network structure [17]. The scale of the data social scientists employ to develop these social theories is very different from that of social media data. It is easy for social media data to include the actions and interactions of hundreds of millions of individuals in real time as well as over time.
Therefore there is a fundamental question for this new type of social data: can we apply some social theories to social media data? If we can, social theories can help us understand social media data from a social perspective, and combining social theories with computational methods offers a novel and effective perspective to mine social media data, as shown in Figure 1. Social theories help bridge the gap from what we have (social media data) to what we want to understand about social media data (social media mining). Integrating social theories with computational models becomes an interesting direction in mining social media data and encourages a large body of literature in this line. The goal of this article is to provide a review of some key social

theories in mining social media data. The contributions and organization of this article are summarized as below:

- The social property of social media data determines that it differs from data in traditional data mining and social sciences. In Section 2, we provide an overview of the unique properties of social media data;
- An increasing number of social theories have been verified in social media data. In Section 3, we focus on three key and widely used social theories with basic concepts, verification approaches and key findings;
- The fast growing interest in and intensifying need to harness social media data make social media mining grow rapidly. Integrating social theories with computational methods becomes a principled way to mine social media data. In Section 4, we review the state-of-the-art algorithms that exploit social theories in mining social media, and summarize feature engineering, constraint generating and objective defining as three ways to explain social theories for computational models;
- Mining social media data with social theories is still an active area of exploration, and there could be more existing social theories to be employed or new social theories to be discovered from this new type of social data. In Section 5, we discuss some open issues and possible research directions.

2. SOCIAL MEDIA DATA IS SOCIAL
Social relations are pervasively available, and the social property of social media data determines that social media data is substantially different from data in traditional data mining and social sciences. In this section, we discuss some unique properties of social media data. Before the details, we first introduce some notation used in this article. Let $U = \{u_1, u_2, \ldots, u_n\}$ and $P = \{p_1, p_2, \ldots, p_m\}$ be the sets of $n$ users and $m$ items (or pieces of user generated content). We use $S \in \mathbb{R}^{n \times n}$, $R \in \mathbb{R}^{n \times m}$ and $C \in \mathbb{R}^{m \times K}$ to denote the user-user relation, user-content interaction and content-feature matrices, where we extract a set of $K$ features $F$ to represent the content set $P$.
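
To make the notation concrete, the following minimal sketch builds a toy instance of the three matrices as plain Python lists. All values, the sizes n = 3, m = 4, K = 2, and the helper names `shape` and `items_linked_to` are made up for illustration and do not come from any dataset discussed in this article.

```python
# Toy instance of the notation: n = 3 users, m = 4 items, K = 2 features.
# Every value below is invented purely for illustration.
n, m, K = 3, 4, 2

# S[i][j] = 1 if user i has a relation with user j (user-user matrix, n x n).
S = [
    [0, 1, 0],
    [1, 0, 1],
    [0, 1, 0],
]

# R[i][p] = 1 if user i interacted with (e.g., posted) item p (n x m).
R = [
    [1, 1, 0, 0],
    [0, 0, 1, 0],
    [0, 0, 0, 1],
]

# C[p][k] = value of feature k for item p (content-feature matrix, m x K).
C = [
    [0.5, 0.0],
    [0.1, 0.9],
    [0.0, 1.0],
    [0.7, 0.3],
]


def shape(matrix):
    """Return (rows, columns) of a rectangular list-of-lists matrix."""
    return (len(matrix), len(matrix[0]))


def items_linked_to(i):
    """Items of user i's neighbours, i.e. content linked to user i via S."""
    neighbours = [j for j in range(n) if S[i][j]]
    return sorted({p for j in neighbours for p in range(m) if R[j][p]})
```

The helper `items_linked_to` illustrates the "linked" property discussed next: content is connected to a user through the social relations stored in S, not only through that user's own row of R.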
Big: In social media, we have little data for each specific individual. However, the social property of social media data links individuals' data together, which provides a new type of big data. For example, more than 300 million tweets are sent to Twitter per day, more than 3,000 photos are uploaded to Flickr per minute, and more than 153 million blogs are posted per year.

Linked: The availability of social relations determines that social media data is inherently linked [52]. An illustrative example is shown in Figure 2, where pieces of user generated content ($p_1$ to $p_8$) are linked via social relations among users ($u_1$ to $u_4$). Linked social media data is patently not independent and identically distributed, which contradicts one of the most enduring and deeply buried assumptions of traditional data mining and machine learning methods [23; 57].

[Figure 2: Linked Social Media Data.]

Noisy: A successful data mining exercise entails extensive data preprocessing and noise removal, following the principle of "garbage in, garbage out". However, social media data can contain a large portion of noisy data. Users in social media can be both passive content consumers and active content producers, causing the quality of user generated content to vary drastically [1]. The noise issues of social media data do not stop here. The social networks in social media are also noisy. First, some social media users work as spammers to spread malicious or unwanted messages [47]. Second, the low cost of link formation leads to acquaintances and best friends being mixed together [65].

Unstructured: User generated content in social media is often highly unstructured. Nowadays more and more users use their mobile devices to publish content, such as updating statuses in Facebook, sending tweets in Twitter and commenting on posts, which results in (1) short texts and (2) typos and spacing errors occurring very frequently [25]. Free-form languages are widely adopted by social media users in online communication, such as ASCII art (e.g., :) and :( ) and abbreviations (e.g., h r u?) [46].
The short and highly unstructured social media data challenges the vast majority of existing techniques.

Incomplete: Users' attributes are predictable from their personal data [26]. To address such privacy concerns, social media services often allow their users to use their profile settings to make their personal data, such as demographic profiles, status updates, lists of friends, videos, photos, and interactions on posts, invisible to others. For example, a very small portion of Facebook users (< 1%) make their personal data publicly available [41]. The available social media data can therefore be incomplete and extremely sparse. For example, for social recommendation, more than 99% of the entries in the user-content interaction matrix $R$ are missing [51].

3. SOCIAL THEORIES
Social theories from social sciences are useful to explain various types of social phenomena. In social media, it is increasingly possible for us to observe social data from hundreds of millions of individuals. Given its large scale and social property, a natural question is: can we apply social theories to social media data? More and more social theories have been shown to be applicable to social media data. In this section, we concentrate on three important social theories with basic concepts, ways to verify them and key findings.

3.1 Social Correlation Theory
Social correlation theory is one of the most important social theories, and it suggests that there exist correlations between behaviors or attributes of adjacent users in a social network. Homophily, influence and confounding are three major social processes that explain these correlations, as shown in Figure 3.

Homophily explains our tendency to connect to others that share certain similarity with us. For example, birds of a feather flock together.

Influence suggests that people tend to follow the behaviors of their friends, so adjacent users are likely to exhibit similar behaviors. For example, if most of one's friends switch to a mobile phone company, he or she could be influenced by them and switch, too.

Confounding is a correlation between users that can also be forged due to external influences from the environment. For example, two individuals living in the same city are more likely to become friends than two random individuals.

[Figure 3: Major Forces of Social Correlation. Panels: A: Homophily, B: Influence, C: Confounding. Labels: Individual Characteristics, Social Relations, Environment.]

To verify the applicability of social correlation theory to social media data, essentially we need to answer the following question: are users with social relations more similar than those without? To answer this question, for each social relation from $u_i$ to $u_j$, we calculate two similarities $s_{ij}$ and $r_{ik}$, where $s_{ij}$ is the similarity between $u_i$ and $u_j$, while $r_{ik}$ is the similarity between $u_i$ and a randomly chosen user $u_k$ who does not connect to $u_i$. Let $S$ be the set of all $s_{ij}$, which denotes the set of similarities of pairs of connected users. Let $R$ be the set of all $r_{ik}$, which represents the set of similarities of pairs of randomly chosen users. We perform a t-test on $S$ and $R$. The null hypothesis is that similarities with social relations are no larger than those without, i.e., $H_0: S \le R$; the alternative hypothesis is that the similarities with social relations are larger than those without, i.e., $H_1: S > R$. If there is strong evidence to reject the null hypothesis, we verify that social correlation theory is applicable to social media data.

Via the above verification process, social correlation theory has been shown to be applicable to various social media sites. Twitter users with following relations are likely to share similar topics or opinions [63; 20].
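
The t-test procedure described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the exact procedure of any cited study: similarity is taken to be cosine similarity over hypothetical user profile vectors, relations are treated as undirected when picking a non-neighbour, the test is Welch's unequal-variance form, and the p-value uses a normal approximation that is only reasonable for large samples. The function names are invented for this sketch.

```python
import math
import random
from statistics import NormalDist, mean, variance


def cosine(a, b):
    """Cosine similarity of two equal-length numeric vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def collect_similarities(edges, profiles, rng):
    """For each relation (i, j), pair s_ij with r_ik for a random non-neighbour k."""
    users = list(profiles)
    neighbours = {u: set() for u in users}
    for i, j in edges:
        neighbours[i].add(j)
        neighbours[j].add(i)
    connected, rand = [], []
    for i, j in edges:
        candidates = [k for k in users if k != i and k not in neighbours[i]]
        if not candidates:
            continue  # user i is connected to everyone; no random pair exists
        k = rng.choice(candidates)
        connected.append(cosine(profiles[i], profiles[j]))
        rand.append(cosine(profiles[i], profiles[k]))
    return connected, rand


def one_sided_welch_test(connected_sims, random_sims):
    """Return (t, p) for H0: mean(connected) <= mean(random) vs H1: greater.

    The p-value is a large-sample normal approximation (an assumption).
    Each sample needs at least two values.
    """
    n_s, n_r = len(connected_sims), len(random_sims)
    se = math.sqrt(variance(connected_sims) / n_s + variance(random_sims) / n_r)
    t = (mean(connected_sims) - mean(random_sims)) / se
    p = 1.0 - NormalDist().cdf(t)
    return t, p
```

A small p-value is evidence against the null hypothesis, which is the condition under which the text above considers social correlation theory verified for a dataset.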
Users in Epinions with trust relations are likely to rate the same items with similar scores [49]. In [52], we show that users in Digg and BlogCatalog with social relations are likely to join groups of similar interests. In location-based social networks such as Foursquare, users with social relations are likely to check in at the same locations [69; 14].

3.2 Balance Theory
In general, balance theory captures the intuition that "the friend of my friend is my friend" and "the enemy of my enemy is my friend" [17]. Basically, it considers the balance of signs on a triad involving three users in a social network with positive and negative links [28; 27]. We use $s_{ij}$ to denote the sign of the relation between $u_i$ and $u_j$, where $s_{ij} = 1$ (or $s_{ij} = -1$) if we observe a positive relation (or a negative relation) between $u_i$ and $u_j$. Balance theory suggests that a triad $\langle u_i, u_j, u_k \rangle$ is balanced if $s_{ij} = 1$ and $s_{jk} = 1$ imply $s_{ik} = 1$, or if $s_{ij} = -1$ and $s_{jk} = -1$ imply $s_{ik} = 1$. For a triad $\langle u_i, u_j, u_k \rangle$, there are four possible sign combinations, A(+,+,+), B(+,+,-), C(+,-,-) and D(-,-,-), as shown in Figure 4, while only A(+,+,+) and C(+,-,-) are balanced.

[Figure 4: An Illustration of Balance Theory. Panels: A: (+,+,+), B: (+,+,-), C: (+,-,-), D: (-,-,-).]

The way to verify balance theory is straightforward. We examine all triads $\langle u_i, u_j, u_k \rangle$ and check the ratio of A(+,+,+) and C(+,-,-) among all four possible sign combinations. A high ratio suggests that balance theory is applicable to social media data. We check the distributions of the four possible sign combinations on three widely used social media datasets with signed networks (i.e., Epinions, Slashdot and Wikipedia) in [27]. The ratios of A(+,+,+) and C(+,-,-) among all four possible sign combinations are 0.941, 0.912, and [value missing] in Epinions, Slashdot and Wikipedia, respectively. More than 90% of triads are balanced. Similar observations on other social media datasets are reported by [68; 56].
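
The triad-counting check described above can be sketched as below. This is an illustrative brute-force version that enumerates all node triples, so it suits only small graphs; the input format (a dict keyed by `frozenset` pairs) and the function name are assumptions made for this sketch.

```python
from itertools import combinations


def balanced_triad_ratio(signed_edges):
    """signed_edges maps frozenset({u, v}) -> +1 or -1 (undirected signed links).

    Returns (ratio, counts), where counts maps the number of negative links
    in a closed triad (0, 1, 2 or 3) to how many triads have it. Types A
    (0 negatives) and C (2 negatives) are the balanced ones.
    """
    nodes = sorted({u for edge in signed_edges for u in edge})
    counts = {0: 0, 1: 0, 2: 0, 3: 0}
    for i, j, k in combinations(nodes, 3):
        pairs = [frozenset((i, j)), frozenset((j, k)), frozenset((i, k))]
        if all(p in signed_edges for p in pairs):  # only closed triads count
            negatives = sum(1 for p in pairs if signed_edges[p] < 0)
            counts[negatives] += 1
    total = sum(counts.values())
    ratio = (counts[0] + counts[2]) / total if total else float("nan")
    return ratio, counts


# Toy signed network on four users; the signs are invented for illustration.
toy = {
    frozenset(("a", "b")): 1,
    frozenset(("b", "c")): 1,
    frozenset(("a", "c")): 1,
    frozenset(("b", "d")): -1,
    frozenset(("a", "d")): -1,
    frozenset(("c", "d")): 1,
}
toy_ratio, toy_counts = balanced_triad_ratio(toy)
```

In the toy network, one triad is of type A, one of type C and two of type B, so half of the triads are balanced. Real signed networks need a neighbour-based triangle enumeration rather than this cubic loop.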
Note that balance theory is developed for undirected social networks, and we usually ignore link directions when applying balance theory to directed social networks [27].

3.3 Status Theory
Different from balance theory, status theory is developed for directed social networks [28]. Social status refers to the position or rank of a user in a social community, and represents the degree of honor or prestige attached to the position of each individual. In status theory, a positive link from $u_i$ to $u_j$ indicates that $u_j$ has a higher status than $u_i$, while a negative link from $u_i$ to $u_j$ indicates that $u_j$ has a lower status than $u_i$. For a triad, status theory suggests that if we take each negative relation, reverse its direction, and flip its sign to positive, then the resulting triangle (with all positive edge signs) should be acyclic.

In [28], contextualized links are introduced to verify status theory. A contextualized link is defined to be a triple $\langle u_i, u_j, u_k \rangle$ with the property that a link forms from $u_i$ to $u_j$ after each of $u_i$ and $u_j$ already has a link either to or from $u_k$. The link between $u_k$ and $u_i$ can go in either direction and have either sign, yielding four possibilities, and similarly for the link between $u_k$ and $u_j$; hence overall there are $4 \times 4 = 16$ different types of contextualized links. Figure 5 demonstrates 4 of the 16 types of contextualized links, where (A)

and (D) satisfy status theory, while (B) and (C) do not.

[Figure 5: An Illustration of Four Out of Sixteen Types of Contextualized Links for Status Theory. Note that + (or -) denotes that the target node has a higher (or lower) status than the source node.]

For each of these types of contextualized links, we can count the frequencies of positive versus negative links for the links from $u_i$ to $u_j$ and then calculate the ratio of contextualized links satisfying status theory. In [53], it is reported that 99% of triads in the Enron social network and the advisor-advisee social network satisfy status theory. Similar patterns are observed on the Epinions and Wikipedia datasets in [28].

3.4 Discussion
The scale and properties of social media data substantially differ from those of the data used by social sciences to develop social theories. Since social media data is a new type of social data, it is possible to apply some social theories to explain phenomena in social media data. The verification of social theories in social media data not only paves a way for us to understand social media data from a social perspective but also suggests that it is highly possible to facilitate social media mining tasks by integrating social theories with computational methods.

4. SOCIAL THEORIES IN SOCIAL MEDIA MINING TASKS
Social media mining is an emerging discipline under the umbrella of data mining and has grown rapidly in recent years [70]. The verification of some social theories in social media data suggests that we should put "social" in social media mining, and encourages a large body of literature to model and exploit social theories to advance social media mining tasks. In general, there are three types of objects in social media data: users, social relations and user generated content. This allows us to roughly classify social media mining tasks into three groups based on the mining objects: user-based tasks, relation-based tasks and content-based tasks.
Next we elaborate on each group through representative tasks, with their definitions, challenges and the state-of-the-art algorithms that apply social theories to these specific tasks.

4.1 Social Theories in User-Related Tasks
For individuals, a better understanding of their social networks can help them share and collect reliable information more effectively and efficiently. For social media service providers, a better understanding of their customers can help them provide better services. User-related tasks provide necessary and effective means to understand social media users. In this subsection, we review social theories in some key user-related tasks.

4.1.1 Community Detection
Communities in social media can be explicit, such as Yahoo! Groups. However, in many social media sites, communities are implicit and their members are obscure to social media users. Community detection is proposed to find these implicit communities in social media by identifying groups of users that are more densely connected to each other than to the rest of the network [55]. Detecting implicit communities can benefit many social media mining tasks such as social targeting and personalization. The major difference between clustering in data mining and community detection is that in community detection, individuals are connected to others in social networks, while in clustering, data points are not embedded in a network and are assumed to be independent and identically distributed. Formally, for a social network $G(U, S)$, community detection is to find a set of communities $C$ where users are more densely connected within a community than to the rest of the users.

Homophily suggests that similar users are likely to be linked, and influence indicates that linked users will influence each other and become more similar. The tendency, suggested by social correlation theory, to create new ties based on similarity gives rise to macro patterns of associations, also known as communities [7]. Two users in the same community have higher similarity [44].
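
One common way to quantify "more densely connected within a community than to the rest" is the modularity score of a partition: the fraction of links that fall inside communities minus the fraction expected if links were placed at random while preserving user degrees. The sketch below computes this score for an undirected, unweighted network; the edge-list input format and the function name are assumptions made for illustration, and it only scores a given partition rather than searching for the best one.

```python
def modularity(edges, community):
    """Newman modularity Q of a partition of an undirected, unweighted graph.

    edges: list of (u, v) pairs without duplicates or self-loops.
    community: dict mapping every node to a community label.
    """
    m = len(edges)
    degree = {}
    for u, v in edges:
        degree[u] = degree.get(u, 0) + 1
        degree[v] = degree.get(v, 0) + 1
    # Fraction of links that fall inside communities.
    inside = sum(1 for u, v in edges if community[u] == community[v]) / m
    # Expected fraction under degree-preserving random link placement.
    degree_sum = {}
    for node, d in degree.items():
        label = community[node]
        degree_sum[label] = degree_sum.get(label, 0) + d
    expected = sum((d / (2 * m)) ** 2 for d in degree_sum.values())
    return inside - expected


# Two triangles joined by a single link (an invented toy network).
toy_edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
two_groups = {0: "x", 1: "x", 2: "x", 3: "y", 4: "y", 5: "y"}
one_group = {node: "x" for node in range(6)}
q_two = modularity(toy_edges, two_groups)
q_one = modularity(toy_edges, one_group)
```

Splitting the toy network into its two triangles scores higher than lumping all users together, which is the behaviour a community detection objective is expected to reward.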
The modularity maximization method maximizes the sum, over pairs of users in the same community, of the actual number of social relations between the two users minus the expected number of social relations between them, since two users in the same community should have a higher probability of establishing a relation than two randomly chosen users [43]. Wang et al. [60] find that users within a community are likely to share similar tags in social tagging systems, and they take advantage of the bipartite network between users and tags in social tagging systems to discover these overlapping communities. In [66], a density-based framework is proposed with the intuition that users in the same community should interact more frequently with each other.

Recently, applying balance theory to detect communities in signed networks has attracted increasing attention. In [11], a generalized balance theory is proposed where a network is k-balanced iff users can be partitioned into k subsets such that positive links lie within the subsets and negative links lie between them. Balance theory suggests that the assignment of users related by negative links should be done the opposite way to that of positive links, with negative links sparse within and denser between communities; therefore the Potts model is extended to incorporate both positive and negative links to detect communities in signed networks [58]. In [2], a two-objective approach is proposed for community detection in signed networks based on balance theory. One objective is that the partitioning should have dense positive intra-connections and sparse negative inter-connections, and the other is that it should have as few negative intra-connections and positive inter-connections as possible.

4.1.2 User Classification

Due to privacy concerns, social media users tend to hide their profiles. For social media service providers, users' profile information is useful for customizing their services to the users in many ways, such as friend and content recommendations and personalized search. The more they know about users and their preferences, the better they can serve them. Given a social network and some user information (attributes, preferences or behaviors), user classification is designed to infer the information of other users in the same network [15]. In the user classification problem, users in $U$ are partially labeled as $U = [U^L, U^U]$, where $U^L$ and $U^U$ are the sets of labeled and unlabeled users, respectively. Formally, the task of user classification is to label the users in $U^U$ from a finite set of categorical values, given the social network $G(U, S)$ and $U^L$.

Social correlation theory suggests that the labels of linked users should be correlated, which is the major reason why researchers believe that the labels of $U^U$ can be predicted with the network structure and the partially labeled users [15]. Social correlation theory is the underlying assumption of most existing user classification methods, which design algorithms for collective classification. A typical user classification algorithm includes some or all of the following three components [37]:

- A local classifier: it is used for initial label assignment;
- A relational classifier: it learns a classifier from the labels of a user's neighbors to the label of that user, as suggested by social correlation theory; and
- Collective classification: it applies the relational classifier to each node iteratively until the inconsistency between neighboring labels is minimized.

In [36], a weighted-vote relational neighborhood classifier (wvRN) is introduced for user classification. wvRN is like a lazy learner and estimates the labels of users as the weighted mean of their neighbors.
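
The weighted-mean idea behind wvRN can be sketched for the binary case as follows. This is a simplified illustration and not necessarily the exact procedure of [36]: the class prior of the labeled users stands in for the local classifier, and a fixed number of synchronous update rounds stands in for collective classification. The data layout and the parameter `n_iter` are assumptions made for this sketch.

```python
def wvrn(neighbours, known, n_iter=50):
    """Estimate P(label = 1) for unlabeled users with a wvRN-style update.

    neighbours: dict user -> dict of neighbour -> non-negative link weight;
                every neighbour must also appear as a key.
    known: non-empty dict user -> 0.0 or 1.0 for labeled users (kept fixed).
    """
    # Local step: unlabeled users start from the class prior of labeled ones.
    prior = sum(known.values()) / len(known)
    score = {u: known.get(u, prior) for u in neighbours}
    # Collective step: repeat the relational (weighted-vote) update.
    for _ in range(n_iter):
        updated = dict(score)
        for u, nbrs in neighbours.items():
            if u in known:
                continue
            total = sum(nbrs.values())
            if total > 0:
                updated[u] = sum(w * score[v] for v, w in nbrs.items()) / total
        score = updated
    return score


# A chain a - b - c - d where only the two end users are labeled.
chain = {
    "a": {"b": 1.0},
    "b": {"a": 1.0, "c": 1.0},
    "c": {"b": 1.0, "d": 1.0},
    "d": {"c": 1.0},
}
chain_scores = wvrn(chain, {"a": 1.0, "d": 0.0})
```

On the chain, the user next to the positively labeled end receives a higher score than the user next to the negatively labeled end, which is the correlation between linked users that the method relies on.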
In [34], the proposed framework first creates relational features of a user by aggregating the label information of its neighbors, and then a relational classifier can be constructed based on labeled data. Neville and Jensen in [42] propose to use clustering algorithms to find out the cluster memberships of each user first, and then fix the latent group variables for later inference. Xiang et al. [64] propose a novel latent relational model based on copulas. It can make predictions in a discrete label space while ensuring identical marginals and at the same time incorporating some desirable properties of modeling relational dependencies in a continuous space. A community-based framework is proposed in [54]. It first extracts overlapping communities based on the social network structure, then uses communities as features to represent users, and finally trains a traditional classifier such as SVM to assign labels to unlabeled users in the same network.

4.1.3 Social Spammer Detection
Social media has become an important and efficient way to disseminate information. Given its popularity and ubiquity, social spammers create many fake accounts and send out unsolicited commercial content [62]. Social spammers have become rampant and the volume of spam has increased dramatically. For example, 83% of the users of social networks have received at least one unwanted friend request or message [47]. This not only causes misuse of communication bandwidth, storage space and computational power, but also wastes users' time and violates their privacy rights. Therefore developing effective social spammer detection techniques is critically important in improving user experience and positively affecting the overall value of social media services [47]. Given a social network $G(U, S)$, social spammer detection is to find a set of spammers $U^S$ from $U$ with $U^S \subset U$.

Based on social correlation theory, there are two observations for normal users and spammers [73]. First, normal users perform similarly to their neighbors.
Second, spammers perform differently from their neighbors, since most of their neighbors are normal users. Therefore a social regularization term is proposed under the matrix factorization framework to model these observations, where two connected normal users should be close in the latent space since they share similar interests and may perform similar social activities, while spammers should be far away from their neighbors in the latent space.

In Twitter, users have directed following relations, and spammers can easily follow a large number of normal users within a short time. In [19], we divide user-user following relations in Twitter into four types: [spammer, spammer], [normal, normal], [normal, spammer], and [spammer, normal]. Since the fourth relation can be intentionally faked by spammers, we only consider the first three types of relations. Specifically, we introduce a graph regularization term to model social correlation theory in the directed social relations, which is integrated into the standard Lasso formulation to train a linear classifier for social spammer detection.

Spammers and normal users have very different social behaviors. Normal users are likely to form a group with other normal users, while spammers are likely to form spammer groups [29]. In [6], the authors incorporate community-based features of users with basic topological features to improve spammer classifiers. The approach first finds the overlapping community structure of users and then extracts features based on these communities, such as features that express the role of a user in the community structure (e.g., a boundary node or a core node) and the number of communities the user belongs to.

4.2 Social Theories in Relation-Related Tasks
A social network is usually represented by a binary adjacency matrix. First, the matrix is extremely sparse since there are many pairs of users with missing relations. Second, social networks in social media are more complicated.
For example, strengths of relations might be heterogeneous, such as acquaintances and best friends, while a social network may be a composite of various types of relations such as family, classmates and colleagues. Relation-related tasks focus on mining relations among users and aim to reveal a fine-grained and comprehensive view of social relations. Signed networks arise in social networks in various ways when users can implicitly or explicitly tag their relationships with other users as positive or negative. In this subsection, we review social theories in some key relation-related tasks on signed and unsigned networks.

4.2.1 Link Prediction
It is critical for social media sites to provide services that encourage more user interactions with a better experience, such as expanding one's social network. One effective way is to automatically recommend connections, since it is hard for users to figure out who is available on social media sites.

Most social media sites, such as Facebook, Twitter and LinkedIn, provide friend recommendation services to their customers. The essential problem of friend recommendation is known as link prediction [30]. When there is no relation between $u_i$ and $u_j$, $S_{ij} = 0$. The task of link prediction is to predict which pairs of users $u_i$ and $u_j$ without relations ($S_{ij} = 0$) are likely to get connected, given a social network $G(U, S)$.

Unsigned Networks: Homophily in social correlation theory suggests that similar users are likely to establish social relations. In [30], various similarity measurements based on the network structure, such as common neighbors, are reviewed for link prediction. One challenging problem in link prediction is the sparsity problem: some users may have very few or even no links. In [49], a low-rank matrix factorization framework with homophily effect, hTrust, is proposed to predict trust relations. Homophily coefficients are defined to measure the strength of homophily among users. The stronger the homophily between two users, the smaller the distance between them in the latent space. Homophily regularization is then defined to model the homophily effect by controlling users' distances in the latent space with the help of homophily coefficients. Through homophily regularization, trust relations can be suggested to users with few or even no relations, which mitigates the sparsity problem in link prediction. The confounding effect in social correlation theory suggests that people who share a high degree of overlap in their trajectories are expected to have a better likelihood of forming new links. In [59], the effect of confounding is investigated for link prediction. Specifically, it leverages mobility information to extract features that can capture some degree of closeness in the physical world between two individuals.
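
As a concrete instance of the structure-based similarity measures mentioned above, the sketch below ranks every unconnected pair of users by its number of common neighbors. It is a minimal illustration for a small undirected network; the edge-list input and the function name are assumptions made for this sketch, and it enumerates all pairs, which does not scale to real social networks.

```python
from itertools import combinations


def common_neighbour_scores(edges):
    """Rank every unconnected pair by its number of common neighbours."""
    neighbours = {}
    for u, v in edges:
        neighbours.setdefault(u, set()).add(v)
        neighbours.setdefault(v, set()).add(u)
    scores = {}
    for u, v in combinations(sorted(neighbours), 2):
        if v not in neighbours[u]:  # only pairs with S_ij = 0 are candidates
            scores[(u, v)] = len(neighbours[u] & neighbours[v])
    # Highest score first; ties are broken by pair name for determinism.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


# Invented toy network of five users.
toy_links = [("a", "b"), ("a", "c"), ("b", "d"), ("c", "d"), ("d", "e")]
ranked_pairs = common_neighbour_scores(toy_links)
```

The top-ranked pairs would be the first candidates for a friend recommendation. Users with no links at all never appear in the ranking, which is one face of the sparsity problem discussed above.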
Status theory suggests that new links are more likely to be attached from users with low status to those with high status, and preferential attachment models are widely used for link prediction based on status measures such as the degree of nodes and PageRank [5].

Signed Networks: In [27], local-topology-based features (16 triad types) based on balance theory and status theory are extracted to improve the performance of a logistic regression classifier in signed relation prediction. In [13], the authors use a probabilistic treatment of trust combined with a modified spring-embedded layout algorithm to classify a relation based on balance theory. Instead of having all users repel, the model adds a repelling force only between users connected with a negative relation to capture balance theory. For example, if one user is friends with an enemy of the other, the forces will push them to different locations. In [10], the authors show how any quantitative measure of social imbalance in a network can be used to derive a link prediction algorithm, and extend the approach in [27] by presenting a supervised machine learning based link prediction method that uses features derived from longer cycles in the network. The motivation for deriving features from longer cycles is that higher order cycles in a signed network yield a measure of imbalance suggested by balance theory. In [18], it is shown that the notion of weak structural balance in signed networks naturally leads to a global low-rank model for the network. Under such a model, the sign inference problem can be formulated as a low-rank matrix completion problem.

4.2.2 Social Tie Prediction
Social networks in social media can be a composite of various types of relations. For example, the relation types in Facebook could be family, colleagues, classmates and friends. However, in most online networks such as Facebook, Twitter and LinkedIn, such type information is usually unavailable [56]. Different types of relations may influence people in different ways.
For example, one user's work style may be mainly influenced by her or his colleagues, while daily life habits may be strongly affected by her or his family. It is necessary and important to reveal these different types of social relations; therefore we ask whether we can automatically infer the types of social relations for social networks in social media. A novel task of social tie prediction is designed to answer the above question, which aims to predict the type of a given social relation. A nonzero value of $S_{ij}$ suggests that there is a connection between $u_i$ and $u_j$. Formally, social tie prediction is to predict the type of a social relation between $u_i$ and $u_j$ with $S_{ij} \neq 0$ from a finite set of categorical types such as {family, classmates, colleagues, friends}.

In [53], a framework is proposed to classify the type of social relationships by learning across heterogeneous networks. The framework incorporates social theories such as balance theory and status theory into a factor graph model, which effectively improves the accuracy of inferring the type of social relationships in a target network by borrowing knowledge from a different source network. Balance theory and status theory should be general over different types of networks. To transfer knowledge from the source network to the target network, transfer features are extracted based on balance theory and status theory, which are shared by different types of networks. In particular, from social balance, the paper extracts triad-based features to denote the proportion of different balanced triangles in a network; and from status theory, it defines features over triads to respectively represent the probabilities of the seven most frequent formations of triads.

Different from [53], approaches are suggested by [68] to model balance theory and status theory mathematically. To model balance theory, it introduces a one-dimensional latent factor $\beta_i$ for each user $u_i$ and defines the sign between $u_i$ and $u_j$ as $s_{ij} = \beta_i \beta_j$.
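
A short check shows why a one-dimensional factor encodes balance. If the link sign is read as the sign of the product $\beta_i \beta_j$ (an interpretation assumed here, with all factors nonzero), then the product of the three signs in any triad is the sign of $\beta_i^2 \beta_j^2 \beta_k^2$, which is positive, so every triad is of type A or C. The sketch below verifies this on invented factor values; it illustrates the property only and is not the model of [68].

```python
from itertools import combinations


def sign(x):
    """Sign of a nonzero number as +1 or -1."""
    return 1 if x > 0 else -1


def pairwise_signs(beta):
    """beta: dict user -> nonzero one-dimensional latent factor."""
    return {
        frozenset((i, j)): sign(beta[i] * beta[j])
        for i, j in combinations(sorted(beta), 2)
    }


def all_triads_balanced(beta):
    """True if every triad induced by the latent factors is balanced."""
    signs = pairwise_signs(beta)
    for i, j, k in combinations(sorted(beta), 3):
        product = (signs[frozenset((i, j))]
                   * signs[frozenset((j, k))]
                   * signs[frozenset((i, k))])
        if product < 0:  # an odd number of negative links means unbalanced
            return False
    return True


# Hypothetical factors: two users on each side of zero.
toy_beta = {"a": 0.8, "b": 1.5, "c": -0.4, "d": -2.0}
toy_signs = pairwise_signs(toy_beta)
```

Users whose factors share a sign are joined by positive links and users with opposite signs by negative links, which corresponds to splitting the network into two mutually hostile groups.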
To model status theory, [68] introduces a global user-independent parameter $\eta$ to capture the partial ordering of users. $\eta$ maps the latent user profile $\gamma_i$ of user $i$ to a scalar quantity $l_i = \eta \gamma_i$, which reflects the social status of user $i$. According to status theory, it characterizes social ties from $i$ to $j$ by modeling the relative status difference between them as $l_{ij} = l_i - l_j$.

Tie Strength Prediction

Social media users can have hundreds of social relations. However, a recent study shows that Twitter users have a very small number of friends compared to the number of followers and followees they declare [21]. The low cost of link formation in social media can lead to networks with heterogeneous relationship strengths (e.g., acquaintances and best friends mixed together) [65]. Pairs of users with strong ties are likely to share greater similarity than those with weak ties; therefore a better understanding of the strengths of social relations can help social media sites serve their customers better, for example through better recommendations and more effective friend management tools, which gives rise to the problem of tie strength prediction. In the binary relation representation, once there is a connection between $i$ and $j$, $S_{ij} = 1$. The task of tie strength prediction is to predict a connection strength between 0 and 1 for $i$ and $j$ with $S_{ij} = 1$. After tie strength prediction, the binary relation representation matrix $S_{ij} \in \{0,1\}$ is converted into a continuous-valued relation representation matrix $S_{ij} \in [0,1]$. In [24], guided by social correlation theory, four different categories of features, i.e., attribute similarity, topological connectivity, transactional connectivity, and network transactional connectivity, are extracted from sources including friendship links, profile information, wall postings, picture postings, and group memberships. Then various classifiers are trained to predict link strength from transactional information based on these extracted features. An unsupervised latent variable model is proposed in [65] to predict tie strength in online social networks with user profiles and interactions. One key underlying assumption of the proposed model is social correlation theory. Homophily in social correlation theory postulates that users tend to form ties with other people who have similar characteristics, and it is likely that the stronger the tie, the higher the similarity. Thus the proposed framework models tie strength as the homophily effect of nodal profile similarities. The relationship strength directly influences the nature and frequency of online interactions between a pair of users. The stronger the relationship, the higher the likelihood that a certain type of interaction takes place between the pair of users. Therefore the proposed framework models the relationship strength as the hidden cause of influence among users.

4.3 Social Theories in Content-Related Tasks

Numerous techniques have been developed in the last decade for various content mining tasks such as classification and clustering. User generated content in social media is usually linked, noisy, highly unstructured and incomplete, which makes it difficult to apply existing techniques to these mining tasks on user generated content in social media.
Before the popularity of social media, researchers had already noticed that exploiting link information can improve content classification [72] and clustering [32]. The popularity of social media makes social relations pervasively available, which encourages the exploitation of social relations in more and more mining tasks. Social theories can help us understand social relations better, and in this subsection we review how social theories help some representative content-related tasks.

Social Recommendation

The pervasive use of social media generates massive data at an unprecedented rate, and the information overload problem becomes increasingly severe for social media users. Recommendation has been proven to be effective in mitigating the information overload problem, and it is significant for improving the quality of user experience and for positively impacting the success of social media. Users in the physical world are likely to seek suggestions from their friends before making a purchase decision, and users' friends consistently provide good recommendations [45]; we have similar observations in the online world. For example, 66% of people on social sites have asked friends or followers to help them make a decision, 88% of links that 14-24 year olds clicked were sent to them by a friend, and 78% of consumers trust peer recommendations over ads and Google SERPs. These intuitions motivate a new research direction of recommendation, social recommendation, which aims to take advantage of social relations to improve the performance of recommendation. Formally, a social recommender system is to predict missing values in the user-content interaction matrix $R$ based on information from the user-user relation matrix $S$ and the observed values in $R$ [51]. The major reason why people believe that social relations are helpful for improving recommendation performance is evidence from social correlation theory, which suggests that a user's preference is similar to or influenced by their directly connected friends [51].
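As a minimal sketch of this formulation, the snippet below fills a missing entry of R with the average rating that the user's directly connected friends gave to the same item. The toy matrices, the use of 0 to mark a missing rating, and the function name `predict_from_friends` are assumptions made for illustration; this is a naive baseline, not one of the published methods.

```python
from typing import Optional

import numpy as np


def predict_from_friends(R: np.ndarray, S: np.ndarray, user: int, item: int) -> Optional[float]:
    """Predict R[user, item] as the mean rating of the user's friends for that item.

    R: user-item rating matrix; 0 marks a missing rating (assumption).
    S: user-user relation matrix; S[i, j] != 0 means i is connected to j.
    Returns None when no friend has rated the item.
    """
    friends = np.flatnonzero(S[user])      # indices of directly connected users
    ratings = R[friends, item]             # their ratings for the item
    observed = ratings[ratings > 0]        # keep only observed ratings
    if observed.size == 0:
        return None
    return float(observed.mean())


# Toy data: 4 users, 3 items (hypothetical values).
R = np.array([
    [5, 0, 1],
    [4, 2, 0],
    [0, 0, 5],
    [0, 3, 0],
], dtype=float)
S = np.array([
    [0, 1, 1, 0],
    [1, 0, 0, 1],
    [1, 0, 0, 0],
    [0, 1, 0, 0],
], dtype=float)

prediction = predict_from_friends(R, S, user=0, item=1)
```

Here user 0 has not rated item 1, and the only connected user who has rated it is user 1, so the sketch falls back on that single rating. When no connected user has rated the item, the function returns `None` rather than guessing.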
Consistent with this theory, social media users rarely make decisions independently and usually seek advice from their friends before making purchase decisions. Since social relations may provide both similar and familiar evidence for users, MoleTrust uses socially connected users to replace similar users in the traditional user-based collaborative filtering method for recommendation [39]. Social correlation theory indicates that a user's preference should be similar to that of her/his social network. Ensemble methods predict a missing value for a given user as a linear combination of ratings from the user and her/his social network, based on the traditional matrix factorization CF method, with the intuition that users and their social networks should have similar ratings on the same items [50]. Regularization methods, in contrast, add regularization terms that force the preference of a user to be close to that of users in her/his social network under the matrix factorization CF method. For example, SocialMF defines a regularization term to force the preference of a user to be close to the average preference of the user's social network [22], and SoReg uses social regularization to force the preferences of two connected users to be close [35].

Feature Selection

One characteristic of user generated content in social media is its high dimensionality: for example, there are tens of thousands of terms in tweets or pixels for photos in Flickr. Traditional data mining tasks such as classification and clustering may fail due to the curse of dimensionality. Feature selection has been proven to be an effective way to handle high-dimensional data for efficient data mining [31]. As mentioned above, user generated content is linked due to the availability of social relations, which poses challenges to traditional feature selection algorithms, typically designed for IID data.
The formal definition of feature selection for user generated content in social media is stated in [52] as follows: we aim to develop a selector which selects a subset of the most relevant features from $F$ on the content-feature matrix $C$ with its social context $S$ and $R$. LinkedFS is proposed in [52] as a feature selection framework for user generated content with social context based on social correlation theory. Four types of relations, i.e., coPost, coFollowing, coFollowed and Following, are extracted from the social context $S$ and $R$ of user generated content $C$. Social correlation theory suggests that linked users are likely to share similar topics. Based on social correlation theory, LinkedFS turns these four types of relations into four corresponding hypotheses that can affect feature selection with linked data. For example, the following hypothesis assumes that one user $i$ follows another user $j$ because $i$ shares $j$'s interests, so their user generated content is more likely to be similar in terms of topics; hence LinkedFS models following relations mathematically by forcing the topics of two users with a following relation to be close to each other. LinkedFS jointly incorporates group Lasso with the regularization term to model each type of relation for feature selection.

Sentiment Analysis

Nowadays social media services such as Twitter and Facebook are increasingly used by online users to share and exchange opinions, providing rich resources to understand public opinions. For example, in [3], a simple model exploiting Twitter sentiment and content outperforms market-based predictors in forecasting box-office revenues for movies; and public mood as measured from a large-scale collection of tweets obtains an accuracy of 86.7% in predicting the daily up and down changes in the closing values of the DJIA [8]. Therefore sentiment analysis for such opinion-rich social media data has attracted increasing attention in recent years [46; 20]. Formally, sentiment analysis for user generated content with social relations is to obtain a predictor from the content-feature matrix $C$ with its social context $S$ and $R$, which can automatically label the sentiment polarity of an unseen post. Social correlation theory indicates that the sentiments of two linked users are likely to be similar. In [48], graphical models are proposed to incorporate social network information to improve user-level sentiment classification of different topics based on two observations: (1) user pairs in which at least one party links to the other are more likely to hold the same sentiment, and (2) two users with the same sentiment are more likely to have at least one link to the other than two users with different sentiments. Social correlation theory suggests that social relations are kinds of sentiment correlations. In [46], the authors propagate sentiment labels of tweets via user-user social relations $S$ and user-tweet relations $R$ to assign sentiment labels to unlabeled tweets. In [20], tweet-tweet correlation networks are built from $S$ and $R$ based on social correlation theory.
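One way to picture such a network is the sketch below, which links two tweets whenever the author of one follows the author of the other. The matrix product used here, the orientation of R (users by tweets), the direction of S, and the toy data are all assumptions made for illustration; the exact construction used in [20] may differ.

```python
import numpy as np

# S[i, k] = 1 if user i follows user k (3 users; assumed orientation).
S = np.array([
    [0, 1, 0],
    [0, 0, 0],
    [0, 0, 0],
])

# R[i, a] = 1 if user i posted tweet a (users x tweets; assumed orientation).
R = np.array([
    [1, 1, 0, 0],   # user 0 posted tweets 0 and 1
    [0, 0, 1, 0],   # user 1 posted tweet 2
    [0, 0, 0, 1],   # user 2 posted tweet 3
])

# (R^T S R)[a, b] = S[author(a), author(b)]: tweet a is linked to tweet b
# when the author of a follows the author of b.
A = R.T @ S @ R

# Symmetrize so that the correlation between two tweets is undirected.
A = ((A + A.T) > 0).astype(int)
```

In this toy example both tweets of user 0 become linked to the tweet of user 1, while the tweet of user 2, who has no following relations, stays isolated.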
Tweets from users with following relations, for example, should be correlated, as suggested by social correlation theory. Two tweets linked in the tweet-tweet correlation network are likely to share similar sentiments; hence the proposed framework SANT adds a graph regularization term to the Lasso classifier to force the sentiments of two correlated tweets to be close to each other.

4.4 Discussion

In reviewing state-of-the-art algorithms that exploit social theories in mining social media, we find that they aim to find mathematical explanations of social theories for computational models. We notice that the algorithms apply social theories in similar ways: feature engineering, constraint generating and objective defining.

Feature Engineering: It uses social theories to extract features for computational models. For example, in link prediction, the confounding effect in social correlation theory suggests that people who are physically close have a better likelihood of forming new links, and new features from users' mobility information are extracted in [59] to improve link prediction; while triad features based on status theory are extracted as transfer features to infer social ties by transferring knowledge from the source network to the target network [53].

[Figure 6: Social Theories in Social Media Mining. A table that maps user-related, relation-related and content-related social media mining tasks (community detection, user classification, spammer detection, link prediction, social tie prediction, tie strength prediction, social recommendation, feature selection, sentiment analysis) to feature engineering, constraint generating and objective defining.]

Constraint Generating: It generates constraints from social theories for computational models. Regularization is one of the most popular ways to implement constraint generating.
For example, SocialMF in social recommendation adds a social regularization term to force the preference of a user to be close to that of her/his social network to capture social correlation theory [22]; and hTrust adds a homophily regularization term to capture the homophily effect and mitigate the sparsity problem in link prediction [49].

Objective Defining: It uses social theories to define the objectives of the computational models. For example, two objectives are defined from balance theory to detect communities in signed networks [2]; and the user classification task is to make the labels of a user similar to those of her/his social network [15].

Instead of brute-force search, social theories can guide us to extract relevant features via feature engineering, to generate constraints via constraint generating, and to define objectives via objective defining for computational models. The algorithms reviewed earlier that exploit social theories in various social media mining tasks are summarized in Figure 6. We notice that for the same task, social theories can be exploited in different ways. For example, for link prediction, social theories are exploited via feature engineering, constraint generating and objective defining.

5. OPEN ISSUES AND FUTURE RESEARCH DIRECTIONS

5.1 More Social in Mining Social Media Data

Some social theories have been proven to be applicable to social media data, which encourages us to put "social" in social media mining. Integrating some social theories with computational models advances various social media mining tasks and has attracted increasing attention. This exciting progress not only shows that the direction of integrating social theories in mining social media data is appealing but also suggests that we should put more "social" in social media mining. In this article, we review the state-of-the-art algorithms that employ social correlation theory, balance theory and status theory in various social media mining tasks.
These theories are just illustrative examples, and more social theories, such as small world theory [74], could be applicable and employed, as shown in recent efforts to investigate and verify more social theories for social media data. Some of these efforts have already made initial progress, such as those on structural hole theory [9] and weak tie theory [16]. A person is said to span a structural hole in a social network if he or she is linked to people in parts of the network that are otherwise not well connected to one another [9]. Tang et al. [56] employ structural hole theory in the problem of social tie prediction; while Lou and Tang confirm the importance of structural holes in information diffusion with social media data, and show that mining structural holes can benefit various social media mining tasks such as community detection and link prediction [33]. Weak tie theory suggests that more novel information flows to individuals through weak rather than strong ties [16]. Recently researchers have found that weak ties of a user are helpful for predicting the preference of the user in user classification [54] and social recommendation [71].

5.2 New Social Theories

There is no doubt that social media data is a new type of social data and is much more complicated than the data social sciences use to study social theories. It is highly possible that new social theories can be discovered from social media data to make meaningful progress on important problems in social media mining; however, that progress requires serious engagement of both computer scientists and social scientists [61]. Data availability is still a challenging problem for social scientists. The data required to address many problems of interest to social scientists remain difficult to assemble, and it has been impossible to collect observational data on the scale of hundreds of millions, or even tens of thousands, of individuals [61]. Social media provides a virtual world for users' online activities and makes it possible for social scientists to observe social behavior and interaction data of hundreds of millions of users. However, social media data is too big to be directly handled by social scientists. On the other hand, computer scientists can employ data mining and machine learning techniques to handle big social media data, but we lack the necessary theories to help us understand social media data better.
For example, without a better understanding of social media data, computer scientists may waste a lot of time on feature engineering, which is the key to the success of many real-world applications [12]. Therefore engagement of both computer scientists and social scientists in social media data is truly mutually beneficial. Computer scientists can take advantage of social theories to mine social media data and provide computational tools that are of great potential benefit to social scientists; while social scientists can make use of computational tools to handle social media data and develop new social theories to help computer scientists provide better computational tools.

6. CONCLUSION

The social nature of social media data calls for new techniques and tools and cultivates a new field, social media mining. Social theories from social sciences have been proven to be applicable to mining social media. Integrating social theories with computational models is becoming an interesting way of mining social media data and is making exciting progress in various social media mining tasks. In this article, we review three key social theories, i.e., social correlation theory, balance theory and status theory, in mining social media data. In detail, we introduce basic concepts, verification methods, interesting findings and the state-of-the-art algorithms that exploit these social theories in social media mining tasks, which can be categorized into feature engineering, constraint generating and objective defining. As future directions, more existing social theories could be employed or new social theories could be discovered to advance social media mining.

Acknowledgments

This work is, in part, supported by NSF (#IIS ), ARO (#025071), ONR (N ) and a research fund from Yahoo Faculty Research and Engagement Program.

7. REFERENCES

[1] E. Agichtein, C. Castillo, D. Donato, A. Gionis, and G. Mishne. Finding high-quality content in social media. In WSDM.
[2] A. Amelio and C. Pizzuti.
Community mining in signed networks: a multiobjective approach. In ASONAM.
[3] S. Asur and B. A. Huberman. Predicting the future with social media. In WI-IAT.
[4] E. Bakshy, I. Rosenn, C. Marlow, and L. Adamic. The role of social networks in information diffusion. In WWW.
[5] A.-L. Barabási and R. Albert. Emergence of scaling in random networks. Science.
[6] S. Y. Bhat and M. Abulaish. Community-based features for identifying spammers in online social networks. In ASONAM. ACM.
[7] H. Bisgin, N. Agarwal, and X. Xu. Investigating homophily in online social networks. In WI-IAT.
[8] J. Bollen, H. Mao, and X. Zeng. Twitter mood predicts the stock market. Journal of Computational Science, 2(1):1-8.
[9] R. S. Burt. Structural holes: The social structure of competition. Harvard University Press.
[10] K.-Y. Chiang, N. Natarajan, A. Tewari, and I. S. Dhillon. Exploiting longer cycles for link prediction in signed networks. In CIKM.
[11] J. A. Davis. Clustering and structural balance in graphs. Human Relations.
[12] P. Domingos. A few useful things to know about machine learning. Communications of the ACM.
[13] T. DuBois, J. Golbeck, and A. Srinivasan. Predicting trust and distrust in social networks. In SocialCom.
[14] H. Gao, J. Tang, and H. Liu. Exploring social-historical ties on location-based social networks. In ICWSM.
[15] L. Getoor and C. P. Diehl. Link mining: a survey. ACM SIGKDD Explorations Newsletter.
[16] M. Granovetter. The strength of weak ties. JSTOR.
[17] F. Heider. Attitudes and cognitive organization. The Journal of Psychology, 1946.
[18] C.-J. Hsieh, K.-Y. Chiang, and I. S. Dhillon. Low rank modeling of signed networks. In KDD.
[19] X. Hu, J. Tang, Y. Zhang, and H. Liu. Social spammer detection in microblogging. In IJCAI.
[20] X. Hu, L. Tang, J. Tang, and H. Liu. Exploiting social relations for sentiment analysis in microblogging. In WSDM.
[21] B. Huberman, D. M. Romero, and F. Wu. Social networks that matter: Twitter under the microscope. First Monday.
[22] M. Jamali and M. Ester. A matrix factorization technique with trust propagation for recommendation in social networks. In RecSys.
[23] D. Jensen and J. Neville. Linkage and autocorrelation cause feature selection bias in relational learning. In ICML.
[24] I. Kahanda and J. Neville. Using transactional information to predict link strength in online social networks. In ICWSM.
[25] D. Kim, D. Kim, E. Hwang, and S. Rho. Twittertrends: a spatio-temporal trend detection and related keywords recommendation scheme. Multimedia Systems, 2014.
[26] M. Kosinski, D. Stillwell, and T. Graepel. Private traits and attributes are predictable from digital records of human behavior. PNAS.
[27] J. Leskovec, D. Huttenlocher, and J. Kleinberg. Predicting positive and negative links in online social networks. In WWW.
[28] J. Leskovec, D. Huttenlocher, and J. Kleinberg. Signed networks in social media. In CHI.
[29] F. Li and M.-H. Hsieh. An empirical study of clustering behavior of spammers and group-based anti-spam strategies. In CEAS.
[30] D. Liben-Nowell and J. Kleinberg. The link-prediction problem for social networks. JASIST.
[31] H. Liu and H. Motoda. Computational methods of feature selection. CRC Press.
[32] B. Long, Z. M. Zhang, X. Wu, and P. S. Yu. Spectral clustering for multi-type relational data. In ICML.
[33] T. Lou and J. Tang. Mining structural hole spanners through information diffusion in social networks. In WWW.
[34] Q. Lu and L. Getoor. Link-based classification. In ICML.
[35] H. Ma, D. Zhou, C. Liu, M. R. Lyu, and I. King. Recommender systems with social regularization. In WSDM.
[36] S. A. Macskassy and F. Provost. A simple relational classifier. In MRDM.
[37] S. A. Macskassy and F. Provost. Classification in networked data: A toolkit and a univariate case study. JMLR.
[38] P. Massa. A survey of trust use and modeling in real online systems. Trust in E-services: Technologies, Practices and Challenges.
[39] P. Massa and P. Avesani. Trust-aware collaborative filtering for recommender systems. In CoopIS, DOA, and ODBASE.
[40] M. McPherson, L. Smith-Lovin, and J. M. Cook. Birds of a feather: Homophily in social networks. Annual Review of Sociology.
[41] A. Mislove, B. Viswanath, K. P. Gummadi, and P. Druschel. You are who you know: inferring user profiles in online social networks. In WSDM.
[42] J. Neville and D. Jensen. Leveraging relational autocorrelation with latent group models. In MRDM.
[43] M. E. Newman and M. Girvan. Finding and evaluating community structure in networks. PRE, 69(2):026113.
[44] S. Papadopoulos, Y. Kompatsiaris, A. Vakali, and P. Spyridonos. Community detection in social media. DMKD.
[45] R. R. Sinha and K. Swearingen. Comparing recommendations made by online systems and friends. In DELOS.
[46] M. Speriosu, N. Sudan, S. Upadhyay, and J. Baldridge. Twitter polarity classification with label propagation over lexical links and the follower graph. In ULNLP.
[47] G. Stringhini, C. Kruegel, and G. Vigna. Detecting spammers on social networks. In ACSAC.
[48] C. Tan, L. Lee, J. Tang, L. Jiang, M. Zhou, and P. Li. User-level sentiment analysis incorporating social networks. In KDD.
[49] J. Tang, H. Gao, X. Hu, and H. Liu. Exploiting homophily effect for trust prediction. In WSDM.
[50] J. Tang, H. Gao, and H. Liu. mTrust: discerning multi-faceted trust in a connected world. In WSDM.
[51] J. Tang, X. Hu, and H. Liu. Social recommendation: a review. SNAM.
[52] J. Tang and H. Liu. Feature selection with linked data in social media. In SDM.
[53] J. Tang, T. Lou, and J. Kleinberg. Inferring social ties across heterogenous networks. In WSDM.
[54] L. Tang and H. Liu. Relational learning via latent social dimensions. In KDD.
[55] L. Tang and H. Liu. Community detection and mining in social media. Synthesis Lectures on Data Mining and Knowledge Discovery, 2010.
[56] W. Tang, H. Zhuang, and J. Tang. Learning to infer social ties in large networks. In PKDD.
[57] B. Taskar, P. Abbeel, M.-F. Wong, and D. Koller. Label and link prediction in relational data. In SRL.
[58] V. Traag and J. Bruggeman. Community detection in networks with positive and negative links. PRE, 80(3):036115.
[59] D. Wang, D. Pedreschi, C. Song, F. Giannotti, and A.-L. Barabási. Human mobility, social ties, and link prediction. In KDD.
[60] X. Wang, L. Tang, H. Gao, and H. Liu. Discovering overlapping groups in social media. In ICDM.
[61] D. J. Watts. Computational social science: Exciting progress and future directions. Winter Issue of The Bridge on Frontiers of Engineering.
[62] S. Webb, J. Caverlee, and C. Pu. Social honeypots: Making friends with a spammer near you. In CEAS.
[63] J. Weng, E.-P. Lim, J. Jiang, and Q. He. Twitterrank: finding topic-sensitive influential twitterers. In WSDM.
[64] R. Xiang and J. Neville. Collective inference for network data with copula latent Markov networks. In WSDM. ACM.
[65] R. Xiang, J. Neville, and M. Rogati. Modeling relationship strength in online social networks. In WWW.
[66] X. Xu, N. Yuruk, Z. Feng, and T. A. Schweiger. Scan: a structural clustering algorithm for networks. In KDD.
[67] J. Yang and J. Leskovec. Modeling information diffusion in implicit networks. In ICDM.
[68] S.-H. Yang, A. J. Smola, B. Long, H. Zha, and Y. Chang. Friend or frenemy?: predicting signed ties in social networks. In SIGIR.
[69] M. Ye, X. Liu, and W.-C. Lee. Exploring social influence for recommendation: a generative model approach. In SIGIR.
[70] R. Zafarani, M. A. Abbasi, and H. Liu. Social Media Mining: An Introduction. Cambridge University Press.
[71] X. Zhang, J. Cheng, T. Yuan, B. Niu, and H. Lu. Toprec: domain-specific recommendation through community topic mining in social network. In WWW.
[72] S. Zhu, K. Yu, Y. Chi, and Y. Gong. Combining content and link for classification using matrix factorization. In SIGIR.
[73] Y. Zhu, X. Wang, E. Zhong, N. N. Liu, H. Li, and Q. Yang. Discovering spammers in social networks. In AAAI.
[74] D. Watts and S. Strogatz. Collective dynamics of small-world networks. Nature, 1998.