INFORMS 2012 c 2012 INFORMS isbn 978-0-9843378-3-5 http://dx.doi.org/10.1287/educ.1120.0105 Mining Social Media: A Brief Introduction Pritam Gundecha, Huan Liu Arizona State University, Tempe, Arizona 85287 {pritam@asu.edu, huan.liu@asu.edu} Abstract Keywords The pervasive use of social media has generated unprecedented amounts of social data. Social media provides easily an accessible platform for users to share information. Mining social media has its potential to extract actionable patterns that can be beneficial for business, users, and consumers. Social media data are vast, noisy, unstructured, and dynamic in nature, and thus novel challenges arise. This tutorial reviews the basics of data mining and social media, introduces representative research problems of mining social media, illustrates the application of data mining to social media using examples, and describes some projects of mining social media for humanitarian assistance and disaster relief for real-world applications. social media; data mining; social data; social media mining; social networking sites; blogging; microblogging; crowdsourcing; HADR; privacy; trust 1. Introduction Data mining research has successfully produced numerous methods, tools, and algorithms for handling large amounts of data to solve real-world problems. Traditional data mining has become an integral part of many application domains including bioinformatics, data warehousing, business intelligence, predictive analytics, and decision support systems. Primary objectives of the data mining process are to effectively handle large-scale data, extract actionable patterns, and gain insightful knowledge. Because social media is widely used for various purposes, vast amounts of user-generated data exist and can be made available for data mining. Data mining of social media can expand researchers capability of understanding new phenomena due to the use of social media and improve business intelligence to provide better services and develop innovative opportunities. For example, data mining techniques can help identify the influential people in the vast blogosphere, detect implicit or hidden groups in a social networking site, sense user sentiments for proactive planning, develop recommendation systems for tasks ranging from buying specific products to making new friends, understand network evolution and changing entity relationships, protect user privacy and security, or build and strengthen trust among users or between users and entities. Mining social media is a burgeoning multidisciplinary area where researchers of different backgrounds can make important contributions that matter for social media research and development. The objective of this tutorial is to introduce social media, data mining, and their confluence mining social media. We attempt to achieve the goal by presenting representative and interesting research issues and important social media tasks based on our experience and research. This tutorial first reviews data mining, social media and its types, and the importance of social media mining. In 2, we briefly introduce representative issues in social media mining. In 3, we highlight the impact of social media mining using three examples based on our current research. Section 4 illustrates how social media mining is applied in some real-world applications two projects on humanitarian assistance and disaster relief (HADR) carried out in the Data Mining and Machine Learning Laboratory (DMML) at Arizona State University (ASU). We conclude this tutorial in 5. 1
2 Tutorials in Operations Research, c 2012 INFORMS 1.1. Data Mining Data mining (Tan et al. [55]) is a process of discovering useful or actionable knowledge in large-scale data. Data mining also means knowledge discovery from data (KDD) (Han et al. [24]), which describes the typical process of extracting useful information from raw data. The KDD process broadly consists of the following tasks: data preprocessing, data mining, and postprocessing. These steps need not be separate tasks and can be combined together. Data mining is an integral part of many related fields including statistics, machine learning, pattern recognition, database systems, visualization, data warehouse, and information retrieval (Han et al. [24]). Data mining algorithms are broadly classified into supervised, unsupervised, and semisupervised learning algorithms. Classification is a common example of supervised learning approach. For supervised learning algorithms, a given data set is typically divided into two parts: training and testing data sets with known class labels. Supervised algorithms build classification models from the training data and use the learned models for prediction. To evaluate a classification model s performance, the model is applied to the test data to obtain classification accuracy. Typical supervised learning methods include decision tree induction, k-nearest neighbors, naive Bayes classification, and support vector machines. Unsupervised learning algorithms are designed for data without class labels. Clustering is a common example of unsupervised learning. For a given task, unsupervised learning algorithms build the model based on the similarity or dissimilarity between data objects. Similarity or dissimilarity between the data objects can be measured using proximity measures including Euclidean distance, Minkowski distance, and Mahalanobis distance. Other proximity measures such as simple matching coefficient, Jaccard coefficient, cosine similarity, and Pearson s correlation can also be used to calculate similarity or dissimilarity between the data objects. K-means, hierarchical clustering (agglomerative or partitional methods), and density-based clustering are typical examples of unsupervised learning. Semisupervised learning algorithms are most applicable where there exist small amounts of labeled data and large amounts of unlabeled data. Two typical types of semisupervised learning are semisupervised classification and semisupervised clustering. The former uses labeled data to make classification and unlabeled data to refine the classification boundaries further, and the latter uses labeled data to guide clustering. Cotraining is a representative semisupervised learning algorithm. Active learning algorithms allow users to play an active role in the learning process via labeling. Typically, users are domain experts and their skills are employed to label some data instances for which a machine learning algorithm are confident about its classification. Minimum marginal hyperplane and maximum curiosity are two popular active learning algorithms. Data mining includes other techniques such as association rule mining, anomaly detection, feature selection, instance selection, and visual analytics. Additional details related to these data mining techniques can be found in Han et al. [24], Tan et al. [55], Witten et al. [59], Zhao and Liu [63], and Liu and Motoda [37]. 1.2. Social Media Social media (Kaplan and Haenlein [28]) is defined as a group of Internet-based applications that build on the ideological and technological foundations of Web 2.0 and that allow the creation and exchanges of user-generated content. Social media is conglomerate of different types of social media sites including traditional media such as newspaper, radio, and television and nontraditional media such as Facebook, Twitter, etc. Table 1 shows characteristics of different types of social media. Social media gives users an easy-to-use way to communicate and network with each other on an unprecedented scale and at rates unseen in traditional media. The popularity of social media continues to grow exponentially, resulting in an evolution of social networks, blogs,
Tutorials in Operations Research, c 2012 INFORMS 3 Table 1. Characteristics of different types of social media. Type Online social networking Blogging Microblogging Wikis Social news Social bookmarking Media sharing Opinion, reviews, and ratings Answers Characteristics Online social networks are Web-based services that allow individuals and communities to connect with real-world friends and acquaintances online. Users interact with each other through status updates, comments, media sharing, messages, etc. (e.g., Facebook, Myspace, LinkedIn). A blog is a journal-like website for users, aka bloggers, to contribute textual and multimedia content, arranged in reverse chronological order. Blogs are generally maintained by an individual or by a community (e.g., Huffington Post, Business Insider, Engadget). Microblogs can be considered same a blogs but with limited content (e.g., Twitter, Tumblr, Plurk). A wiki is a collaborative editing environment that allow multiple users to develop Web pages (e.g., Wikipedia, Wikitravel, Wikihow). Social news refers to the sharing and selection of news stories and articles by community of users (e.g., Digg, Slashdot, Reddit). Social bookmarking sites allow users to bookmark Web content for storage, organization, and sharing (e.g., Delicious, StumbleUpon). Media sharing is an umbrella term that refers to the sharing of variety of media on the Web including video, audio, and photo (e.g., YouTube, Flickr, UstreamTV). The primary function of such sites is to collect and publish usersubmitted content in the form of subjective commentary on existing products, services, entertainment, businesses, places, etc. Some of these sites also provide products reviews (e.g., Epinions, Yelp, Cnet). These sites provide a platform for users seeking advice, guidance, or knowledge to ask questions. Other users from the community can answer these questions based on previous experiences, personal opinions, or relevent research. Answers are generally judged using ratings and comments (e.g., Yahoo! answers, WikiAnswers). microblogs, location-based social networks (LBSNs), wikis, social bookmarking applications, social news, media (text, photo, audio, and video) sharing, product and business review sites, etc. Facebook, 1 the social networking site, recorded more than 845 million active users as of December 2011. This number suggests that China (approximately 1.3 billion) and India (approximately 1.1 billion) are the only two countries in the world that have larger populations than Facebook. Facebook and Twitter have accrued more than 1.2 billion users, 2 more than thrice the population of the United States and more than the population of any continent except Asia. 1.3. Mining Social Media Vast amounts of user-generated content are created on social media sites every day. This trend is likely to continue with exponentially more content in the future. Hence, it is critical for producers, consumers, and service providers to figure out management and utility of massive user-generated data. Social media growth is driven by these challenges: (1) How 1 http://newsroom.fb.com/content/default.aspx?newsareaid=22 (accessed March 2012). 2 Facebook and Twitter have many overlapping users. Hence, 1.2 billion users are not unique.
4 Tutorials in Operations Research, c 2012 INFORMS can a user be heard? (2) Which source of information should a user use? (3) How can user experience be improved? Answers to these questions are hidden in the social media data. These challenges present ample opportunities for data miners to develop new algorithms and methods for social media. Data generated on social media sites are different from conventional attribute-value data for classic data mining. Social media data are largely user-generated content on social media sites. Social media data are vast, noisy, distributed, unstructured, and dynamic. These characteristics pose challenges to data mining tasks to invent new efficient techniques and algorithms. For example, Facebook 3 and Twitter 4 report Web traffic data from approximately 149 million and 90 million unique U.S. visitors per month, respectively. According to the video sharing site YouTube, 5 more than 4 billion videos are viewed per day, and 60 hours of videos are uploaded every minute. The picture sharing site Flickr, 6 as of August 2011, hosts more than 6 billion photo images. Web-based, collaborative, and multilingual Wikipedia 7 hosts over 20 million articles attracting over 365 million readers. Depending on social media platforms, social media data can often be very noisy. Removing the noise from the data is essential before performing effective mining. Researchers notice that spammers (Yardi et al. [61], Chu et al. [12]) generate more data than legitimate users. Social media data are distributed because there is no central authority that maintains data from all social media sites. Distributed social media data pose a daunting task for researchers to understand the information flows on the social media. Social media data are often unstructured. To make meaningful observations based on unstructured data from various data sources is a big challenge. For example, social media sites like LinkedIn, Facebook, and Flickr serve different purposes and meet different needs of users. Social media sites are dynamic and continuously evolving. For example, Facebook recently brought about many concepts including a user s timeline, the creation of in-groups for a user, and numerous user privacy policy changes. The dynamic nature of social media data is a significant challenge for continuously and speedily evolving social media sites. There are many additional interesting questions related to human behavior can be studied using social media data. Social media can also help advertisers to find the influential people to maximize the reach of their products within an advertising budget. Social media can help sociologists to uncover the human behavior such as in-group and out-group behaviors of users. Recently, social media was reported to play an instrumental role in facilitating mass movements such as the Arab Spring 8 and Occupy Wall Street. 9 2. Issues in Mining Social Media In this section, we introduce some representative research issues in mining social media. 2.1. Community Analysis A community is formed by individuals such that those within a group interact with each other more frequently than with those outside the group. Based on the context, a community is also referred to as a group, cluster, cohesive subgroup, or module. Communities can be observed via connections in social media because social media allows people to expand social networks online (Tang and Liu [56]). Social media enables people to connect friends and 3 http://www.quantcast.com/facebook.com (accessed March 2012). 4 http://www.quantcast.com/twitter.com (accessed March 2012). 5 http://www.youtube.com/t/press statistics (accessed March 2012). 6 http://news.softpedia.com/news/flickr-boasts-6-billion-photo-uploads-215380.shtml (accessed March 2012). 7 http://en.wikipedia.org/wiki/wikipedia (accessed March 2012). 8 http://en.wikipedia.org/wiki/arab Spring (accessed March 2012). 9 http://en.wikipedia.org/wiki/occupy Wall Street (accessed March 2012).
Tutorials in Operations Research, c 2012 INFORMS 5 find new users of similar interests. Communities found in social media are broadly classified into explicit and implicit groups. Explicit groups are formed by user subscriptions, whereas implicit groups emerge naturally through interactions. Community analysts are generally faced with issues such as community detection, formation, and evolution. Community detection often refers to the extraction of implicit groups in a network. The main challenges of community detections are that (1) the definition of a community can be subjective, and (2) the lack of ground truth makes community evaluation difficult. Tang and Liu [56] divided community detection methods into four categories: (1) node-centric community detection, where each node satisfies certain properties such as complete mutuality, reachability, node degrees, frequency of within and outside ties, etc. (examples include cliques, k-cliques, and k-clubs); (2) group-centric community detection, where a group needs to satisfy certain properties (for example, minimum group densities); (3) network-centric community detection, where groups are formed based on partition of network into disjoint sets (examples are spectral clustering and modularity maximization); and (4) hierarchycentric community detection, where the goal is to build a hierarchical structure of communities. This allows the analysis of a network with different resolutions. Representative methods are divisive clustering and agglomerative clustering. Social media networks are highly dynamic. Communities can expand, shrink, or dissolve in dynamic networks. Community evolution aims to discover the patterns of a community over time with the presence of dynamic network interactions. Backstrom et al. [8] found that the more friends you have in a group, the more likely you are to join, and communities with cliques grow more slowly than those that are not tightly connected. 2.2. Sentiment Analysis and Opinion Mining Sentiment analysis and opinion mining aim to automatically extract opinions expressed in the user-generated content. Sentiment analysis and opinion mining tools allow businesses to understand product sentiments, brand perception, new product perception, and reputation management. These tools help users to perceive product opinions or sentiments on a global scale. There are many social media sites reporting user opinions of products in many different formats. Monitoring these opinions related to a particular company or product on social media sites is a new challenge. Sentiment analysis is hard because languages used to create contents are ambiguous. Major steps of sentiment analysis are (1) finding relevant documents, (2) finding relevant sections, (3) finding the overall sentiment, (4) quantifying the sentiment, and (5) aggregating all sentiments to form an overview. Basic components of an opinion are (1) an object on which opinion is expressed, (2) an opinion expressed on a object, and (3) the opinion holder. Objects are generally represented as a finite set of features, where each feature represents a finite set of synonymous words or phrases. Opinion mining tasks can be performed at the document level (Turney [58], Pang et al. [49]), sentence level (Riloff and Wiebe [51], Yu and Hatzivassiloglou [62]), or feature level (Hu and Liu [25], Popescu and Etzioni [50], Liu and Maes [36]). Extracting opinions expressed in comparative sentences is a difficult task, and some preliminary work can be found in Jindal and Liu [26], Jindal and Liu [27], Liu [35], and Pang and Lee [48]. Performance evaluation of sentiment analysis is another challenge because of the lack of ground truth. 2.3. Social Recommendation Traditional recommendation systems attempt to recommend items based on aggregated ratings of objects from users or past purchase histories of users. A social recommendation system makes use of user s social network and related information in addition to the traditional recommendation means. Social recommendation is based on the hypothesis that people who are socially connected are more likely to share the same or similar interests
6 Tutorials in Operations Research, c 2012 INFORMS (homophily), and users can be easily influenced by the friends they trust and prefer their friends recommendations to random recommendations. Objectives of social recommendation systems are to improve the quality of recommendation and alleviate the problem of information overload. Examples of social recommendation systems are book recommendations based on friends reading lists on Amazon or friend recommendations on Twitter and Facebook. More details on social recommendation systems can be found in Konstas et al. [30], Ma et al. [38], and Backstrom and Leskovec [7]. 2.4. Influence Modeling Social scientists have been exploring influence and homophily in social networks for quite some time (McPherson et al. [42], Lazarsfeld and Merton [34]). It is important to know whether the underlying social network is influence driven or homophily driven. For example, in the advertisement industry, if the social network is influence driven, then the influential users should be identified and incentivized to promote the product or services to the members of the social network. However, if the social network is homophily (similarity) driven, then some individual users should be directly targeted to promote sales. Most social networks have a mixture of both homophily and influence. Hence, distinguishing them is a challenge. Aral et al. [5] and La Fond and Neville [33] gave details on distinguishing social influence and homophily in social networks. Kempe et al. [29] studied the influence maximization problem. For a given information propagation model, influence maximization aims to identify the set of initial influential users from a given snapshot of a social network such that they can influence the maximum number of other users within given budget constraints. Agarwal et al. [3, 4] presented a preliminary model to identify influential bloggers in a community. The blogosphere obeys a power law distribution with a few blogs being extremely influential and a huge number of blogs being largely unknown. They reported that active bloggers are not necessarily influential and proposed effective influence measures to identify influential bloggers. A more general discussion on modeling and data mining in the blogosphere is given by Agarwal and Liu [2]. 2.5. Information Diffusion and Provenance Researchers study how information diffuses and explore different models of information diffusion, including the independent cascade model, the threshold model, the susceptible infected model, and the susceptible infected recovered model. Many such models have been studied (Bailey [10], Granovetter [21], Mahajan [40], Macy [39], Berger [11]). Researchers apply these models to analyze the spread of rumors, computer viruses, and diseases during outbreaks. Two important problems from the social media viewpoint are (1) how information spreads in a social media network and which factors affect the spread, and (2) what plausible sources are, given some information from social media. The first problem of information diffusion has received good attention from researchers. The second problem is still an open research problem information provenance in social media and recognized as a key issue to differentiate rumors from truth. Because social media data are distributed and dynamic, the conventional techniques used in classical provenance research cannot be directly applied in social media. 2.6. Privacy, Security, and Trust The low barrier to and pervasive use of social media give rise to concerns on user privacy and security issues. New challenges arise due to user s opposing needs: on one hand, a user would like to have as many friends and share as much as possible, and on the other hand, a user would like to be as private as possible when needed. However, being gregarious requires openness and transparency, but being private constricts one s sharing. In addition, a social networking site has its business needs to encourage users to easily find each other and expand
Tutorials in Operations Research, c 2012 INFORMS 7 their friendship networks as widely as possible in other words, to be open. Hence, social media poses new security challenges to fend off security threats to users and organizations. With the variety of personal information disclosed in user profiles (e.g., information about other users and user networks may be indirectly accessible), individuals may put themselves and members of their social networks at risk for a variety of attacks. Social media has been the target of numerous passive as well as active attacks including stalking, cyberbullying, malvertizing, phishing, social spamming, scamming, and clickjacking. Gross and Acquisti [22] showed that only a few users change the default privacy preferences on Facebook. In some cases, user profiles are completely public, making information available and providing a communication mechanism to anyone who wants to access it. It is no secret that when a profile is made public, malicious users including stalkers, spammers, and hackers can use sensitive information for their personal gain. Sometimes malevolent users can even cause physical or emotional distress to other users (Rosenblum [52]). Narayanan and Shmatikov [45, 46] demonstrated how users privacy can be weakened if an attacker knows the presence of connections among users. Wondracek et al. [60] presented a successful scheme to breach privacy by exploiting only the group membership information of users. Liu and Maes [36] pointed out a lack of privacy awareness and found a large number of social network profiles in which people described themselves with a rich vocabulary in terms of their passions and interests. Krishnamurthy and Wills [31] discussed the problem of leakage of personally identifiable information and how it can be misused by third parties (Narayanan and Shmatikov [46]). Squicciarini et al. [53] introduced a novel collective privacy mechanism for better managing shared content between users. Fang and LeFevre [14] focused on helping users to understand simple privacy settings, but did not consider additional problems such as attribute inference (Zheleva and Getoor [64]) or shared data ownership (Squicciarini et al. [53]). Zheleva and Getoor [64] showed how an adversary can exploit an online social network with a mixture of public and private user profiles to predict the private attributes of users. Baden et al. [9] presented a framework where users dictate who may access their information based on public private encryption decryption algorithms. Social trust depends on many factors that cannot be easily modeled in a computational system. Many different versions of definition of trust are proposed in the literature (Deutsch [13], Sztompka [54], Mui et al. [44], Olmedilla et al. [47], Grandison and Sloman [20], Artz and Gil [6]). A highly cited thesis on trust computation (Marsh [41]) provides theoretical perspectives of modeling trust, but its complex nature makes it very difficult to apply, especially to social networks (Golbeck and Hendler [18]). Trust between any two people is observed to be affected by many factors including past experiences, opinions expressed and actions taken, contributions to spreading rumors, influence by others opinions, and motives to gain something extra. Another important aspect of trust is the trustworthiness of user-generated content. Moturu and Liu [43] provided an intuitive scoring measure to quantify the trustworthiness of health-related user-generated content in social media. 3. Illustrative Examples of Mining Social Media In this section, we present some examples in our research to illustrate how to mine social media data to address novel research issues. The first example is about assessing user vulnerability on a social networking sites to maintain user privacy. The second example explores the importance of social and historical ties on a location-based social network. The third example introduces a method that can take advantage of multifaceted trust for predictive analytics.
8 Tutorials in Operations Research, c 2012 INFORMS 3.1. Assessing User Vulnerability on a Social Networking Site (Gundecha et al. [23]) Attribute-value data are a principal data form in social media. Attributes available for every user on a social networking site can be categorized into two major types: individual attributes and community attributes. Individual attributes are those attributes that contain individual user information. Individual attributes include personal information such as gender, birth date, phone number, home address, etc. Community attributes are those attributes that contain information about the friends of a user. Community attributes include friends that are traceable from a user s profile (i.e., a user s friends list), tagged pictures, and wall interactions. Using the privacy and security settings of a profile, a user can control the visibility of most individual attributes but has limited control over the visibility of most community attributes. For example, Facebook users these days can control photo tagging and the sharing of their friend list with the public but still can not control friends sharing their friend lists or uploading photos of them from their profiles to the public. Our earlier work (Gundecha et al. [23]) used a large-scale Facebook data set to assess attribute visibility, which can be used to obtain general behavior of Facebook users. For example, Facebook users do not usually disclose their mobile phone number. Hence, users that do disclose phone numbers have a propensity to vulnerability because they disclose more sensitive information in their profiles. A large portion of users are either not careful or not aware of consequences of their actions on the privacy information of their friends. Thus, protecting community attributes is important in protecting user privacy. A novel mechanism was proposed by Gundecha et al. [23] to enable users to protect against vulnerability. Often users on a social networking site are unaware that they could pose a threat to their friends because of their vulnerability. Gundecha et al. [23] showed that it is feasible to measure a user s vulnerability based on three factors: (1) the user s privacy settings that can reveal personal information, (2) the user s action on a social networking site that can expose his or her friends personal information, and (3) friends actions on a social networking site that can reveal the user s personal information. Based on these factors, Gundecha et al. [23] formally presented one of the earliest models for vulnerability reduction. They proposed a four-step procedure to estimate user vulnerability: (1) estimate risk to privacy due to individual attributes (referred to as the I-index), (2) estimate risk to privacy due to community attributes (referred to as the C-index), (3) estimate the visibility of a user based on the I-index and C-index (referred to as the P -index), and (4) estimate the user vulnerability based on the P -indexes of a user and his one-hop friends (referred to as the V -index). Figures 1(a) and 1(b) show the relationship between the I-index and C-index as well as that between the P -index and V -index for 100,000 randomly chosen Facebook users. Note that users are sorted in ascending order of their I-indexes and P -indexes. The x-axis and y-axis show users and their index values, respectively. A user s vulnerable friend is defined as a friend whose unfriending will lower the V -index score of a user. This definition of a vulnerable friend can be generalized to multiple vulnerable friends. Figure 2 shows a comparison of the V -index values of each user before and after the unfriending of the user s k most vulnerable friends. For each graph in Figure 2, the x-axis and y-axis indicate users and their V -index values, respectively. Without loss of generality, we sorted all users, in the ascending order based on the existing V -index values, before we plotted the graphs in Figure 2. We ran the experiment on 300,000 randomly selected users of the Facebook data set. Figure 2 shows the performance comparison of V -index values for each user before (+) and after ( ) unfriending the k most vulnerable friends. We ran the experiments for different values of k including 1, 2, 10, and 50. As expected, vulnerability decreases consistently as the value of k increases, as seen in Figures 2(a) 2(d).
Tutorials in Operations Research, c 2012 INFORMS 9 Figure 1. Relationship among index values for each user. (a) I-index and C-index for each user (b) P-index and V-index for each user 1.0 C-index I-index 1.0 V-index P-index 0.8 0.8 Index values 0.6 0.4 Index values 0.6 0.4 0.2 0.2 0 0 2.5 5.0 7.5 10 Users 10 4 0 0 2.5 5.0 7.5 10 Users 10 4 3.2. Exploring Social Historical Ties on Location-Based Social Networks (Gao et al. [16]) LBSNs have been a popular form of social media in recent years. They provide locationrelated services that allow users to check in at geographical locations and share such experience with their friends. Millions of check-in records in LBSNs contain rich social and geographical information and provide a unique opportunity for researchers to study users Figure 2. Performance comparison of V -index values for each user before (+) and after ( ) unfriending the k most vulnerable friends from his or her social network. (a) Most vulnerable (b) 2 most vulnerable M1-index V-index M2-index V-index 0.6 0.6 Index values 0.4 0.2 Index values 0.4 0.2 0 0 0 1 2 3 0 1 2 3 Users 10 5 Users 10 5 (c) 10 most vulnerable (d) 50 most vulnerable M10-index V-index M50-index V-index 0.6 0.6 Index values 0.4 0.2 Index values 0.4 0.2 0 0 1 2 Users 10 5 3 0 0 1 2 Users 10 5 3
10 Tutorials in Operations Research, c 2012 INFORMS social behavior from a spatial temporal aspect, which in turn enables a variety of services including place advertisement or recommendation, traffic forecasting, and disaster relief. To understand a user s check-in behavior, it is inevitable to perform a historical analysis of users. It is because the historical check-ins provide rich information about a user s interests and hints about when and where a particular user would like to go. In addition, social correlation theory suggests to consider users social ties, because human movement is usually affected by their social events, such as visiting friends, going out with colleagues, and so on. These two relationship ties can shape the user s check-in experience on LBSNs, and each tie gives rise to a different probability of check-in activity, which indicates that people in different spatial temporal social circles have different interactions. The historical ties of a user s check-in behavior have two properties on LBSNs. First, a user s check-in history approximately follows a power-law distribution; i.e., a user goes to a few places many times and to many places a few times. Second, the historical ties have a short-term effect. Taking advantage of the similarity between language modeling and location-based social network mining, the work of Gao et al. [16] introduced the Pitman Yor process to location-based social networks to model the historical ties of a user i for his checkin behavior c n+1 = l at time (n + 1) and location l, specifically, the power-law distribution and short-term effect of historical ties, denoted as historical model (HM) as shown below: P i H(c n+1 = l) = P i, i HP Y (c n+1 = l u, t ul, d u, r u, t u ), where u, t ul, d u, r u, and t u are parameters. A social historical model (SHM) is proposed to explore user i s check-in behavior integrating both of the social and historical effects: P i SH(c n+1 = l) = ηp i H(c n+1 = l) + (1 η)p i S(c n+1 = l), where P i S(c n+1 = l) = u j N (u i) sim(u i, u j )P i, j HP Y (c n+1 = l), where N indicates user i s set of neighbors. The experiments with location prediction on a real-world LBSN compare the proposed methods (HM and SHM) with some baseline methods. The results are plotted in Figure 3, demonstrating that the proposed methods best model users check-ins in terms of location prediction; in other words, social and historical ties can help location prediction. This work finds that a user s friends can influence his next location because users that have shared friends tend to go to similar locations than those without. The power-law property and short-term effect are observed in historical ties; thus, a historical model is introduced to capture these properties. The experimental results on location prediction demonstrate that the proposed approach is suitable in capturing a user s check-in property and outperforms current models. 3.3. Discerning Multifaceted Trust in a Connected World (Tang et al. [57]) The issue of trust has attracted increasing attention from the community of social media research. Trust, as a social concept, naturally has multiple facets, indicating multiple and heterogeneous trust relationships between users. Here is a multifaceted trust example from Epinions. Figure 4(a) shows single trust relationships between user 1 and his 20 friends. Here, we can see that user 7 is the more trustable for user 1. Figures 4(b) and 4(c) show their multifaceted trust relationships in the categories home and garden and restaurants, respectively. For the category home and garden, user 7 is not necessary the most trusted
Tutorials in Operations Research, c 2012 INFORMS 11 Figure 3. The performance comparison of prediction models demonstrates that by considering social and historical ties, the proposed models can help location prediction. Prediction accuracy 0.40 0.35 0.30 0.25 MFC MFT Order-1 Order-2 HM SHM 0.20 0.15 10 20 30 40 50 60 70 80 90 Fraction of training set Note. MFC, most frequent check-in model; MFT, most frequent time model; Order-1, order-1 Markov model; Order-2, order-2 Markov model. friend of user 1. This shows that trust relationships in different categories vary. Thus, people trust others differently in different facets. There are two challenges to study in obtaining multifaceted trust between users: first, the representation of multiple and heterogeneous trust relationships between users, and second, estimating the strength of multifaceted trust. Traditionally, trust is represented by an adjacency matrix. However, this cannot capture the multifaceted trust relations. Tang et al. [57] developed a new algorithm, mtrust, that extends a matrix representation to a tensor representation, adding an extra dimension for facet description. Previous work observed a strong correlation between trust and user similarity in the context of rating systems. Therefore, it is reasonable to embed trust strength inference in rating prediction. Thus, to evaluate the usefulness of multifaced trust, this work embeds the multifaceted trust inference in the framework of rating prediction. Interesting findings from the experiments are that (1) more than 20% of reciprocal links are heterogeneous, (2) more than 14% transitive trust relations are heterogeneous, and (3) more Figure 4. Single trust and multifaceted trust relationships of one user in Epinions. (a) Single trust (b) Trust in home and garden (c) Trust in restaurants 3 4 5 6 7 8 9 3 4 5 6 7 8 9 3 4 5 6 7 8 9 2 10 2 10 2 10 21 20 1 19 18 17 16 15 11 12 13 21 14 20 Pajek 1 19 18 17 16 15 11 12 13 14 Pajek 21 20 19 1 18 17 16 15 11 12 13 14 Pajek Note. The thickness of a line indicates the level of trust.
12 Tutorials in Operations Research, c 2012 INFORMS than 11% of cocitation trust relations are heterogeneous. With these findings, mtrust can be applied to many online tasks such as improving rating prediction, enabling facet-sensitive ranking, and making status theory applicable to reciprocal links. 4. Employing Social Media in Real-World Applications Social media has been increasingly used in a wide range of domains; examples include political campaigns (e.g., presidential elections), mass movements (e.g., organizing Occupy Wall Street movements, Arab Spring), as well as disaster and crisis response and relief coordination. In this section, we show how social media mining is used for HADR. Government and nongovernmental organizations (NGOs) have been faced with challenges to effectively respond to crises such as natural disasters (e.g., tsunamis, hurricanes, earthquakes). Providing efficient relief to the victims is a primary focus in saving lives, minimizing further losses in the aftermath, and accelerating recovery. Crowdsourcing tools via social media such as Twitter, Ushahidi, and Sahana have proven useful in gathering information during and after a crisis. However, they are designed to go beyond crowdsourcing. Additional capabilities are required for response coordination, secured collaboration, and trust building among relief organizations. Goolsby [19] pointed out the need for such a social media system and described the effort to build it to allow different organizations to share crowdsourced and groupsourced information, and analyze and visualize the processed information for intelligent decision making. In this section, we describe two social media prototype systems designed to facilitate efficient collaboration among disparate organizations for effective and coordinated responses to crises. 4.1. ASU Coordination Tracker (Gao et al. [17]) When natural disasters occur, the international community would join forces to provide disaster relief and humanitarian assistance. Some prominent examples include the Haiti earthquake and the subsequent cholera outbreak, and the devastating earthquake and tsunami in Japan. Social media has revolutionized the use of traditional media and played an important role in these events as an information collector and disseminator, and as a communication and collaboration tool. As one of the most important functions of social media, crowdsourcing is a collaborative information sharing mechanism based on the principle of collective wisdom. Wikipedia 10 is a perfect example of crowdsourcing, where people collectively publish information on various topics. Crowdsourcing is capable of leveraging participatory social media services and tools to collect information, and it allows crowds to participate in various HADR tasks. Its integration with crisis maps has been a very effective crowdsourcing application in HADR efforts. One of the earliest such systems was Alive in Afghanistan, 11 where people in Afghanistan submitted their reports on accidents and terrorist attacks across Afghanistan. Although crowdsourcing applications can provide accurate and timely information about a crisis for decision making, current crowdsourcing applications still fall short in supporting disaster relief efforts (Gao et al. [15]). Most importantly, current applications do not provide a common mechanism specifically designed for collaboration and coordination between disparate relief organizations. For example, relief organizations that work independently can cause conflicts and complicate the relief efforts. If independent organizations duplicate a response, it will draw resources away from the other areas in need that could use the duplicated supplies. It can also result in delayed response to other disaster areas. Furthermore, because of the noisy and chaotic nature of crowdsourced data, current crowdsourcing applications cannot provide readily useful information for disaster relief efforts. 10 http://www.wikipedia.org/ (accessed March 2012). 11 http://aliveinafghanistan.org/ (accessed March 2012).
Tutorials in Operations Research, c 2012 INFORMS 13 To address these shortfalls of crowdsourcing to a certain extent, an online coordination system, the ASU Coordination Tracker (ACT), was devised. The ACT is an event response coordination system with the primary goal of facilitating multiorganization response (military, governments, NGOs, etc.) to an event such as a disaster and providing relief organizations the means for better collaboration and coordination during a crisis. It is a speedy and effective approach with easy, open communication and streamlined coordination. The major features and advantages of the ACT are summarized below: leveraging crowdsourcing information to provide the means for a groupsourcing response for organizations to assist persons affected by a crisis effectively, developing novel strategies to analyze the requests and maximize the coordinate efforts of crowdsourced data for disaster relief, visualizing the requests to give a global view of the request distribution to facilitate responders, and increasing the coordination efficiency by optimally responding to requests. The ACT consists of five functional modules: request collection, request analysis, response, coordination, and situation awareness. Raw requests from users are collected via crowdsourcing and groupsourcing methods. The request analysis module takes advantage of both data mining technology and expert knowledge to iteratively capture the essential content of raw requests. Based on the demand and disaster background, raw requests are analyzed and classified into various categories. Essential contents extracted from the raw data are then stored in a requests pool. The system visualizes the requests pool on its crisis map, with its selective visibility bonded to the access levels of users (organizations). The response module is designed to allow relief organizations to contribute, receive, and evaluate different response options and plan response actions. Relief organizations are able to respond to requests through the crisis map directly. The coordination model uses the interagency concept (Goolsby [19]) to avoid response conflicts while maintaining centralized control. The ACT displays every available request on the crisis map. Each request is in one of the following four states: available, in process, in delivery, or delivered. To avoid conflicts, relief organizations are not able to respond to requests that are being fulfilled by another organization. A statistics module runs in the background to help track relief progress for situation awareness, relief strategy adjustments, and further decision making. The ACT is a developing system that enables coordination among organizations during disaster relief. We are investing approaches to improve collaboration efficiency and provide differential security to relief organizations and decision makers. The disaster relief ASU Crisis Response Game (Abbasi et al. [1]) has demonstrated that by both leveraging crowdsourcing information and providing means for a groupsourcing response, organizations can effectively assist victims. 4.2. TweetTracker (Kumar et al. [32]) TweetTracker 12 is a Twitter-based analytic and visualization tool. The focus of the tool is to help HADR relief organizations to acquire situational awareness during disasters and emergencies to aid disaster relief efforts (Kumar et al. [32]). New social media platforms, such as Twitter microblogs, demonstrate their value and cability to provide information that is not easily attainable from traditional media. For example, during the Mumbai blasts of 2011, 13 firsthand information from the affected region was available on Twitter moments after the blast. TweetTracker is designed to help to track, analyze, review, and monitor tweets. This is achieved through near-real-time tracking of tweets with specific keywords/hashtags and tweets generated from the region affected by the crisis. The tool supports monitoring and 12 http://tweettracker.fulton.asu.edu/ (accessed March 2012). 13 http://ibnlive.in.com/news/mumbai-blasts-twitter-joins-hands-to-help/167345-3.html (accessed March 2012).
14 Tutorials in Operations Research, c 2012 INFORMS analysis of the collected tweets via real-time trending, data reduction, historical review, and integrated data mining techniques. TweetTracker consists of three main components: (1) a Twitter stream reader, (2) a data storage module, and (3) a data mining and visualization module. The Twitter stream reader is a data collection module that continually crawls tweets through the Twitter streaming API (Application Programming Interface). 14 Tweets are filtered based on user-specified keywords, hashtags, and geolocations. The data storage module is responsible for storing and indexing the collected tweets into a relational database for use by the visualization module. The data mining and visualization module is a Web-based user interface to the collected tweets and a means to analyze the collected tweets. It provides geospatial visualization of tweets related to a particular event on a map, summarizes the tweets, and visualizes the trending keywords in the form of a word cloud, and it can identify popular resources (URLs) and users mentioned in the tweets. The tool also includes built-in language translation support for monitoring of multilingual tweets. TweetTracker has been used in tracking, visualizing, and analyzing activities including the Arab Spring movement, the Occupy Wall Street movement, and various natural disasters such as earthquakes and cholera outbreaks. 5. Summary Valuable information is hidden in vast amounts of social media data, presenting ample opportunities social media mining to discover actionable knowledge that is otherwise difficult to find. Social media data are vast, noisy, distributed, unstructured, and dynamic, which poses novel challenges for data mining. In this tutorial, we offer a brief introduction to mining social media, use illustrative examples to show that burgeoning social media mining is spearheading the social media research, and demonstrate its invaluable contributions to real-world applications. As a main type of big data, social media is finding its many innovative uses, such as political campaigns, job applications, business promotion and networking, and customer services, and using and mining social media is reshaping business models, accelerating viral marketing, and enabling the rapid growth of various grassroots communities. It also helps in trend analysis and sales prediction. Social media data will continue their rapid growth in the foreseeable future. We are faced with an increasing demand for new algorithms and social media mining tools. Existing preliminary success in social media mining research efforts convincingly demonstrates the promising future of the emerging social media mining community and will help to expand research and development and explore online and off-line human behavior and interaction patterns. Acknowledgments The authors thank Huiji Gao, Shamanth Kumar, Jiliang Tang, and DMML members for their assistance and feedback in preparing this tutorial. Some projects described in this brief introductory survey were sponsored by the Office of Naval Research [ONR N000141010091]; the Army Research Office [ARO 025071]; and the National Science Foundation [Grant 0812551]. References [1] M. A. Abbasi, S. Kumar, J. A. Andrade Filho, and H. Liu. Lessons learned in using social media for disaster relief ASU Crisis Response Game. Proceedings of the International Conference on Social Computing, Behavioral-Cultural Modeling, and Prediction. Springer-Verlag, Berlin, 282 289, 2012. [2] N. Agarwal and H. Liu. Modeling and Data Mining in Blogosphere. Morgan & Claypool Publishers, San Rafael, CA, 2009. 14 https://dev.twitter.com/docs/streaming-api (accessed March 2012).
Tutorials in Operations Research, c 2012 INFORMS 15 [3] N. Agarwal, H. Liu, L. Tang, and P. S. Yu. Identifying the influential bloggers in a community. Proceedings of the International Conference on Web Search and Web Data Mining. Association for Computing Machinery, New York, 207 218, 2008. [4] N. Agarwal, H. Liu, L. Tang, and P. S. Yu. Modeling blogger influence in a community. Social Network Analysis and Mining 2(2):139 162, 2012. [5] S. Aral, L. Muchnik, and A. Sundararajan. Distinguishing influence-based contagion from homophily-driven diffusion in dynamic networks. Proceedings of the National Academy of Sciences of the United States of America 106(51):21544, 2009. [6] D. Artz and Y. Gil. A survey of trust in computer science and the Semantic Web. Web Semantics: Science, Services and Agents on the World Wide Web 5(2):58 71, 2007. [7] L. Backstrom and J. Leskovec. Supervised random walks: Predicting and recommending links in social networks. Proceedings of the Fourth ACM International Conference on Web Search and Data Mining. Association for Computing Machinery, New York, 635 644, 2011. [8] L. Backstrom, D. Huttenlocher, J. Kleinberg, and X. Lan. Group formation in large social networks: Membership, growth, and evolution. Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. Association for Computing Machinery, New York, 44 54, 2006. [9] R. Baden, A. Bender, N. Spring, B. Bhattacharjee, and D. Starin. Persona: An online social network with user-defined privacy. ACM SIGCOMM Computer Communication Review 39(4):135 146, 2009. [10] N. T. J. Bailey. The Mathematical Theory of Infectious Diseases and Its Applications. Charles Griffin, High Wycombe, UK, 1975. [11] E. Berger. Dynamic monopolies of constant size. Journal of Combinatorial Theory, Series B 83(2):191 200, 2001. [12] Z. Chu, S. Gianvecchio, H. Wang, and S. Jajodia. Who is tweeting on Twitter: Human, bot, or cyborg? Proceedings of the 26th Annual Computer Security Applications Conference. Association for Computing Machinery, New York, 21 30, 2010. [13] M. Deutsch. Cooperation and trust: Some theoretical notes. Nebraska Symposium on Motivation. University of Nebraska Press, Lincoln, 1962. [14] L. Fang and K. LeFevre. Privacy wizards for social networking sites. Proceedings of the 19th International Conference on World Wide Web. Association for Computing Machinery, New York, 351 360, 2010. [15] H. Gao, G. Barbier, and R. Goolsby. Harnessing the crowdsourcing power of social media for disaster relief. IEEE Intelligent Systems 26(3):10 14, 2011. [16] H. Gao, J. Tang, and H. Liu. Exploring social-historical ties on location-based social networks. Proceedings of the 6th International AAAI Conference on Weblogs and Social Media. Association for the Advancement of Artificial Intelligence, Palo Alto, CA, 2012. [17] H. Gao, X. Wang, G. Barbier, and H. Liu. Promoting coordination for disaster relief: From crowdsourcing to coordination. Proceedings of the 4th Conference on Social Computing, Behavioral-Cultural Modeling and Prediction. Springer-Verlag, Berlin, 197 204, 2011. [18] J. Golbeck and J. Hendler. Inferring binary trust relationships in web-based social networks. ACM Transactions on Internet Technology 6(4):497 529, 2006. [19] R. Goolsby. Social media as crisis platform: The future of community maps/crisis maps. ACM Transactions on Intelligent Systems and Technology 1(1):1 11, 2010. [20] T. Grandison and M. Sloman. A survey of trust in Internet applications. IEEE Communications Surveys & Tutorials 3(4):2 16, 2009. [21] M. Granovetter. Threshold models of collective behavior. American Journal of Sociology 83(6):1420 1443, 1978. [22] R. Gross and A. Acquisti. Information revelation and privacy in online social networks. Proceedings of the 2005 ACM Workshop on Privacy in the Electronic Society. Association for Computing Machinery, New York, 71 80, 2005. [23] P. Gundecha, G. Barbier, and H. Liu. Exploiting vulnerability to secure user privacy on a social networking site. The 17th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. Association for Computing Machinery, New York, 2011.
16 Tutorials in Operations Research, c 2012 INFORMS [24] J. Han, M. Kamber, and J. Pei. Data Mining: Concepts and Techniques. Morgan Kaufmann, San Francisco, 2011. [25] M. Hu and B. Liu. Mining and summarizing customer reviews. Proceedings of the Tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. Association for Computing Machinery, New York, 168 177, 2004. [26] N. Jindal and B. Liu. Identifying comparative sentences in text documents. Proceedings of the 29th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. Association for Computing Machinery, New York, 244 251, 2006. [27] N. Jindal and B. Liu. Opinion spam and analysis. Proceedings of the International Conference on Web Search and Web Data Mining. Association for Computing Machinery, New York, 219 230, 2008. [28] A. M. Kaplan and M. Haenlein. Users of the world, unite! The challenges and opportunities of social media. Business Horizons 53(1):59 68, 2010. [29] D. Kempe, J. Kleinberg, and É. Tardos. Maximizing the spread of influence through a social network. Proceedings of the Ninth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. Association for Computing Machinery, New York, 137 146, 2003. [30] I. Konstas, V. Stathopoulos, and J. M. Jose. On social networks and collaborative recommendation. Proceedings of the 32nd International ACM SIGIR Conference on Research and Development in Information Retrieval. Association for Computing Machinery, New York, 195 202, 2009. [31] B. Krishnamurthy and C. E. Wills. On the leakage of personally identifiable information via online social networks. ACM SIGCOMM Computer Communication Review 40(1):112 117, 2010. [32] S. Kumar, R. Zafarani, and H. Liu. Understanding user migration patterns across social media. Twenty-Fifth International Conference on Artificial Intelligence. Association for the Advancement of Artificial Intelligence, Palo Alto, CA, 2011. [33] T. La Fond and J. Neville. Randomization tests for distinguishing social influence and homophily effects. Proceedings of the 19th International Conference on World Wide Web. Association for Computing Machinery, New York, 601 610, 2010. [34] P. F. Lazarsfeld and R. K. Merton. Friendship as a social process: A substantive and methodological analysis. Freedom and Control in Modern Society 18:18 66, 1954. [35] B. Liu. Sentiment analysis and subjectivity. Handbook of Natural Language Processing. CRC Press, Boca Raton, FL, 627 666, 2010. [36] H. Liu and P. Maes. InterestMap: Harvesting social network profiles for recommendations. Workshop: Beyond Personalization, San Diego, 2005. [37] H. Liu and H. Motoda. Computational Methods of Feature Selection. Chapman & Hall, Boca Raton, FL, 2008. [38] H. Ma, D. Zhou, C. Liu, M. R. Lyu, and I. King. Recommender systems with social regularization. Proceedings of the Fourth ACM International Conference on Web Search and Data Mining. Association for Computing Machinery, New York, 287 296, 2011. [39] M. W. Macy. Chains of cooperation: Threshold effects in collective action. American Sociological Review 56(6):730 747, 1991. [40] V. Mahajan, E. Muller, and F. M. Bass. New product diffusion models in marketing: A review and directions for research. Journal of Marketing 54(1):1 26, 1990. [41] S. P. Marsh. Formalising trust as a computational concept. Ph.D. thesis, Deptartment of Computing Science and Mathematics, University of Stirling, Stirling, UK, 1994. [42] M. McPherson, L. Smith-Lovin, and J. M. Cook. Birds of a feather: Homophily in social networks. Annual Review of Sociology 27:415 444, 2001. [43] S. T. Moturu and H. Liu. Quantifying the trustworthiness of social media content. Distributed and Parallel Databases 29(3):239 260, 2011. [44] L. Mui, M. Mohtashemi, and A. Halberstadt. A computational model of trust and reputation for E-businesses. Proceedings of the 35th Annual Hawaii Conference on System Sciences (HICSS 02). IEEE Computer Society, Washington, DC, 2431 2439, 2002.
Tutorials in Operations Research, c 2012 INFORMS 17 [45] A. Narayanan and V. Shmatikov. Robust de-anonymization of large sparse datasets. Proceedings of the 2008 IEEE Symposium on Security and Privacy. IEEE Computer Society, Washington, DC, 111-125, 2008. [46] A. Narayanan and V. Shmatikov. De-anonymizing social networks. Proceedings of the 2009 IEEE Symposium on Security and Privacy. IEEE Computer Society, Washington, DC, 173 187, 2009. [47] D. Olmedilla, O. Rana, B. Matthews, and W. Nejdl. Security and trust issues in semantic grids. Proceedings of the Dagsthul Seminar, Semantic Grid: The Convergence of Technologies 5271:896 902, 2005. [48] B. Pang and L. Lee. Opinion mining and sentiment analysis. Foundations and Trends R in Information Retrieval 2(1 2):1 135, 2008. [49] B. Pang, L. Lee, and S. Vaithyanathan. Thumbs up? Sentiment classification using machine learning techniques. Proceedings of the ACL-02 Conference on Empirical Methods in Natural Language Processing, Vol. 10. Association for Computational Linguistics, Stroudsburg, PA, 79 86, 2002. [50] A. M. Popescu and O. Etzioni. Extracting product features and opinions from reviews. Proceedings of the Conference on Human Language Technology and Empirical Methods in Natural Language Processing. Association for Computational Linguistics, Stroudsburg, PA, 339 346, 2005. [51] E. Riloff and J. Wiebe. Learning extraction patterns for subjective expressions. Proceedings of the 2003 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, Stroudsburg, PA, 105 112, 2003. [52] D. Rosenblum. What anyone can know: The privacy risks of social networking sites. IEEE Security and Privacy 5(3):40 49, 2007. [53] A. C. Squicciarini, M. Shehab, and F. Paci. Collective privacy management in social networks. Proceedings of the 18th International Conference on World Wide Web. Association for Computing Machinery, New York, 521 530, 2009. [54] P. Sztompka. Trust: A Sociological Theory. Cambridge University Press, Cambridge, UK, 1999. [55] P.-N. Tan, M. Steinbach, and V. Kumar. Introduction to Data Mining. Pearson Addison Wesley, Boston, 2006. [56] L. Tang and H. Liu. Community Detection and Mining in Social Media, Vol. 2. Morgan & Claypool Publishers, San Rafael, CA, 2010. [57] J. Tang, H. Gao, and H. Liu. mtrust: Discerning multi-faceted trust in a connected world. Proceedings of the Fifth ACM International Conference on Web Search and Data Mining. Association for Computing Machinery, New York, 93 102, 2012. [58] P. D. Turney. Thumbs up or thumbs down?: Semantic orientation applied to unsupervised classification of reviews. Proceedings of the 40th Annual Meeting on Association for Computational Linguistics. Association for Computational Linguistics, Stroudsburg, PA, 417 424, 2002. [59] I. H. Witten, E. Frank, and M. A. Hall. Data Mining: Practical Machine Learning Tools and Techniques. Morgan Kaufmann, San Francisco, 2011. [60] G. Wondracek, T. Holz, E. Kirda, and C. Kruegel. A practical attack to de-anonymize social network users. Proceedings of the 2010 IEEE Symposium on Security and Privacy. IEEE Computer Society, Washington, DC, 223 238, 2010. [61] S. Yardi, D. Romero, G. Schoenebeck, and D. Boyd. Detecting spam in a Twitter network. First Monday 15(1):1 4, 2009. [62] H. Yu and V. Hatzivassiloglou. Towards answering opinion questions: Separating facts from opinions and identifying the polarity of opinion sentences. Proceedings of the 2003 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, Stroudsburg, PA, 129 136, 2003. [63] Z. A. Zhao and H. Liu. Spectral Feature Selection for Data Mining. Chapman & Hall/CRC Press, Virginia Beach, VA, 2012. [64] E. Zheleva and L. Getoor. To join or not to join: the illusion of privacy in social networks with mixed public and private user profiles. Proceedings of the 18th International Conference on World Wide Web. Association for Computing Machinery, New York, 531 540, 2009.