2.1 Unconstrained Graph Partitioning. 1.2 Contributions. 1.3 Related Work. 1.4 Paper Organization 2. GRAPH-THEORETIC APPROACH



Similar documents
10 Evaluating the Help Desk

Candidate: Shawn Mullane. Date: 04/02/2012

Candidate: Suzanne Maxwell. Date: 09/19/2012

8 Service Level Agreements

11 Success of the Help Desk: Assessing Outcomes

7 Help Desk Tools. Key Findings. The Automated Help Desk

GUIDELINE. Guideline for the Selection of Engineering Services

9 Setting a Course: Goals for the Help Desk

6 Funding and Staffing the Central IT Help Desk

Using GPU to Compute Options and Derivatives

Introduction to HBase Schema Design

TrustSVD: Collaborative Filtering with Both the Explicit and Implicit Influence of User Trust and of Item Ratings

Corporate performance: What do investors want to know? Innovate your way to clearer financial reporting

Candidate: Cassandra Emery. Date: 04/02/2012

Candidate: Kevin Taylor. Date: 04/02/2012

Candidate: Kyle Jarnigan. Date: 04/02/2012

Mining Social Media with Social Theories: A Survey

Designing an Authentication Strategy

Document management and records (based in part upon materials by Frank Upward and Robert Hartland)

An unbiased crawling strategy for directed social networks

On a Generalized Graph Coloring/Batch Scheduling Problem

Curriculum development

Dimension Debasing towards Minimal Search Space Utilization for Mining Patterns in Big Data

Compensation Approaches for Far-field Speaker Identification

Planning a Smart Card Deployment

Every manufacturer is confronted with the problem

Enabling Advanced Windows Server 2003 Active Directory Features

Data De-duplication from the data sets using Similarity functions

The Boutique Premium. Do Boutique Investment Managers Create Value? AMG White Paper June

Designing and Deploying File Servers

Deploying Network Load Balancing

Spectrum Balancing for DSL with Restrictions on Maximum Transmit PSD

Motorola Reinvents its Supplier Negotiation Process Using Emptoris and Saves $600 Million. An Emptoris Case Study. Emptoris, Inc.

Candidate: Charles Parker. Date: 01/29/2015

Planning a Managed Environment

Effective governance to support medical revalidation

ASAND: Asynchronous Slot Assignment and Neighbor Discovery Protocol for Wireless Networks

A Contemporary Approach

Optimal Trust Network Analysis with Subjective Logic

WHITE PAPER. Filter Bandwidth Definition of the WaveShaper S-series Programmable Optical Processor

A taxonomy of knowledge management software tools: origins and applications

Introducing Revenue Cycle Optimization! STI Provides More Options Than Any Other Software Vendor. ChartMaker Clinical 3.7

Herzfeld s Outlook: Seasonal Factors Provide Opportunities in Closed-End Funds

NAPA TRAINING PROGRAMS FOR:

Closer Look at ACOs. Making the Most of Accountable Care Organizations (ACOs): What Advocates Need to Know

Roth 401(k) and Roth 403(b) Accounts: Pay Me Now or Pay Me Later Why a Roth Election Should Be Part of Your Plan Now

CRM Customer Relationship Management. Customer Relationship Management

Optimal Personalized Filtering Against Spear-Phishing Attacks

The Intelligent Choice for Basic Disability Income Protection

Closer Look at ACOs. Designing Consumer-Friendly Beneficiary Assignment and Notification Processes for Accountable Care Organizations

STI Has All The Pieces Hardware Software Support

KEYS TO BEING AN EFFECTIVE WORKPLACE PERSONAL ASSISTANT

Research on Pricing Policy of E-business Supply Chain Based on Bertrand and Stackelberg Game

Practical Tips for Teaching Large Classes

Planning and Implementing An Optimized Private Cloud

Facilities. Car Parking and Permit Allocation Policy

Planning an Active Directory Deployment Project

FINANCIAL FITNESS SELECTING A CREDIT CARD. Fact Sheet

Inferring Continuous Dynamic Social Influence and Personal Preference for Temporal Behavior Prediction

The Good Governance Standard for Public Services

Regular Specifications of Resource Requirements for Embedded Control Software

The Intelligent Choice for Disability Income Protection

Member of the NKT Group. We connect renewable energy sources. Onshore, offshore and photovoltaic

Purposefully Engineered High-Performing Income Protection

Sickness Absence in the UK:

Research on Staff Explicitation in Organizational Knowledge Management Based on Fuzzy Set Similarity to Ideal Solution

High Availability for Microsoft SQL Server Using Double-Take 4.x

The Good Governance Standard for Public Services

High Availability for Internet Information Server Using Double-Take 4.x

CRM Customer Relationship Management. Customer Relationship Management

Executive Coaching to Activate the Renegade Leader Within. Renegades Do What Others Won t To Get the Results that Others Don t

Kentucky Deferred Compensation (KDC) Program Summary

Evolutionary Path Planning for Robot Assisted Part Handling in Sheet Metal Bending

The Time is Now for Stronger EHR Interoperability and Usage in Healthcare

5 High-Impact Use Cases of Big Data Analytics for Optimizing Field Service Processes

SYSTEM OF CONFORMITY ASSESSMENT SCHEMES FOR ELECTROTECHNICAL EQUIPMENT

Designing a TCP/IP Network

I Symbolization J,1 II e L~ "-"-:"u"'dll... Table I: The kinds of CGs and their classification, (where, t - a local neighbourhood topology)

The bintec HotSpot Solution. Convenient internet access anywhere

Towers Watson Manager Research

Position paper smart city. economics. a multi-sided approach to financing the smart city. Your business technologists.

A Spare Part Inventory Management Model for Better Maintenance of Intelligent Transportation Systems

Make the College Connection

FaceTrust: Assessing the Credibility of Online Personas via Social Networks

EMC ViPR. Concepts Guide. Version

Periodized Training for the Strength/Power Athlete

A Novel QR Code and mobile phone based Authentication protocol via Bluetooth Sha Liu *1, Shuhua Zhu 2

Introducing ChartMaker Cloud! STI Provides More Options Than Any Other Software Vendor

The Institute Of Commercial Management. Prospectus. Start Your Career Here!

Social Work Bursary: Academic year 2015/16 Application notes for students on undergraduate courses

Welcome to UnitedHealthcare. Ideally, better health coverage should cost less. In reality, now it can.

HSBC Internet Banking. Combined Product Disclosure Statement and Supplementary Product Disclosure Statement

How To Plan A Cloud Infrastructure

Social Work Bursary: Academic Year 2014/15 Application notes for students on postgraduate courses

Resource Pricing and Provisioning Strategies in Cloud Systems: A Stackelberg Game Approach


Opening the Door to Your New Home

PRIN 2007: D-ASAP project


Transcription:

Mining Newsgrops Using Networks Arising From Social Behavior Rakesh Agrawal Sridhar Rajagopalan Ramakrishnan Srikant Yirong X IBM Almaden Research Center 6 Harry Road, San Jose, CA 95120 ABSTRACT Recent advances in information retrieval over hyperlinked corpora have convincingly demonstrated that links carry less noisy information than text. We investigate the feasibility of applying link-based methods in new applications domains. The specific application we consider is to partition athors into opposite camps within a given topic in the context of newsgrops. A typical newsgrop posting consists of one or more qoted lines from another posting followed by the opinion of the athor. This social behavior gives rise to a network in which the vertices are individals and the links represent responded-to relationships. An interesting characteristic of many newsgrops is that people more freqently respond to a message when they disagree than when they agree. This behavior is in sharp contrast to the WWW link graph, where linkage is an indicator of agreement or common interest. By analyzing the graph strctre of the responses, we are able to effectively classify people into opposite camps. In contrast, methods based on statistical analysis of text yield low accracy on sch datasets becase the vocablary sed by the two sides tends to be largely identical, and many newsgrop postings consist of relatively few words of text. Categories and Sbject Descriptors H.2.8 [Database Management]: Database Applications Data Mining; J.4 [Compter Applications]: Social And Behavioral Sciences; G.2.2 [Discrete Mathematics]: Graph Theory Graph Algorithms General Terms Algorithms Keywords Data Mining, Web Mining, Text Mining, Link Analysis, Newsgrop, Social Network 1. INTRODUCTION Information retrieval has recently witnessed remarkable advances, feled almost entirely by the growth of the web. The fndamental featre distingishing recent forms of information retrieval from the classical forms [25] is the pervasive se of link information [3] [5]. Drawing pon the sccess of search engines that se linkage information extensively, we stdy the feasibility of extending linkbased methods to new application domains. Interactions between individals have two components: Copyright is held by the athor/owner(s). WWW2003, May 20 24, 2003, Bdapest, Hngary. ACM 1-58113-6-3/03/0005. 1. The content of the interaction the text. 2. The choice of person who an individal chooses to interact with the link. Making determinations abot vales, opinions, biases and jdgments prely from a statistical analysis of text is hard: these operations reqire a more detailed lingistic analysis of content [24] [28]. However, relatively simple methods applied to the link graph might discern a great deal. 1.1 Application Domain We selected the mining of newsgrop discssions as or sandbox for this stdy. We chose this domain for the following reasons. Newsgrops are rich sorces of pblicly available discorse on any conceivable topic. For instance, Google has integrated the past 20 years of Usenet archives into Google Grops (grops.google.com), and offers access to more than 0 million messages dating back to 1981. Newsgrops discssions are mostly open, frank, and nadlterated. Newsgrop postings can provide a qick plse on any topic. Finally, extracting sefl information from a newsgrop sing conventional text mining techniqes has been hard becase the vocablary sed in the two sides of an isse is generally identical and becase individal postings tend to be sparse. The strctre of newsgrop postings Newsgrop postings tend to be largely discssion oriented. A newsgrop discssion on a topic typically consists of some seed postings, and a large nmber of additional postings that are responses to a seed posting or responses to responses. Responses typically qote explicit passages from earlier postings. We can se these qoted passages to infer an implicit social network between individals participating in the newsgrop. DEFINITION 1 (QUOTATION LINK). There is a qotation link between person and person if has qoted from an earlier posting written by. Qotation links have several interesting social characteristics. First, they are created withot mtal concrrence: the person qoting the text does not need the permission of the athor to qote. Second, in many newsgrops, qotation links are sally antagonistic : it is more likely that the qotation is made by a person

A challenging or rebtting it rather than by someone spporting it. In this sense, qotation links are not like the web where linkage tends to imply a tacit endorsement. 1.2 Contribtions Or target application is the categorization of the athors on a given topic into two classes: those who are for the topic and those who are against. We develop a graph-theoretic algorithm for this task that completely disconts the text of the postings and only ses the link strctre of the network of interactions. Or empirical reslts show that the link-only algorithm achieves high accracy on this hard problem. Figre 1 shows an example of the reslts prodced by or algorithm when applied to social network extracted from newsgrop discssions on the claim handling of a specific insrance company. 1.3 Related Work The work of pioneering social psychologist Milgram [19] set the stage for investigations into social networks and algorithmic aspects of social networks. There have been more recent efforts directed at leveraging social networks algorithmically for diverse prposes sch as expertise location [17], detecting frad in celllar commnications [9], and mining the network vale of cstomers [8]. In particlar, Schwartz and Wood [26] constrct a graph sing email as links, and analyze the graph to discover shared interests. While their domain (like ors) consists of interactions between people, their links are indicators of common interest, not antagonism. Related research incldes work on incorporating the relationship between objects into the classification process. Chakrabarti et al. [6] showed that incorporating hyperlinks into the classifier can sbstantially improve the accracy. The work by Neville and Jensen [22] classifies relational data sing an iterative method where properties of related objects are dynamically incorporated to improve accracy. These properties inclde both known attribtes and attribtes inferred by the classifier in previos iterations. Other work along these lines inclde co-learning [2] [20] and probabilistic relational models [10]. Also related is the work on incorporating the clstering of the test set (nlabeled data) when bilding the classification model [13] [23]. Pang et al. [24] classify the overall sentiment (either positive or negative) of movie reviews sing text-based classification techniqes. Their domain appears to have sfficient distingishing words between the classes for text-based classification to do reasonably well, thogh interestingly they also note that common vocablary between the two sides limits classification accracy. 1.4 Paper Organization The rest of the paper is organized as follows. In Section 2, we give a graph theoretic formlation of the problem. We present reslts on real-life datasets in Section 3, inclding a comparison with alternate approaches. Section 4 describes sensitivity experiments sing synthetic data. We conclde with a smmary and directions for ftre work in Section 5. 2. GRAPH-THEORETIC APPROACH Consider a graph where the vertex set has a vertex per participant within the newsgrop discssion. Therefore the total nmber of vertices in the graph is eqal to the nmber of distinct participants. An edge,!, indicates that person " has responded to a posting by person #. 2.1 Unconstrained Graph Partitioning Optimm Partitioning Consider any bipartition of the vertices into two sets $ and %, representing those for and those against an isse. We assme $ and % to be disjoint and complementary, i.e., $'&(%)* and $'+%*-,. Sch a pair of sets can be associated with the ct fnction,./ $0 1%2 43 5+6 $*76%23, the nmber of edges crossing from $ to %. If most edges in a newsgrop graph represent disagreements, then the following holds: PROPOSITION 1. The optimm choice of $ and % maximizes./ $8 %. and %, the edges 9+: $;7-% are those that represent antagonistic responses, and the remainder of the edges represent reinforcing interactions. Therefore, if one were to have an algorithm for compting $ and % optimizing. as above, then we wold have a graph theoretic approach to classifying people in the newsgrop discssions based solely on link information. This problem, known as the max ct problem, is known to be NP-complete, and indeed was one of those shown to be so by Karp in his landmark paper [15]. The sitation on the problem remained nchanged ntil 1995, when Goemans and Williamson [11] introdced the idea of sing methods from Semidefinite Programming to approximate the soltion with garanteed bonds on the error better than the naive vale of <=. Semidefinite programming methods involve a lot of machinery, and in practice, their efficacy is sometimes qestioned [14]. For sch a choice of $ Efficient Soltion Rather than sing semidefinite approaches, we will resort to spectral partitioning [27] for comptational efficiency reasons. In doing so, we exploit particlarly two additional facts that hold in or sitation: 1. Rather than being a general graph, we have an instance which is largely a bipartite graph with some noise edges added. 2. Neither side of the bipartite graph is mch smaller than the other, i.e., it is not the case that 3 $>3"??*3 %@3 or vice versa. In sch sitations, we can transform the problem into a minweight approximately balanced ct problem, which in trn can be well approximated by comptationally simple spectral methods. We detail this process next. Consider the co-citation matrix of the graph. This graph, is a graph on the same set of vertices as. There B C is a weighted edge DE FGHÏFJK A in of weight L if and only if there are exactly L vertices /MMNMO sch that each edge FGH P and F is in. In other words, L measres the nmber of people that F and F have both responded to. Clearly, L can be sed as a measre of similarity, and we can se spectral (or any other) clstering methods to clster the vertex set into classes. This leads to the following observations. OBSERVATION A 1 (EV ALGORITHM). The second eigenvector of Q C is a good approximation of the desired bipartition of. OBSERVATION 2 (EV+KL ALGORITHM). Kernighan-Lin heristic [18] on top of spectral partitioning can improve the qality of partitioning [16].

R V Figre 1: Claim handling of an insrance company. 2.2 Constrained Graph Partitioning A limitation of the approach discssed so far is that while high accracy implies high ct weight, the reverse may not be necessarily tre. We may need to embellish or partitioning algorithms with some focsing to ndge them towards partitions that have high ct weight as well as high accracy. The basic idea will be to manally categorize a small nmber of prolific posters and tag the corresponding vertices in the graph. We then se this information to bootstrap a better overall partitioning by enforcing the constraint that those classified on one side by hman effort shold remain on that side dring the algorithmic partitioning of the graph. We can now define the constrained graph partitioning problem: PROBLEM 1 (CONSTRAINED PARTITIONING). Given a graph, and two sets of vertices S8T and S0U, constrained to be in the sets and W R respectively, find a bipartition of that respects this con- V8Z W2[. straint bt otherwise optimizes X/Y We can achieve the above partitioning by sing a simple artifice: all the positive vertices are condensed into a single vertex, and similarly for the negative vertices, before rnning the partitioning algorithm. Since EV and EV+KL algorithms cannot split the single vertex, all we need to check is that the final reslt has these two special vertices on the correct sides. The corresponding algorithms will be referred to as the Constrained EV and Constrained EV+KL algorithms. 3. EMPIRICAL EVALUATION We next present the reslts of or empirical evalation of the link-only approach. 3.1 Experimental Setp 3.1.1 Raw Data We created three datasets from the archives of the Usenet postings: \ Abortion: The dataset consists of the 13,642 postings in talk.abortion that contain the words Roe and Wade. \ Gn : The dataset consists of the 12,029 postings in talk.politics.gns that inclde the words gn, control, and opinion. \ Immigration: The dataset consists of the 10,285 postings in alt.politics.immigration that inclde the word jobs. 3.1.2 Extracting Social Network As stated in Section 1, links are implicit in the postings in the form of in response to tags. We retained from the raw data only those postings that contained both the athor and the person whom the athor was responding to. Each sch posting yields a link. We were ths able to se 65% to % of the postings (see Figre 2). The remaining postings were either not responses, or we were not able to find or match the name of the original poster. Becase of the sampled natre of the data, if we were to form the graph corresponding to all the responses, parts of the graph wold contain too little information to be analyzed effectively [26]. Each data set contained a core connected component (see Figre 3) that comprised almost all the postings within the data set. We, therefore, omitted all vertices (and corresponding docments) that do not connect to the core component. Figre 3 also shows the total nmber of athors and the average nmber of postings per athor

Abortion Gn Immigration Postings 13,642 12,029 10,285 Extracted Responses 10,917 8,640 6,691... as % of Postings % 72% 65% Figre 2: Nmber of response postings extracted from the raw data. Abortion Gn Immigration Responses 10,917 8,640 6,691 Retained Responses (R) 10,821 8,3 6,473 Athors (A) 2,525 2,632 1,648 Ratio (R/A) 4.3 3.2 3.9 Figre 3: Characteristics of the core component. Between 97% and 99% of the extracted responses were in a single connected sbgraph. Abortion Gn Immigration Positive 29 14 28 Negative 22 36 24 Figre 5: Distribtion of People in the Test Data. Abortion Gn Immig- ration (a) Majority Assignment 57% 72% 54% (b) SVMTorch II 55 42 55 Naive Bayes 72 54 (c) Iterative Classification 67 83 (d) EV 73 78 EV+KL 75 74 52 (e) Constrained EV 73 84 88 Constrained EV+KL 73 82 88 Figre 6: Accracy (in percentage). in the core component. 3.1.3 Links To nderstand the natre of links in the newsgrops, we manally examined responses (split eqally between the three datasets). We fond two interesting behaviors: ] The relationship between the two individals in the newsgrop network is mch more likely to be antagonistic than reinforcing. ] Many athors go off-topic and case the discssion to drift off to an nrelated sbject. We smmarize the reslts in Figre 4. Overall, arond 74% of the responses were antagonistic, 7% reinforcing, and 19% off-topic. The reslts were qite niform across all 3 datasets. 3.1.4 Test Data For each dataset, we manally tagged abot random people (not postings) into positive and negative categories to generate the test set for or experiments. Figre 5 gives the nmber of people in each category for the three datasets. 3.2 Reslts Before presenting reslts for the graph-theoretic approach, we first give reslts obtained sing two classification-basedapproaches: text classification and iterative classification. These reslts shold be viewed primarily as reference points for whether link-based methods can yield good reslts. 3.2.1 Text-based Classification Abortion Gn Immigration Overall Total 34 33 33 Antagonistic 26 23 25 74 Reinforcing 1 4 2 7 Off-topic 7 6 6 19 Figre 4: Analysis of responses. We experimented with two commonly sed text classification methods: Spport Vector Machines [4] [29] and Naive Bayes classifiers [12] [21]. For SVM, we sed Ronan Collobert s SVMTorch II [7]. For Naive Bayes, we sed the implementation described in [1]. The training set for text-based classification was obtained by taking the postings of each of the manally tagged ser and treating them as one docment (after dropping qoted sections). The 10- fold cross-validation reslts for both Naive Bayes and SVM are shown in Figre 6(b). We also show in Figre 6(a) the accracy for the three datasets if everyone was simply assigned to the majority class. Both SVM and Naive Bayes were nable to distingish the two classes, for any of the datasets. The only apparent exception is the 72% accracy on Gn with Naive Bayes; however, this reslt was obtained by classifying all examples into the majority class. The reason for the low accracy is that in a newsgrop discssion the vocablary sed by two sides is qite similar. Meaningfl words are contained eqally freqently in both the positive and negative classes, and the distingishing words are meaningless. 1 3.2.2 Iterative Classification Next, we experimented with an algorithm based on the iterative classification proposed in [22]. The proposal in [22] sed both the text as well as the link strctre. However, given the inadeqate accracy of the text-based classifiers, we experimented with a prely link-based version, which we describe next. Let the total nmber of iterations be ^. In each iteration _ : 1. Use links from labeled data to predict class labels on nlabeled data. 2. Sort predicted labels by confidence. 3. Accept ` class labels, where `*acbed_ fh^dg, and b is the nmber of instances in the test data. Let hjilk be the weight of the link between vertices m#i and mnk. Let the vertices in the training set have scores of either +1 or -1 1 More sophisticated methods may potentially improve the accracy of text-based classification. Unlike link-based methods, these methods can also classify loners people who neither respond to other postings nor have any responses to their postings.

t ~ (depending on their class). The score for labeled vertices in the test set is their score in the previos iteration; the score for nlabeled vertices is 0. The score o for a vertex p"q in the test set is compted as: orp qs := tvxw orp szy({ q {jq The sign of orp#q s gives the predicted class label, and orp#q s gives the confidence of the prediction. Figre 6(c) gives the 10-fold cross-validation reslts for this algorithm. Clearly, iterative classification sing the link strctre does mch better than the text-based classification. In some sense, this algorithm can be considered as a heristic graph partitioning algorithm. Ths the qestion is: will the traditional graph partitioning algorithms do better than this heristic approach? 3.2.3 Unconstrained Partitioning Figre 6(d) shows accracy reslts obtained sing the nconstrained EV and EV+KL algorithms for the three datasets. Unlike the two earlier approaches, nconstrained partitioning does not need the manally-labelled postings as training data; ths the accracy was compted simply by checking how many labeled individals were assigned correct partitions. Note that the iterative classification algorithm shold really be compared to the constrained graph partitioning algorithm (that also ses training data), not the nconstrained graph partitioning. Nevertheless, it is qite interesting that even nconstrained graph partitioning does qite well on two of the datasets: Abortion and Gn. It does abot as well as iterative classification, and mch better than text-based classification. We will explain shortly why the accracy obtained is not high for the third dataset (Immigration). KL marginally improves EV for Abortion and Immigration, bt degrades EV for Gn. Althogh the reslts for Gn are not that mch better than assigning everything to the majority class, these reslts were not obtained by assigning everything to the majority class. In fact, all the errors were de to the misassignments of majority class to minority class. This phenomenon can be explained as follows. When the distribtion of classes is skewed, it is qite likely that there will be some majority class people who will interact more with other majority class people than with the minority class. So they will be misclassified. On the other hand, most of the minority class people will be correctly classified since they will interact mch more with the majority class. To nderstand why EV did not do well on Immigration, we ran KL 0 times starting with different random partitions, and measred ct weight verss accracy. The reslts are shown in Figre 7. This figre confirms the observation we made in Section 2 that high accracy implies high ct weight, bt not vice versa. For Immigration, accracy for a ct weight of 84% ranges between % and 87%. We therefore need to se constrained partitioning to ndge EV towards partitions that have high ct weight as well as high accracy. 3.2.4 Constrained Graph Partitioning Figre 6(e) shows the reslt of 10-fold cross-validation obtained by choosing different small sbsets of the hand classified vertices as the manally tagged ones. We see that the accracy reslts (compared to nconstrained graph partitioning) are now abot the same Accracy Immigration 76 78 82 84 86 88 Ct Weight Figre 7: While high accracy implies high ct weight, the reverse is not tre. } Nmber of sers 0 Average postings per ser 3 for Zipf distribtion of postings 0.8 Fraction of sers who are Positive 0.6 Prity 0.8 Figre 8: Baseline parameters for the synthetic data experiments. In each case, when we varied one or other of or parameters, the others were kept fixed at the baseline levels. for Abortion, better for Gn, and dramatically better for Immigration. Constrained graph partitioning does consistently (4% to 6%) better than iterative classification on all three datasets. 4. SENSITIVITY EXPERIMENTS In this section, we explore the impact of varios data set parameters on the performance of the algorithms. Since it is infeasible to obtain real world data for a variety of parameter settings, we resort to synthetic data. We do this stdy largely to verify that the proposed algorithms are not dependent on some pecliarities of the datasets sed in the experiments, and do work nder a variety of settings. 4.1 Synthetic Data Generation Figre 8 gives the baseline parameters sed to create the synthetic datasets. We chose to make or classes nbalanced, i.e. % of the sers are chosen in the majority (in or case, positive) class. Prity is the fraction of links that cross from one class to the other; i.e. prity determines the expected nmber of antagonistic links in the network. The following is a description of the data generation algorithm. 1. For each athor p, the nmber of responses ƒ that p posts is a random variable drawn from a Zipf distribtion [30] with mean ~ and theta. All 3 real datasets follow a Zipf distribtion for the nmber of postings verss rank of athor, as shown in Figre 9 for the Gn dataset. 2. Randomly set fraction of athors as for and the remaining as against. 3. For each athor, select the other sers this athor responds to. Let athor p have ƒ postings assigned in Step 1. For each of the ƒ postings:

# Postings 0 10 Gn "real/gn-post-rf.data" 0/x^0.8 Accracy Con Ev+KL Con Ev Ev+KL Ev 1 1 10 0 Rank Figre 9: Prolificity follows a Zipf distribtion. 0.5 0.6 0.7 0.8 0.9 1 Prity Figre 11: Accracy follows prity when the nmber of postings is small. Accracy Con Ev+KL Con Ev Ev+KL Ev Accracy 10 Postings 5 Postings 3 Postings 2 Postings 1 2 3 4 5 6 7 8 9 10 Nmber of Postings Figre 10: Accracy rapidly increases with the nmber of postings. First, with probability, the ser is picked from the opposite side, and with probability ˆ from the same side. Within the set of sers on either side, a random ser is chosen to complete the link. 4.2 Experimental Reslts In the first experiment, we vary the average nmber of postings per ser, while keeping prity fixed at 0.8. Ths the ratio of antagonistic links to reinforcement links is fixed. However, what affect accracy is not the overall fraction of reinforcement links, bt rather how many sers have fewer antagonistic links than reinforcement links. Not only will sch sers be classified incorrectly, they also make it mch harder to correctly classify the people they are connected to. As the average nmber of postings per ser increases, the probability that a specific ser has more antagonistic links than reinforcement links increases significantly. Therefore accracy increases correspondingly, as shown in Figre 10. In the second experiment, we hold the average nmber of postings per athor fixed at 3 and vary prity parameter. Since the nmber of postings per ser is small, the nmber of sers with more antagonistic links than reinforcement links largely follows the prity fraction, and correspondingly accracy also follows prity, as shown in Figre 11. However, when the nmber of postings is large, we can get high accracies even if a large fraction of interactions are between peo- 0.5 0.6 0.7 0.8 0.9 1 Prity Figre 12: Accracy jmps with the nmber of postings, for a given prity factor (EV+KL Algorithm). ple on the same side. The explanation is the same as that for the first experiment. Figre 12 shows the accracy of the EV+KL algorithm for average nmber of postings ranging from 2 to 10, and prity ranging from 0.5 to 1. Notice that even for a low prity of 0.7 (where 30% of the postings are between people on the same side), we are able to get an accracy of 85% when there are 10 postings per ser. At a prity of 0.8, we get 98% accracy. Ths, given sfficient interactions, the algorithm is ndistrbed by a sbstantial fraction of reinforcement links. 5. CONCLUSIONS Drawing pon the hge sccess of exploiting the link information in comptations over hypertext corpora, we conjectred that links arising ot of who-interacts-with-whom might be more valable than the text of the interaction. We tested this conjectre in a setting where we completely ignored the text and relied only on links. The application we considered in this experiment was the classification of people on the two sides of an isse nder discssion in a Usenet newsgrop. We developed a links-only algorithm for this task that exhibited significant accracy advantage over the classical text-based algorithms. These accracy reslts are qite interesting, particlarly when one considers that we did not even ascertain whether a re-

sponse was antagonistic or reinforcing, for that wold have violated the no-looking-into-the-text dictm. We simply assmed that all links in the network were antagonistic. As long as the nmber of reinforcing responses are relatively few, or there are enogh postings per athor, the algorithm can withstand this error in bilding the network. For ftre work, it will be interesting to explore if the accracy of the proposed techniqes can be frther improved by incorporating advanced lingistic analysis of the text information. More generally, it will be interesting to stdy other social behaviors and investigate how we can take advantage of links implicit in them. Acknowledgments We wish to thank Xianping Ge for his help with the text-based classification experiments. 6. REFERENCES [1] R. Agrawal, R. Bayardo, and R. Srikant. Athena: Mining-based Interactive Management of Text Databases. In Proc. of the Seventh Int l Conference on Extending Database Technology (EDBT), Konstanz, Germany, March 2000. [2] A. Blm and T. Mitchell. Combining labeled and nlabeled data with co-training. In COLT: Proc. of the Workshop on Comptational Learning Theory. Morgan Kafmann Pblishers, 1998. [3] A. Broder and M. Henzinger. Information retrieval on the web: Tools and algorithmic isses. In Fondations of Compter Science, Invited Ttorial, 1998. [4] C. J. C. Brges. A ttorial on spport vector machines for pattern recognition. Data Mining and Knowledge Discovery, 2(2):121 167, 1998. [5] S. Chakrabarti. Data mining for hypertext: A ttorial srvey. SIGKDD Explorations, 1, 2000. [6] S. Chakrabarti, B. Dom, and P. Indyk. Enhanced hypertext categorization sing hyperlinks. In Proc. of the ACM SIGMOD Conf. on Management of Data, Seattle, Washington, May 1998. [7] R. Collobert and S. Bengio. SVMTorch: Spport vector machines for large-scale regression problems. Jornal of Machine Learning Research, 1:143 1, 2001. [8] P. Domingos and M. Richardson. Mining the network vale of cstomers. In 7th Int l Conference on Knowledge Discovery in Databases and Data Mining, pages 57 66, San Francisco, California, Agst 2001. [9] T. Fawcett and F. J. Provost. Adaptive frad detection. Data Mining and Knowledge Discovery, 1(3):291 316, 1997. [10] N. Friedman, L. Getoor, D. Koller, and A. Pfeffer. Learning probabilistic relational models. In IJCAI, pages 1300 1309, 1999. [11] M. X. Goemans and D. P. Williamson..878-approximation algorithms for MAX CUT and MAX 2SAT. In Proc. of the Twenty-Sixth Annal ACM Symposim on the Theory of Compting, pages 422 431, Montréal, Qébec, Canada, 23 25 May 1994. [12] I. Good. The Estimation of Probabilities: An Essay on Modern Bayesian Methods. M.I.T. Press, 1965. [13] T. Joachims. Transdctive inference for text classification sing spport vector machines. In I. Bratko and S. Dzeroski, editors, Proc. of ICML-99, 16th Int l Conference on Machine Learning, pages 200 209, Bled, SL, 1999. Morgan Kafmann Pblishers, San Francisco, US. [14] H. Karloff. How good is the Goemans-Williamson MAX CUT algorithm? In Proc. of the Twenty-Eighth Annal ACM Symposim on the Theory of Compting, pages 427 434, Philadelphia, Pennsylvania, 22 24 May 1996. [15] R. M. Karp. Redcibility among combinatorial problems. In R. E. Miller and J. W. Thatcher, editors, Complexity of Compter Comptations, pages 85 103. Plenm Press, New York, 1975. [16] G. Karypis and V. Kmar. A fast and high qality mltilevel scheme for partitioning irreglar graphs. Technical Report TR 95-035, University of Minnesota, Dept. of Compter Science, 1995. [17] H. Katz, B. Selman, and M. Shah. Referralweb: Combining social networks and collaborative filtering. Commnications of the ACM, 40(3):63 65, 1997. [18] B. W. Kernighan and S. Lin. An efficient heristic procedre for paritioning graphs. The Bell System Technical Jornal, pages 291 307, 19. [19] S. Milgram. The small world problem. Psychology Today, 2: 67, 1967. [20] T. Mitchell. The role of nlabeled data in spervised learning. In Proc. of the Sixth International Colloqim on Cognitive Science, San Sebastian, Spain, 1999. [21] T. M. Mitchell. Machine Learning. McGraw-Hill, 1997. [22] J. Neville and D. Jensen. Iterative classification in relational data. In Proc. AAAI-2000 Workshop on Learning Statistical Models from Relational Data. AAAI Press, 2000. [23] K. Nigam, A. K. McCallm, S. Thrn, and T. M. Mitchell. Learning to classify text from labeled and nlabeled docments. In Proc. of AAAI-98, 15th Conference of the American Association for Artificial Intelligence, pages 792 799, Madison, US, 1998. AAAI Press, Menlo Park, US. [24] B. Pang, L. Lee, and S. Vaithyanathan. Thmbs p? sentiment classification sing machine learning techniqes. In Proc. of the 2002 Conference on Empirical Methods in Natral Langage Processing (EMNLP), pages 79 86, 2002. [25] G. Salton. Atomatic Text Processing: The Transformation, Analysis, and Retrieval of Information by Compter. Addison-Wesley, New York, 1989. [26] M. F. Schwartz and D. C. M. Wood. Discovering shared interests sing graph analysis. Commnications of the ACM, 36(8):78 89, 1993. [27] D. Spielman and S. Teng. Spectral partitioning works: Planar graphs and finite element meshes. In 37th Annal Symposim on Fondations of Compter Science, 1996. [28] P. Trney. Thmbs p or thmbs down? semantic orientation applied to nspervised classification of reviews. In Proc. of the 40th Annal Meeting of the Association for Comptational Lingistics (ACL 2002), Jly 2002. [29] V. N. Vapnik. The natre of statistical learning theory. Springer Verlag, Heidelberg, DE, 1995. [30] G. K. Zipf. Hman Behavior and the Principle of Least Effort. Addison-Wesley, 1949.