Name Reference Resolution in Organizational Archives
|
|
|
- Magdalene Morris
- 10 years ago
- Views:
Transcription
1 Name Reference Resolution in Organizational Archives Christopher P. Diehl Lise Getoor Galileo Namata Abstract Online communications provide a rich resource for understanding social networks. Information about the actors, and their dynamic roles and relationships, can be inferred from both the communication content and traffic structure. A key component in the analysis of online communications such as is the resolution of name references within the body of the message. Name reference resolution relies on the context of the message; both the content of the message and the sender and recipients relationships can help to resolve a reference. Here we investigate a variety of approaches which make use of the traffic network to disambiguate name references. The traffic network serves as a proxy for inferring relationships. These relationships in turn help us infer likely candidates for the name references. Our initial findings suggest that simple temporal models can help us effectively resolve name references. For the class of models proposed, performance is maximized by exploiting long-term traffic statistics to rank candidates. 1 Introduction Within the networked world, has become a ubiquitous form of global communication. Whether communicating with friends or colleagues in a local area or halfway around the world, allows us to maintain or develop relationships with others at any distance. Given traffic is a reflection of the relationships in an underlying social network, archives present a potentially rich collection of evidence that can be used to infer the structure, attributes and dynamics of the social network. The challenge is to infer these properties from data that is often ambiguous, incomplete and context-dependent. collections contain both structured and unstructured data. The structured data or metadata indicates which parties communicated and when the communication occurred. By focusing solely on the metadata, we can identify communication patterns, but we cannot easily ascribe meaning to the underlying relationships. The unstructured data in the body of the can clarify the roles of individuals and their relationships with others. Yet without the appropriate context, an outside observer may find a message pro- Johns Hopkins Applied Physics Laboratory, Johns Hopkins Road, Laurel, MD 20723, [email protected] Computer Science Department/UMIACS, University of Maryland, College Park, MD 20742, [email protected] Computer Science Department/UMIACS, University of Maryland, College Park, MD 20742, [email protected] vides little insight. When communicating with others, people constantly rely on shared context to simplify communication. Shared context is common knowledge among individuals that allows them to use ambiguous references which are clear within the shared context. A common example of this occurs when two people refer to a mutual friend by a first name or a nickname in conversation. How s John doing today? Is he feeling better Given the topic of conversation along with the name reference, it is clear to both parties who John is. Yet to someone without knowledge of the context, the reference is meaningless. Consider the problem of exploiting name references in the body. Name references are an important element in understanding the social network. Before we can process the content in the archive and associate activities and other attributes with individuals, we need to infer the number and identities of the individuals generating the observed traffic. Each individual has two classes of references: network references and name references. Network references in the context of are simply the individual s addresses. Note that this is potentially a many-many mapping: individuals may have multiple addresses and a single address may serve more than one individual. There is also a temporal component; an individual may have one address for the time they are in one position in the company, but when they change roles within the company, perhaps moving to another division, their address may change. Name references are the various forms of an individual s given name along with their nicknames that may appear in the body. In order to define an individual s identity and draw broader connections across s in the archive, we need to be able to map both name references and network references to the individual. In this paper, we focus on the problem of mapping ambiguous name references, specifically first name references, to network references. The core challenge in this problem is identifying ways to exploit context from the archive to effectively resolve the ambiguity. We describe this in the next section. Next, we formally define the general problem of name reference resolution. 70
2 Then we discuss the types of context available that can potentially be exploited. We investigate several different approaches, which vary in the context features and temporal models used, and introduce a methodology for evaluating their performance. Finally we present results from our algorithm evaluation on the Enron archive and conclude with thoughts on future work. 2 Exploiting Context When reading , what types of context do we exploit to resolve ambiguous name references? In addition, what context does an collection offer when analyzing relationships retrospectively? Below is a list of some of the contextual cues available to us for understanding name references. The participants in the conversation The larger group of people known by the participants in the conversation and the types of relationships among them The individuals that the participants in the conversation have recently communicated with, either before or after the was sent The topic of conversation in the Recent topics of conversation among the participants and others outside the current conversation, either before or after the was sent Cues contained within other s in the thread Related name references within the current Prior knowledge linking individuals to topics of conversation This list of contextual cues is by no means exhaustive. Yet it reminds us of the two broad classes of context that captures: social context (who s talking?) and topical context (what are they talking about?). Our long term goal is to exploit both to characterize the underlying social network, as each form of context can help clarify ambiguities in the other. Yet the challenge of capturing and exploiting dynamic topical context is a significant research thrust on its own, as evidenced by the work in the topic detection and tracking community [3]. Our focus in this paper will be to investigate the discriminative power of dynamic social context. We want to first understand the performance of algorithms that leverage the patterns of communication among network references to estimate the mapping between name and network references. 3 Problem Definition Let E = {e i } be a set of addresses observed in the collection and let N = {n j } be a set of observed name references in the bodies. The set E may be extracted from the metadata, or the set may come from another source such as an employee directory, which lists individuals together with their s. The set N is the result of an entity extraction process that identifies name references within the bodies. The objective of name reference resolution is to construct a mapping from a set of observed name references N = {n j } in the collection to either ranked subsets of network references, E j, where E j E or the null network reference φ, if no network reference is sufficiently probable. The null network reference φ serves two purposes. First, it is not a given that there exists a corresponding network reference for each name reference. An collection may not contain exchanges between all individuals referenced within the bodies. Second, the entity extraction process will incorrectly declare some terms in the collection to be name references, for which there is no network reference. In both cases, the appropriate response is to map the given name reference to φ. For each name reference n j, the corresponding candidate set E j is ranked based on the context of the name reference. A scoring function g is used to compute the strength of association g(e c, n j C j ) between each candidate e c E j and n j, given the context C j associated with n j. Once all of the candidates have been scored, they are ranked in descending order and only those candidates with scores g(e c, n j C j ) > λ are retained. The most likely network reference ẽ(n j ) is either the candidate with the maximum score greater than the threshold or φ otherwise. In this paper, we explore the use of the traffic context for ranking the candidate set. We define the traffic network for a set of messages M = {m i } as follows: we have a directed hypergraph G M with the set of vertices E and hyperedges H = {(e si, E ri, t i )}. For each message m i, there is a hyperedge from the sender network reference e si, e si E, to the set of recipients of the message, E ri E. The attribute t i is the time at which the was sent. 4 Name Reference Resolution Process The general name resolution process is composed of three phases: candidate set generation, candidate ranking and candidate rejection illustrated in figure 1. Given we envision a data analyst reviewing the candidate associations in rank order to identify the true referent, our overall goal is to minimize the number of candidates the user must evaluate while identifying as many true net- 71
3 Fr: James S. To: Christi N. Date: Oct Subject: FW: Dynegy v. ComEd update Susan and I are on the case. Are you still in DC? Christi N. Susan L. Donna F. Susan B. Rogest H. James S. Susan P. Scored Candidates List: Susan L. Susan P. Susan B. Minimum Score Threshold Final Candidates List: Susan L. Susan P. Figure 1: The name reference resolution process work references as possible. When the true referent is a member of the candidate set, we want the algorithm to rank the true network reference as high as possible. Given the true referent may not be part of the candidate set at all, we also want to reject as many candidates as possible without severely impacting recall. 4.1 Candidate Set Generation The role of the candidate set is to restrict our attention to a small number of likely candidates prior to scoring the candidates. In our initial approach, we use two levels of screening. We begin with the strong assumption that if any communication has occurred between the true referent and the participants, the sender was involved. Therefore we initially restrict the candidate set to those network references where at least one communication has been observed with the sender. Although we expect this assumption will be true in many cases, there are clearly instances where it will break down. For example, not all name references correspond to individuals that the sender knows personally. Within the context of an organization, references may be made to individuals many levels removed in the management hierarchy. It is also not a given that an active relationship will be observable through communication. The parties involved may be in close physical proximity allowing direct communication or may use other means of communication. A third possibility is that the communications are simply not available in the collection for one of a variety of reasons. Regardless of these factors, as we show in the results section, we are able to achieve suprisingly high recall. Our second level of screening relies on available name information for the network references. We assume that some name information is initially available either from the name tags attached to addresses or from the addresses themselves. In our initial experiments, we examine name references that match exactly at least the first or last name associated with the candidate network reference. Clearly this constraint can be relaxed by employing a string comparison function to look for close name matches. 4.2 Candidate Scoring As mentioned earlier, our main interest is in defining and evaluating candidate scoring functions that leverage dynamic social context. If we begin with the presumption that name reference usage is often connected to events occurring around the time of the reference, the question is what fraction of the name references can we resolve by ranking candidates based on the level of traffic around the time of the reference? To explore this, we introduce a class of scoring functions and explore the sensitivity of their performance along four general dimensions. The relationships examined The time scale at which the traffic is viewed The summary statistic used to characterize relationship activity during a given time interval The degree of traffic history considered We consider each of these dimensions next and then describe two temporal models which make use of features defined according to these dimensions Relationships Given our assumption of direct communication between at least the sender and the true referent, our objective is to characterize the degree of communication between the participants, the sender and recipients E p = e s E ri, and the candidate network reference e c. Specifically we consider models that exploit either solely the traffic between the sender e s and the candidate e c (denoted sender-only) or models that exploit the pairwise traffic between all the participants, sender and recipients, and the candidate e c (denoted sender+recipients). When integrating traffic from the sender and recipients pairwise interactions with the candidate, we want to understand the relative discrimination power offered by each and identify summary 72
4 statistics that effectively leverage the relationships for candidate scoring Time Scale To examine the traffic at a given time scale, we first partition the time axis into regular intervals of duration t. The phase of the partition is fixed by first selecting a reference time t 0 such that t 0 t r < t 0 + t where t r is the time of the containing the name reference. The time intervals {T k } are defined as T k = {t : t 0 + k t t < t 0 + (k + 1) t, k Z} so that the time interval T 0 includes the time t r of the 1. In our experiments, we investigate daily and weekly time intervals (denoted daily and weekly). The weekly time intervals are phased such that they begin on Sunday Summary Statistics Once the time axis is partitioned into regular intervals, our next step is to compute a summary statistic or feature s(e p, e c, T k, G M ) for each interval T k that provides an indication of relationship activity among some or all of the participants E p and the candidate e c. We consider the following variations on computing the statistic: Binary versus Count. For any pair of network references, for the given interval, we may either have a 0/1 indicator which denotes whether or not there has been an exchange between the pair (denoted binary) or we may want to use the frequency information and keep track of the number of messages exchanged (denoted count). Unidirectional versus Bidirectional. For any network reference, we may be interested in only the messages sent from the network reference to the candidate reference (denoted unidirectional) or we may be interested in bidirectional exchanges where the candidate and network references can take on either the sender or recipient roles (denoted bidirectional) As mentioned earlier, we can distinguish models which make use of the sender-only traffic information versus the sender+recipients traffic information. In the latter case, we introduce the parameter β to weight the sender versus recipient contributions. Table 1 summarizes the statistics used in the experiments Integrating Traffic History The final step in computing the candidate score g(e c, n C) given the context C = {E p, T 0, G M } involves integrating the 1 Although the definition of T 0 is dependent on the of interest, we will not explicitly indicate this dependence to avoid additional complexity in the notation. summary statistics for time intervals around the time of the containing the name reference. We compute the local time average of the summary statistics using either a non-causal autoregressive (denoted AR) or moving average filter (denoted MA) that incorporates both future and past traffic patterns around the time of the name reference. The autoregressive filter is defined as g AR (e c, n E p, T k, G M ) = (1 α) 2 g AR (e c, n E p, T k 1, G M ) + (1 α) 2 g AR (e c, n E p, T k+1, G M ) + αs(e p, e c, T k, G M ) (1 α) i (s(e p, e c, T k i, G M ) = α 2 i=0 +s(e p, e c, T k+i, G M )) while the moving average filter is defined as g MA (e c, n E p, T k, G M ) = 1 M s(e p, e c, T k i, G M ). 2M+1 i= M In practice, when evaluating the AR filter, we terminate the summation once a convergence criterion is met. The degree of traffic history incorporated into the candidate score g(e c, n E p, T 0, G M ) is controlled by the parameters α for the AR model and M for the MA model. 4.3 Candidate Rejection Once the scores have been computed for all network references in the candidate set, the candidates with a score below the specified threshold λ are removed from the candidate set. The objective of candidate rejection is to remove candidates that are deemed unlikely to correspond to name references without rejecting a significant fraction of true referents. The degree of performance achieved is dependent on the ability of the scoring function to separate the true referents from the other candidates. Within the context of the models proposed above, performance clearly depends on the following two factors. First, it is dependent on the legitimacy of the general assumption that a high degree of communication activity around the time of the name reference is indicative of a potential correspondence between a name and network reference. Second, performance is also dependent on the model s characterization of what qualifies as a high degree of traffic. All relationships are clearly not equivalent. Yet our baseline models do not attempt to capture external factors influencing the relationship activity. We will revisit these issues in later discussion. 73
5 Table 1: Summary statistic definitions. m(e 1, e 2, T k, G M ) is the number of messages sent from network reference e 1 to network reference e 2 over the time interval T k. I( ) is the indicator function. Name Definition Binary, Sender-Only, Bidirectional I(m(e s, e c, T k, G M ) + m(e c, e s, T k, G M )) Count, Sender-Only, Bidirectional m(e s, e c, T k, G M ) + m(e c, e s, T k, G M ) Count, Sender-Only, Unidirectional m(e s, e c, T k, G M ) Count, Sender+Recipients, Bidirectional(β) (1 β)(m(e s, e c, T k, G M ) + m(e c, e s, T k, G M ))+ e ri E ri (m(e ri, e c, T k, G M ) + m(e c, e ri, T k, G M )) β E ri 5 Experiment Design With a set of models defined, the next major task at hand is evaluating their performance on a representative dataset. The bulk of our efforts to date have focused on data preparation, ground truth generation and definition of evaluation protocols. A number of subtle but important issues arise as one considers the various elements of the overall experiment. We review all aspects of the approach in the following sections. 5.1 Dataset Preparation The Data: Enron Corpus With the recent release of the Enron dataset [20], researchers have been given a unique opportunity to glimpse inside a large corporation and observe a subset of traffic among the employees. The Enron dataset is the collection of from the folders of 151 Enron employees. The data is available in several forms. CMU first released the original data. Since then USC/ISI and more recently UC Berkeley have released normalized forms of the data in a MySQL database. Our results are based on the USC/ISI version of the dataset. There are over 250,000 messages in the dataset with the majority of the traffic occurring in the time frame. We initially chose to resolve name references in only those s exchanged between the core 151 employees. This was done primarily to reduce confounding effects of observability in our experiments. Given we can only observe pairwise relationships where at least one of the participants is a member of the set of 151 employees, constraining the set of s in this way guarantees that all relationships we consider in the resolution process are observable in the collection, assuming s haven t been lost or deleted. There are s in the ISI database that were exchanged among the 151 employees. A non-trivial number of duplicate s exist that were removed to avoid skewing the results of the analysis. After deduplication of this set, s remain. This is the set of s from which the name references were extracted Extracting Enron Employee Names To support named entity extraction and candidate set generation, we constructed a network reference set E of 7864 Enron addresses and a corresponding list of employee names by parsing the addresses. In total, there are 29,176 enron.com addresses in the collection. This includes employee addresses along with group mailing lists. Given the most common address format often corresponding to employees is < name1 >. < name2 we parsed these addresses and saved only those where either name1 or name2 matched a first or last name in the employeelist table in the ISI database. This reduced the list to addresses that are distinct from the addresses listed for the 151 employees in the ISI database. As others have noted, some employees have multiple addresses in the collection. We believe that in most cases this is due to an employee moving within the company. Therefore each address and its associated relationship structure characterizes the employee s role over a certain time period in the company. We chose not to deduplicate the addresses in order to preserve this context Constructing the Traffic Network The hypergraph G M representing the traffic network captures the observed traffic exchanged between the 7864 Enron addresses in E. Since the Enron collection is the union of folders corresponding to the given 151 Enron addresses, G M only captures the traffic exchanged between those 151 addresses and the remaining addresses in E. There are s in the ISI database that were exchanged among the addresses. After deduplication of this set, s remain. Therefore G M is composed of 7864 nodes and hyperedges Detecting Name References To detect name references in the bodies, we initially scan 74
6 through the s searching for words that match exactly one or both first and last names of an employee on the list of 7864 Enron employee names. We also merge adjacent partial name matches, assuming in most cases this results in a full name not listed on the employee list. For our initial experiments, we chose to focus on resolving first name references to others outside of the conversation. Therefore to filter out name references not of interest, we saved only partial name detections that matched one of the 151 Enron employee first names. Then we filtered out first name references at the beginning and end of the text, assuming those are references to the sender and recipients. 5.2 Ground Truth Generation To evaluate algorithm performance, we manually identified the true network references associated with a set of first name references. In some cases, the true referent was obvious from other name references in the sender s message or the attached message. In others, we needed to search through the traffic to find other s in the thread or previous conversations to clarify the reference. When multiple addresses appear to correspond to the referenced individual, the address in use around the time of the name reference is chosen as the true referent. After this processing, we have 84 labelled first name references with candidate sets of size 2 or greater. Of these, 54 have candidate sets that contain the true referent. A number of first name references with no obvious context in the message could not be resolved after further searching of the collection. 5.3 Performance Evaluation When evaluating the performance of a given scoring function, we have two objectives. First, we want to understand how well the scoring function ranks the true referent relative to other candidates on average in a candidate set. We refer to this as the relative ranking performance of the scoring function. Second, we want to characterize the ability of the scoring function to rank true referents higher than other candidates in general across candidate sets. We refer to this as the absolute ranking performance of the scoring function. We consider each evaluation task in the following sections Relative Ranking Performance To provide insights into relative ranking performance, three performance metrics are evaluated for each scoring function. First, we compute the rank 1 rate (R1R) which is the fraction of candidate sets containing true referents over which the true referent is the top ranked candidate. This is expressed as R1 R = 1 N t n N t I (ẽ(n) = e true (n)) where I( ) is the indicator function, e true (n) is the true network reference associated with the name reference n and N t = {n : n N, e true (n) E(n)} is the set of name references with the true referent in the corresponding candidate sets. Note the R1R is computed assuming no candidate rejection. The rank 1 rate provides an intuitive summary of performance, but can be misleading in this context given the variable sized candidate sets. Therefore to establish a relative baseline, we compute the expected value of the random rank 1 rate (RR1R) achieved by random selection of the top ranked candidate from each candidate set. This is expressed as RR1 R = 1 N t 1 E(n). n N t Since the rank 1 rate gives no indication of how severe the failure is when the true referent is not rank 1, we also compute a metric we refer to as the average true referent rank (ATRR). The ATRR is the average of the ratio of the true referent rank and the candidate set size. This is expressed as AT RR = 1 N t 1 E(n) n N t E(n) k=1 ki ( e (k) (n) = e true (n) ) where e (k) (n) is the network reference with rank k in the candidate set E(n). Each true referent rank is normalized by the corresponding candidate set size to account for the variation in the number of candidates and reduce the sensitivity of the measure to large candidate sets Absolute Ranking Performance Assessing the absolute ranking performance involves evaluating the scoring function s ability to rank true referents higher than other candidates across all candidates nominated for a given set of name references. Our interest in characterizing ranking performance from this perspective stems from our desire to reject as many candidates as possible without a significant loss of true referents. If the scoring function is able to separate the two classes of candidates with reasonable success, we will achieve our aim. A natural measure of ranking performance advocated in the literature [2, 5, 6, 8] is the area under the receiver operating characteristic (ROC) curve. The 75
7 ROC curve is a standard depiction of a detector s performance from classical signal detection theory, showing the detector s true positive rate versus false positive rate [22]. The area under the ROC curve (AUC) provides a measure of the separability achieved by the detector between the two classes. More specifically, the empirical AUC is an estimate of the probability that the detector will rank a randomly selected positive example higher than a randomly selected negative example, assuming all ties are broken uniformly at random [2]. When the AUC=1.0, perfect separability is achieved. When the AUC=0.5, the detector performs no better than random chance. If one defines the true referents to be the positive class and the other candidates to be the negative class, the AUC of the scoring function is the area under the empirical ROC curve generated by sweeping the threshold over the range of scores and computing the (false positive rate,true positive rate) operating points on the curve. This empirical AUC can be directly expressed in the following manner 1 AUC = N T R N OC n 1 N t n 2 N e oc E(n 2)/e true(n 2) I (g(e true (n 1 ), n 1 C 1 ) > g(e oc, n 2 C 2 )) I (g(e true(n 1 ), n 1 C 1 ) = g(e oc, n 2 C 2 )) where N T R = N t is the number of true referents and N OC = n N E(n)/e true(n) is the number of other candidates overall [2]. 6 Discussion We now examine the performance of the various scoring functions on the labelled name reference data. Figures 2-4 present a series of summary plots showing the rank 1 rates, average true referent ranks and AUCs of the scoring functions as a function of the amount of traffic history considered. Consider first the R1R and ATRR metrics measuring relative ranking performance. For all models, as the filter duration is increased 2, incorporating more traffic history into the scoring process, the relative ranking performance generally increases and approaches a maximal level of performance. In terms of rank 1 rate, the performance of these simple models approaches 0.8 in most cases with sufficient history and significantly outperforms the random selection baseline. Finer level 2 The duration of the MA filter is simply the number of time intervals over which the filter averages the summary statistic. We have defined the duration of the AR filter to be twice the number of time intervals required for the impulse response of the filter to decay to 10% of its peak response. distinctions among the models can not yet be made; if one assumes the name reference resolutions are independent, there is no statistically significant difference in performance among the models considered. At the beginning of this investigation, our expectation was that the traffic patterns around the time of a given name reference would clarify the identity of the true referent. For the models we have proposed, the results suggest quite the opposite: ranking performance is maximized when we consider the long-term traffic patterns over one year or longer. This implies that by simply examining prior relationship strengths, as represented by the volume of communication, we are able to successfully resolve a majority of first name references. In hindsight, this result is not very surprising. While the use of a name reference in a given may be related to an ongoing event, it is not clear that we should also expect a notable increase in communication between the true referent and the sender around that time. For example, it is possible for a single from the true referent to the sender to be the impetus for the containing the name reference. Meanwhile, the sender may be engaged in longer threads of conversation with other potential candidates, thereby leading to a low rank for the true referent. In these scenarios, examining the content will be critical to correctly resolve the reference. Now let us consider the AUC metric measuring absolute ranking performance. In contrast to the relative ranking results, we see a significant distinction between the sender-only models and the sender+recipients models. As the influence of the relationships between the recipients and the candidate is increased, the AUC curve continues to shift lower indicating that separability between the true referents and the other candidates is decreasing across all filter durations. At first glance, it may seem that the trends for the relative and absolute ranking performance measures are inconsistent. Why should the relative ranking performance be fairly insensitive to the influence of the recipients while the absolute ranking performance is much more so? This result suggests that while the relative rankings are not changing significantly, the variances of the true referent and other candidate score distributions are increasing, causing the decrease in separability. This is not surprising for a number of reasons. As we add more pairwise relationships, we introduce more opportunities for the candidate s score to be falsely inflated. Multiple active pairwise relationships do not necessarily signify higher order dependencies between the relationships. Another factor is that not all users are equivalent. Some users are more prolific composers than others, leading 76
8 to score inflation once again that is misleading. Further investigation is needed to explore these issues. Clearly we have only begun to explore the full scope of the name reference resolution problem. In this initial experiment, we have focused on first name references and limited our search for the true referent to those candidates that the sender has communicated with. Even with first name references, it is possible that a given reference corresponds to an individual that the sender never communicates with. As discussed earlier, this can occur for a variety of reasons. Most interesting are the cases where the references are to individuals that the sender knows of but does not have a relationship with. To deal with these challenges, a more sophisticated approach is required. Another important aspect of the resolution task to note is the dependence among name references within a given and across s. Our belief about one name reference can certainly impact and inform the resolution of other related references. Therefore it is valuable to incorporate such connections into the resolution process. To summarize, we see that simple temporal traffic models perform surprisingly well for name reference resolution when considering the long-term traffic patterns. To push the performance beyond a certain level, we expect that both content and traffic patterns will need to be exploited. We are in the process of generating more labelled name references to support experimentation aimed at better understanding the limits of traffic-based approaches. 7 Related Work This paper uses a social network generated from the traffic of the Enron data set as a tool for name reference resolution. In this section, we describe some of the relevant related work on social networks, the Enron data set and entity resolution. 7.1 Social Networks There has been a great deal of recent work in social network generation, analysis and mining. Using semantic associations from communication, for example, McArthur and Bruza [17] propose methods of generating a social network using implicit and explicit connections between people. Liben-Nowell and Kleinberg [15] use co-authorship to create social networks to predict future interactions among members of a given social network. Studies have also been done on creating and mining social networks to identify possible collaborators for a given problem [18, 11] and clustering people of similar interests [19]. Schwartz and Wood generate a social network using the to and from fields of messages to discover users of a particular interest and field. 7.2 Enron The release of the Enron data set in 2003 provided an unprecedented collection of s from a major organization for use in research. Klimt and Yang [14] provides an overview of this corpus including the number of employees, the number of s and a representative social network derived from the traffic. Moreover, they used the Enron data to explore methods of classification [13]. Corrada-Emmanuel [4] created MD5 hashes of the Enron s and contact information to identify and deduplicate s. Using the structure of the s, Keila and Skillicorn [12] found a relationship in the word use pattern with message length as well as relationships among individuals. Skillicorn [21] further demonstrated methods to detect unusual and deceptive communications. Diesner and Carley [7] used analysis of the social network patterns over time to explore crisis detection in . Moreover, a number of useful tools have also been developed in order to navigate and view archives [9]. 7.3 Entity Resolution in There has been limited work in named entity resolution in systems. Abadi [1] uses s from an online retailer for anaphora resolution within orders. Abadi s research, however, is designed for the resolution of pronouns referring to product orders rather than individuals and relies mainly on NLP for resolution. Holzer, Malin and Sweeney [10] on the other hand use social networks created from online resources like personal websites to resolve aliases. Their approach of using social networks derived from relations from other sources, including proximity of references in a given web site, is particularly effective in controlled environments such as the university used in their evaluation. Of note, is Malin s evaluation of methods of disambiguation in relational environments [16]. Although Malin s work used actor collaborations in the Internet Movie Database rather than , Malin did find that methods which leverage community, in contrast to exact similarity provide more robust disambiguation capability, supporting our approach to the problem. 8 Conclusion In this paper, we have examined ways in which traffic can be used to resolve ambiguous name references within the body of the messages. Our contributions include a formal statement of the problem; the definition of the resolution process in terms of candidate generation, candidate scoring and candidate rejection; and an evaluation methodology that examines both absolute and relative rank. We have presented an initial suite of models for candidate scoring, which exploit both role and temporal information, and evalu- 77
9 ated their performance on a real-world corporate archive, the Enron collection. Surprisingly, such models perform well by simply ranking candidates based on long-term traffic statistics, a natural surrogate for relationship strength. Our overall goal is to develop robust ways of exploiting context information during the resolution process. The traffic network is just one element of the context information we hope to eventually exploit. References [1] D. Abadi. Comparing domain-specific and nondomain-specific anaphora resolution techniques. Master s thesis, Cambridge University MPhil Dissertation, [2] S. Agarwal, T. Graepel, R. Herbrich, and D. Roth. Generalization bounds for the area under the ROC curve. Journal of Machine Learning Research, 6: , [3] J. Allan. Topic Detection and Tracking. Kluwer Academic Pub, [4] A. Corrada-Emmanuel. Enron dataset research, corrada/enron. [5] C. Cortes and M. Mohri. AUC optimization versus error rate minimization. Advances in Neural Information Processing Systems, 16, [6] C. Cortes and M. Mohri. Confidence intervals for the area under the ROC curve. Advances in Neural Information Processing Systems, 17, [7] J. Diesner and K. Carley. Exploration of communication networks from the Enron corpus. In Proc. of Workshop on Link Analysis, Counterterrorism and Security, SIAM International Conference on Data Mining 2005, Newport Beach, CA, USA, April [8] Y. Freund, R. Iyer, R. Shapire, and Y. Singer. An efficient boosting algorithm for combining preferences. Journal of Machine Learning Research, 4: , [9] J. Heer. Exploring Enron: Visualizing ANLP results, [10] R. Holzer, B. Malin, and L. Sweeney. alias detection using social network analysis. In Proceedings of the ACM SIGKDD Workshop on Link Discovery: Issues, Approaches, and Applications, Chicago, Illinois, USA, August [11] H. Kautz, B. Selman, and M. Shah. Referral web: Combining social networks and collaborative filtering. Communications of the ACM, 40(3):63 65, [12] P. Keila and D. Skillicorn. Structure in the Enron dataset. In Workshop on Link Analysis, Security and Counterterrorism, SIAM International Conference on Data Mining, pages 55 64, [13] B. Klimt and Y. Yang. The Enron corpus: a new dataset for classification research. In European Conference on Machine Learning/PKDD, Pisa, Italy, September [14] B. Klimt and Y. Yang. Introducing the Enron corpus. In Conference on and Anti-Spam, Mountainview, CA, USA, July [15] D. Liben-Nowell and J. Kleinberg. The link prediction problem for social networks. In Proc. 12th International Conference on Information and Knowledge Management (CIKM), New Orleans, Louisiana, USA, November [16] B. Malin. Unsupervised name disambiguation via social network similarity. In SIAM International Conference on Data Mining, Newport Beach, CA, USA, April [17] R. McArthur and P. Bruza. Discovery of implicit and explicit connections between people using utterance. In Proceedings of the Eighth European Conference of Computer-supported Cooperative Work, Helsinki, Helsinki, Finland, September [18] H. Ogata and Y. Yano. Collecting organizational memory based on social networks in collaborative learning. In WebNet, pages , [19] M. Schwartz and D. Wood. Discovering shared interests among people using graph analysis of global electronic mail traffic. Communications of the ACM, 36:78 89, [20] J. Shetty and J. Adibi. The Enron dataset: Database schema and brief statistical report. adibi/enron/ Enron Dataset Report.pdf. [21] D. Skillicorn. Detecting unusual and deceptive communication in . In Centers for Advanced Studies Conference, Richmond Hill, Ontario, Canada, October [22] H. V. Trees. Detection, Estimation, and Modulation Theory. John Wiley and Sons,
10 Autoregressive Filter, Daily Interval, RR1R = Autoregressive Filter, Daily Interval Rank One Rate Average True Referent Rank Duration (Days) Duration (Days) (a) (b) Autoregressive Filter, Daily Interval AUC Duration (Days) (c) Figure 2: Autoregressive Filter Performance: (a) Daily Interval Rank 1 Rates, (b) Average True Referent Ranks and (c) Areas Under the ROC Curves 79
11 Autoregressive Filter, Weekly Interval, RR1R = Autoregressive Filter, Weekly Interval Rank One Rate Average True Referent Rank Duration (Weeks) Duration (Weeks) (a) (b) Autoregressive Filter, Weekly Interval AUC Duration (Weeks) (c) Figure 3: Autoregressive Filter Performance: (a) Weekly Interval Rank 1 Rates, (b) Average True Referent Ranks and (c) Areas Under the ROC Curves 80
12 Moving Average Filter, Daily Interval, RR1R = Moving Average Filter, Weekly Interval, RR1R = Rank One Rate Rank One Rate Duration (Days) Duration (Weeks) (a) (d) Moving Average Filter, Daily Interval Moving Average Filter, Weekly Interval Average True Referent Rank Average True Referent Rank Duration (Days) Duration (Weeks) (b) (e) Moving Average Filter, Daily Interval Moving Average Filter, Weekly Interval AUC AUC Duration (Days) Duration (Weeks) (c) (f) Figure 4: Moving Average Filter Performance: (a) Daily Interval Rank 1 Rates, (b) Average True Referent Ranks and (c) Areas Under the ROC Curves (d) Weekly Interval Rank 1 Rates, (e) Average True Referent Ranks and (f) Areas Under the ROC Curves 81
How To Find An Alias On Email From A Computer (For A Free Download)
Email Alias Detection Using Social Network Analysis Ralf Hölzer Information Networking Institute Carnegie Mellon University Pittsburgh, PA 1513 [email protected] Bradley Malin School of Computer Science
Neovision2 Performance Evaluation Protocol
Neovision2 Performance Evaluation Protocol Version 3.0 4/16/2012 Public Release Prepared by Rajmadhan Ekambaram [email protected] Dmitry Goldgof, Ph.D. [email protected] Rangachar Kasturi, Ph.D.
Making the Most of Missing Values: Object Clustering with Partial Data in Astronomy
Astronomical Data Analysis Software and Systems XIV ASP Conference Series, Vol. XXX, 2005 P. L. Shopbell, M. C. Britton, and R. Ebert, eds. P2.1.25 Making the Most of Missing Values: Object Clustering
The Enron Corpus: A New Dataset for Email Classification Research
The Enron Corpus: A New Dataset for Email Classification Research Bryan Klimt and Yiming Yang Language Technologies Institute Carnegie Mellon University Pittsburgh, PA 15213-8213, USA {bklimt,yiming}@cs.cmu.edu
Chapter 20: Data Analysis
Chapter 20: Data Analysis Database System Concepts, 6 th Ed. See www.db-book.com for conditions on re-use Chapter 20: Data Analysis Decision Support Systems Data Warehousing Data Mining Classification
So today we shall continue our discussion on the search engines and web crawlers. (Refer Slide Time: 01:02)
Internet Technology Prof. Indranil Sengupta Department of Computer Science and Engineering Indian Institute of Technology, Kharagpur Lecture No #39 Search Engines and Web Crawler :: Part 2 So today we
131-1. Adding New Level in KDD to Make the Web Usage Mining More Efficient. Abstract. 1. Introduction [1]. 1/10
1/10 131-1 Adding New Level in KDD to Make the Web Usage Mining More Efficient Mohammad Ala a AL_Hamami PHD Student, Lecturer m_ah_1@yahoocom Soukaena Hassan Hashem PHD Student, Lecturer soukaena_hassan@yahoocom
Relationship Identification for Social Network Discovery
Relationship Identification for Social Network Discovery Christopher P. Diehl Applied Physics Laboratory Johns Hopkins University Laurel, MD 20723 Galileo Namata and Lise Getoor Computer Science Department
SOCIAL NETWORK ANALYSIS EVALUATING THE CUSTOMER S INFLUENCE FACTOR OVER BUSINESS EVENTS
SOCIAL NETWORK ANALYSIS EVALUATING THE CUSTOMER S INFLUENCE FACTOR OVER BUSINESS EVENTS Carlos Andre Reis Pinheiro 1 and Markus Helfert 2 1 School of Computing, Dublin City University, Dublin, Ireland
STATISTICA. Clustering Techniques. Case Study: Defining Clusters of Shopping Center Patrons. and
Clustering Techniques and STATISTICA Case Study: Defining Clusters of Shopping Center Patrons STATISTICA Solutions for Business Intelligence, Data Mining, Quality Control, and Web-based Analytics Table
Semantic Search in Portals using Ontologies
Semantic Search in Portals using Ontologies Wallace Anacleto Pinheiro Ana Maria de C. Moura Military Institute of Engineering - IME/RJ Department of Computer Engineering - Rio de Janeiro - Brazil [awallace,anamoura]@de9.ime.eb.br
Social Media Mining. Data Mining Essentials
Introduction Data production rate has been increased dramatically (Big Data) and we are able store much more data than before E.g., purchase data, social media data, mobile phone data Businesses and customers
STATISTICA. Financial Institutions. Case Study: Credit Scoring. and
Financial Institutions and STATISTICA Case Study: Credit Scoring STATISTICA Solutions for Business Intelligence, Data Mining, Quality Control, and Web-based Analytics Table of Contents INTRODUCTION: WHAT
Big Data Analytics of Multi-Relationship Online Social Network Based on Multi-Subnet Composited Complex Network
, pp.273-284 http://dx.doi.org/10.14257/ijdta.2015.8.5.24 Big Data Analytics of Multi-Relationship Online Social Network Based on Multi-Subnet Composited Complex Network Gengxin Sun 1, Sheng Bin 2 and
IBM SPSS Direct Marketing 23
IBM SPSS Direct Marketing 23 Note Before using this information and the product it supports, read the information in Notices on page 25. Product Information This edition applies to version 23, release
MALLET-Privacy Preserving Influencer Mining in Social Media Networks via Hypergraph
MALLET-Privacy Preserving Influencer Mining in Social Media Networks via Hypergraph Janani K 1, Narmatha S 2 Assistant Professor, Department of Computer Science and Engineering, Sri Shakthi Institute of
IBM SPSS Direct Marketing 22
IBM SPSS Direct Marketing 22 Note Before using this information and the product it supports, read the information in Notices on page 25. Product Information This edition applies to version 22, release
Mining the Software Change Repository of a Legacy Telephony System
Mining the Software Change Repository of a Legacy Telephony System Jelber Sayyad Shirabad, Timothy C. Lethbridge, Stan Matwin School of Information Technology and Engineering University of Ottawa, Ottawa,
Introducing diversity among the models of multi-label classification ensemble
Introducing diversity among the models of multi-label classification ensemble Lena Chekina, Lior Rokach and Bracha Shapira Ben-Gurion University of the Negev Dept. of Information Systems Engineering and
Data Mining - Evaluation of Classifiers
Data Mining - Evaluation of Classifiers Lecturer: JERZY STEFANOWSKI Institute of Computing Sciences Poznan University of Technology Poznan, Poland Lecture 4 SE Master Course 2008/2009 revised for 2010
Knowledge Discovery from patents using KMX Text Analytics
Knowledge Discovery from patents using KMX Text Analytics Dr. Anton Heijs [email protected] Treparel Abstract In this white paper we discuss how the KMX technology of Treparel can help searchers
Web Document Clustering
Web Document Clustering Lab Project based on the MDL clustering suite http://www.cs.ccsu.edu/~markov/mdlclustering/ Zdravko Markov Computer Science Department Central Connecticut State University New Britain,
Reputation Network Analysis for Email Filtering
Reputation Network Analysis for Email Filtering Jennifer Golbeck, James Hendler University of Maryland, College Park MINDSWAP 8400 Baltimore Avenue College Park, MD 20742 {golbeck, hendler}@cs.umd.edu
Performance Metrics for Graph Mining Tasks
Performance Metrics for Graph Mining Tasks 1 Outline Introduction to Performance Metrics Supervised Learning Performance Metrics Unsupervised Learning Performance Metrics Optimizing Metrics Statistical
Research Statement Immanuel Trummer www.itrummer.org
Research Statement Immanuel Trummer www.itrummer.org We are collecting data at unprecedented rates. This data contains valuable insights, but we need complex analytics to extract them. My research focuses
Cluster Analysis for Evaluating Trading Strategies 1
CONTRIBUTORS Jeff Bacidore Managing Director, Head of Algorithmic Trading, ITG, Inc. [email protected] +1.212.588.4327 Kathryn Berkow Quantitative Analyst, Algorithmic Trading, ITG, Inc. [email protected]
IMPROVING DATA INTEGRATION FOR DATA WAREHOUSE: A DATA MINING APPROACH
IMPROVING DATA INTEGRATION FOR DATA WAREHOUSE: A DATA MINING APPROACH Kalinka Mihaylova Kaloyanova St. Kliment Ohridski University of Sofia, Faculty of Mathematics and Informatics Sofia 1164, Bulgaria
SPECIAL PERTURBATIONS UNCORRELATED TRACK PROCESSING
AAS 07-228 SPECIAL PERTURBATIONS UNCORRELATED TRACK PROCESSING INTRODUCTION James G. Miller * Two historical uncorrelated track (UCT) processing approaches have been employed using general perturbations
Data Mining on Social Networks. Dionysios Sotiropoulos Ph.D.
Data Mining on Social Networks Dionysios Sotiropoulos Ph.D. 1 Contents What are Social Media? Mathematical Representation of Social Networks Fundamental Data Mining Concepts Data Mining Tasks on Digital
KEYWORD SEARCH OVER PROBABILISTIC RDF GRAPHS
ABSTRACT KEYWORD SEARCH OVER PROBABILISTIC RDF GRAPHS In many real applications, RDF (Resource Description Framework) has been widely used as a W3C standard to describe data in the Semantic Web. In practice,
Three types of messages: A, B, C. Assume A is the oldest type, and C is the most recent type.
Chronological Sampling for Email Filtering Ching-Lung Fu 2, Daniel Silver 1, and James Blustein 2 1 Acadia University, Wolfville, Nova Scotia, Canada 2 Dalhousie University, Halifax, Nova Scotia, Canada
Azure Machine Learning, SQL Data Mining and R
Azure Machine Learning, SQL Data Mining and R Day-by-day Agenda Prerequisites No formal prerequisites. Basic knowledge of SQL Server Data Tools, Excel and any analytical experience helps. Best of all:
Improving the Performance of Data Mining Models with Data Preparation Using SAS Enterprise Miner Ricardo Galante, SAS Institute Brasil, São Paulo, SP
Improving the Performance of Data Mining Models with Data Preparation Using SAS Enterprise Miner Ricardo Galante, SAS Institute Brasil, São Paulo, SP ABSTRACT In data mining modelling, data preparation
MapReduce Approach to Collective Classification for Networks
MapReduce Approach to Collective Classification for Networks Wojciech Indyk 1, Tomasz Kajdanowicz 1, Przemyslaw Kazienko 1, and Slawomir Plamowski 1 Wroclaw University of Technology, Wroclaw, Poland Faculty
Monitoring Corporate Password Sharing Using Social Network Analysis
Monitoring Corporate Password Sharing Using Social Network Analysis Andrew S. Patrick NRC Canada Paper to be presented at the International Sunbelt Social Network Conference, St. Pete Beach, Florida, January
Crowdclustering with Sparse Pairwise Labels: A Matrix Completion Approach
Outline Crowdclustering with Sparse Pairwise Labels: A Matrix Completion Approach Jinfeng Yi, Rong Jin, Anil K. Jain, Shaili Jain 2012 Presented By : KHALID ALKOBAYER Crowdsourcing and Crowdclustering
Categorical Data Visualization and Clustering Using Subjective Factors
Categorical Data Visualization and Clustering Using Subjective Factors Chia-Hui Chang and Zhi-Kai Ding Department of Computer Science and Information Engineering, National Central University, Chung-Li,
Recognizing Informed Option Trading
Recognizing Informed Option Trading Alex Bain, Prabal Tiwaree, Kari Okamoto 1 Abstract While equity (stock) markets are generally efficient in discounting public information into stock prices, we believe
Comparing the Results of Support Vector Machines with Traditional Data Mining Algorithms
Comparing the Results of Support Vector Machines with Traditional Data Mining Algorithms Scott Pion and Lutz Hamel Abstract This paper presents the results of a series of analyses performed on direct mail
Generating SQL Queries Using Natural Language Syntactic Dependencies and Metadata
Generating SQL Queries Using Natural Language Syntactic Dependencies and Metadata Alessandra Giordani and Alessandro Moschitti Department of Computer Science and Engineering University of Trento Via Sommarive
virtual class local mappings semantically equivalent local classes ... Schema Integration
Data Integration Techniques based on Data Quality Aspects Michael Gertz Department of Computer Science University of California, Davis One Shields Avenue Davis, CA 95616, USA [email protected] Ingo
Practical Data Science with Azure Machine Learning, SQL Data Mining, and R
Practical Data Science with Azure Machine Learning, SQL Data Mining, and R Overview This 4-day class is the first of the two data science courses taught by Rafal Lukawiecki. Some of the topics will be
Network Big Data: Facing and Tackling the Complexities Xiaolong Jin
Network Big Data: Facing and Tackling the Complexities Xiaolong Jin CAS Key Laboratory of Network Data Science & Technology Institute of Computing Technology Chinese Academy of Sciences (CAS) 2015-08-10
A Comparison of General Approaches to Multiprocessor Scheduling
A Comparison of General Approaches to Multiprocessor Scheduling Jing-Chiou Liou AT&T Laboratories Middletown, NJ 0778, USA [email protected] Michael A. Palis Department of Computer Science Rutgers University
A Two-Pass Statistical Approach for Automatic Personalized Spam Filtering
A Two-Pass Statistical Approach for Automatic Personalized Spam Filtering Khurum Nazir Junejo, Mirza Muhammad Yousaf, and Asim Karim Dept. of Computer Science, Lahore University of Management Sciences
An Overview of Knowledge Discovery Database and Data mining Techniques
An Overview of Knowledge Discovery Database and Data mining Techniques Priyadharsini.C 1, Dr. Antony Selvadoss Thanamani 2 M.Phil, Department of Computer Science, NGM College, Pollachi, Coimbatore, Tamilnadu,
Globally Optimal Crowdsourcing Quality Management
Globally Optimal Crowdsourcing Quality Management Akash Das Sarma Stanford University [email protected] Aditya G. Parameswaran University of Illinois (UIUC) [email protected] Jennifer Widom Stanford
Applied Data Mining Analysis: A Step-by-Step Introduction Using Real-World Data Sets
Applied Data Mining Analysis: A Step-by-Step Introduction Using Real-World Data Sets http://info.salford-systems.com/jsm-2015-ctw August 2015 Salford Systems Course Outline Demonstration of two classification
Using News Articles to Predict Stock Price Movements
Using News Articles to Predict Stock Price Movements Győző Gidófalvi Department of Computer Science and Engineering University of California, San Diego La Jolla, CA 9237 [email protected] 21, June 15,
Classification of Bad Accounts in Credit Card Industry
Classification of Bad Accounts in Credit Card Industry Chengwei Yuan December 12, 2014 Introduction Risk management is critical for a credit card company to survive in such competing industry. In addition
CS231M Project Report - Automated Real-Time Face Tracking and Blending
CS231M Project Report - Automated Real-Time Face Tracking and Blending Steven Lee, [email protected] June 6, 2015 1 Introduction Summary statement: The goal of this project is to create an Android
Innovative Techniques and Tools to Detect Data Quality Problems
Paper DM05 Innovative Techniques and Tools to Detect Data Quality Problems Hong Qi and Allan Glaser Merck & Co., Inc., Upper Gwynnedd, PA ABSTRACT High quality data are essential for accurate and meaningful
A Property & Casualty Insurance Predictive Modeling Process in SAS
Paper AA-02-2015 A Property & Casualty Insurance Predictive Modeling Process in SAS 1.0 ABSTRACT Mei Najim, Sedgwick Claim Management Services, Chicago, Illinois Predictive analytics has been developing
The Scientific Data Mining Process
Chapter 4 The Scientific Data Mining Process When I use a word, Humpty Dumpty said, in rather a scornful tone, it means just what I choose it to mean neither more nor less. Lewis Carroll [87, p. 214] In
Data Mining in Web Search Engine Optimization and User Assisted Rank Results
Data Mining in Web Search Engine Optimization and User Assisted Rank Results Minky Jindal Institute of Technology and Management Gurgaon 122017, Haryana, India Nisha kharb Institute of Technology and Management
Experiments in Web Page Classification for Semantic Web
Experiments in Web Page Classification for Semantic Web Asad Satti, Nick Cercone, Vlado Kešelj Faculty of Computer Science, Dalhousie University E-mail: {rashid,nick,vlado}@cs.dal.ca Abstract We address
Building a Data Quality Scorecard for Operational Data Governance
Building a Data Quality Scorecard for Operational Data Governance A White Paper by David Loshin WHITE PAPER Table of Contents Introduction.... 1 Establishing Business Objectives.... 1 Business Drivers...
A Test Collection for Email Entity Linking
A Test Collection for Email Entity Linking Ning Gao, Douglas W. Oard College of Information Studies/UMIACS University of Maryland, College Park {ninggao,oard}@umd.edu Mark Dredze Human Language Technology
Web Analytics Definitions Approved August 16, 2007
Web Analytics Definitions Approved August 16, 2007 Web Analytics Association 2300 M Street, Suite 800 Washington DC 20037 [email protected] 1-800-349-1070 Licensed under a Creative
Data Mining Applications in Higher Education
Executive report Data Mining Applications in Higher Education Jing Luan, PhD Chief Planning and Research Officer, Cabrillo College Founder, Knowledge Discovery Laboratories Table of contents Introduction..............................................................2
Cognitive and Organizational Challenges of Big Data in Cyber Defense
Cognitive and Organizational Challenges of Big Data in Cyber Defense Nathan Bos & John Gersh Johns Hopkins University Applied Laboratory [email protected], [email protected] The cognitive and organizational
Using Data Mining Methods to Predict Personally Identifiable Information in Emails
Using Data Mining Methods to Predict Personally Identifiable Information in Emails Liqiang Geng 1, Larry Korba 1, Xin Wang, Yunli Wang 1, Hongyu Liu 1, Yonghua You 1 1 Institute of Information Technology,
Support Vector Machines with Clustering for Training with Very Large Datasets
Support Vector Machines with Clustering for Training with Very Large Datasets Theodoros Evgeniou Technology Management INSEAD Bd de Constance, Fontainebleau 77300, France [email protected] Massimiliano
Clustering Technique in Data Mining for Text Documents
Clustering Technique in Data Mining for Text Documents Ms.J.Sathya Priya Assistant Professor Dept Of Information Technology. Velammal Engineering College. Chennai. Ms.S.Priyadharshini Assistant Professor
A Platform for Supporting Data Analytics on Twitter: Challenges and Objectives 1
A Platform for Supporting Data Analytics on Twitter: Challenges and Objectives 1 Yannis Stavrakas Vassilis Plachouras IMIS / RC ATHENA Athens, Greece {yannis, vplachouras}@imis.athena-innovation.gr Abstract.
How To Check For Differences In The One Way Anova
MINITAB ASSISTANT WHITE PAPER This paper explains the research conducted by Minitab statisticians to develop the methods and data checks used in the Assistant in Minitab 17 Statistical Software. One-Way
Building a Question Classifier for a TREC-Style Question Answering System
Building a Question Classifier for a TREC-Style Question Answering System Richard May & Ari Steinberg Topic: Question Classification We define Question Classification (QC) here to be the task that, given
STATISTICA Formula Guide: Logistic Regression. Table of Contents
: Table of Contents... 1 Overview of Model... 1 Dispersion... 2 Parameterization... 3 Sigma-Restricted Model... 3 Overparameterized Model... 4 Reference Coding... 4 Model Summary (Summary Tab)... 5 Summary
Combining Global and Personal Anti-Spam Filtering
Combining Global and Personal Anti-Spam Filtering Richard Segal IBM Research Hawthorne, NY 10532 Abstract Many of the first successful applications of statistical learning to anti-spam filtering were personalized
D-optimal plans in observational studies
D-optimal plans in observational studies Constanze Pumplün Stefan Rüping Katharina Morik Claus Weihs October 11, 2005 Abstract This paper investigates the use of Design of Experiments in observational
Direct Marketing When There Are Voluntary Buyers
Direct Marketing When There Are Voluntary Buyers Yi-Ting Lai and Ke Wang Simon Fraser University {llai2, wangk}@cs.sfu.ca Daymond Ling, Hua Shi, and Jason Zhang Canadian Imperial Bank of Commerce {Daymond.Ling,
Invited Applications Paper
Invited Applications Paper - - Thore Graepel Joaquin Quiñonero Candela Thomas Borchert Ralf Herbrich Microsoft Research Ltd., 7 J J Thomson Avenue, Cambridge CB3 0FB, UK [email protected] [email protected]
How To Perform An Ensemble Analysis
Charu C. Aggarwal IBM T J Watson Research Center Yorktown, NY 10598 Outlier Ensembles Keynote, Outlier Detection and Description Workshop, 2013 Based on the ACM SIGKDD Explorations Position Paper: Outlier
Big Data & Scripting Part II Streaming Algorithms
Big Data & Scripting Part II Streaming Algorithms 1, 2, a note on sampling and filtering sampling: (randomly) choose a representative subset filtering: given some criterion (e.g. membership in a set),
2. Basic Relational Data Model
2. Basic Relational Data Model 2.1 Introduction Basic concepts of information models, their realisation in databases comprising data objects and object relationships, and their management by DBMS s that
Conclusions and Future Directions
Chapter 9 This chapter summarizes the thesis with discussion of (a) the findings and the contributions to the state-of-the-art in the disciplines covered by this work, and (b) future work, those directions
Using Data Mining Techniques for Analyzing Pottery Databases
BAR-ILAN UNIVERSITY Using Data Mining Techniques for Analyzing Pottery Databases Zachi Zweig Submitted in partial fulfillment of the requirements for the Master s degree in the Martin (Szusz) Department
Graph Mining and Social Network Analysis
Graph Mining and Social Network Analysis Data Mining and Text Mining (UIC 583 @ Politecnico di Milano) References Jiawei Han and Micheline Kamber, "Data Mining: Concepts and Techniques", The Morgan Kaufmann
Distributed forests for MapReduce-based machine learning
Distributed forests for MapReduce-based machine learning Ryoji Wakayama, Ryuei Murata, Akisato Kimura, Takayoshi Yamashita, Yuji Yamauchi, Hironobu Fujiyoshi Chubu University, Japan. NTT Communication
Working with telecommunications
Working with telecommunications Minimizing churn in the telecommunications industry Contents: 1 Churn analysis using data mining 2 Customer churn analysis with IBM SPSS Modeler 3 Types of analysis 3 Feature
Chapter 6: Episode discovery process
Chapter 6: Episode discovery process Algorithmic Methods of Data Mining, Fall 2005, Chapter 6: Episode discovery process 1 6. Episode discovery process The knowledge discovery process KDD process of analyzing
PartJoin: An Efficient Storage and Query Execution for Data Warehouses
PartJoin: An Efficient Storage and Query Execution for Data Warehouses Ladjel Bellatreche 1, Michel Schneider 2, Mukesh Mohania 3, and Bharat Bhargava 4 1 IMERIR, Perpignan, FRANCE [email protected] 2
CHAPTER-24 Mining Spatial Databases
CHAPTER-24 Mining Spatial Databases 24.1 Introduction 24.2 Spatial Data Cube Construction and Spatial OLAP 24.3 Spatial Association Analysis 24.4 Spatial Clustering Methods 24.5 Spatial Classification
How to Analyze Company Using Social Network?
How to Analyze Company Using Social Network? Sebastian Palus 1, Piotr Bródka 1, Przemysław Kazienko 1 1 Wroclaw University of Technology, Wyb. Wyspianskiego 27, 50-370 Wroclaw, Poland {sebastian.palus,
Chapter 8. Final Results on Dutch Senseval-2 Test Data
Chapter 8 Final Results on Dutch Senseval-2 Test Data The general idea of testing is to assess how well a given model works and that can only be done properly on data that has not been seen before. Supervised
Data Cleansing for Remote Battery System Monitoring
Data Cleansing for Remote Battery System Monitoring Gregory W. Ratcliff Randall Wald Taghi M. Khoshgoftaar Director, Life Cycle Management Senior Research Associate Director, Data Mining and Emerson Network
Subgraph Patterns: Network Motifs and Graphlets. Pedro Ribeiro
Subgraph Patterns: Network Motifs and Graphlets Pedro Ribeiro Analyzing Complex Networks We have been talking about extracting information from networks Some possible tasks: General Patterns Ex: scale-free,
Blog Post Extraction Using Title Finding
Blog Post Extraction Using Title Finding Linhai Song 1, 2, Xueqi Cheng 1, Yan Guo 1, Bo Wu 1, 2, Yu Wang 1, 2 1 Institute of Computing Technology, Chinese Academy of Sciences, Beijing 2 Graduate School
IBM SPSS Direct Marketing 19
IBM SPSS Direct Marketing 19 Note: Before using this information and the product it supports, read the general information under Notices on p. 105. This document contains proprietary information of SPSS
Best Practices for Hadoop Data Analysis with Tableau
Best Practices for Hadoop Data Analysis with Tableau September 2013 2013 Hortonworks Inc. http:// Tableau 6.1.4 introduced the ability to visualize large, complex data stored in Apache Hadoop with Hortonworks
A Content based Spam Filtering Using Optical Back Propagation Technique
A Content based Spam Filtering Using Optical Back Propagation Technique Sarab M. Hameed 1, Noor Alhuda J. Mohammed 2 Department of Computer Science, College of Science, University of Baghdad - Iraq ABSTRACT
KnowledgeSTUDIO HIGH-PERFORMANCE PREDICTIVE ANALYTICS USING ADVANCED MODELING TECHNIQUES
HIGH-PERFORMANCE PREDICTIVE ANALYTICS USING ADVANCED MODELING TECHNIQUES Translating data into business value requires the right data mining and modeling techniques which uncover important patterns within
