Email Task Management: An Iterative Relational Learning Approach


Rinat Khoussainov and Nicholas Kushmerick
School of Computer Science and Informatics, University College Dublin, Ireland {rinat,

Abstract

Today's email clients were designed for yesterday's email. Originally, email was merely a communication medium. Today, people engage in a variety of complex behaviours using email, such as project management, collaboration, meeting scheduling, to-do tracking, etc. Our goal is to develop automated techniques to help people manage complex activities or tasks in email. The central challenge is that most activities are distributed over multiple messages, yet email clients allow users to manipulate just isolated messages. We describe machine learning approaches to identifying tasks and relations between individual messages in a task (i.e., finding cause-response links between emails) and for semantic message analysis (i.e., extracting metadata about how messages within a task relate to the task's progress). Our key innovation compared to related work is that we exploit the relational structure of these two problems. Instead of attacking them separately, in our synergistic iterative approach, relations identification is used to assist semantic analysis, and vice versa. Our experiments with real-world email corpora demonstrate an improvement compared to non-relational benchmarks.

1 Introduction and Background

A large proportion of every-day information comes to us in the form of natural language text, with email being one of the biggest sources. Many people spend significant amounts of time handling their email, and email overload is becoming a critical problem [Whittaker and Sidner, 1996]. One of the reasons for this problem is that email has transformed over the years from a communication medium for simple message exchange into a habitat [Ducheneaut and Bellotti, 2001], an environment where users engage in a variety of complex activities or tasks. Examples include meeting scheduling, using email for reminders and to-do lists, e-commerce transactions, project management and collaborative work. In many cases, email serves as the primary interface to one's workplace. However, email clients offer little support for this activity-oriented use of email. Email programs are still designed mainly to manipulate individual messages. As a result, their advanced features in automated email management are confined to filtering individual messages and simple message threading.

Our goal is to provide task support in email that would help users manage their email-based activities more effectively. In many cases, such activities manifest the user's participation in various structured processes or workflows. Such processes are represented in email by groups of related messages, where each message can indicate some change of the process status. We are interested in being able to recognise such tasks in email, organise messages within a task, and track the task's progress. For example, a consumer purchasing from an e-commerce vendor may receive emails that describe the current status of the transaction, such as order confirmations, notifications of shipment, or information about shipment delays or cancellations. A manager may receive a series of messages about hiring a new candidate, such as application documents, reminders about upcoming interviews, references, and decision notifications. A single user may be involved in many such activities simultaneously. A task-oriented email client would allow the user to manage activities rather than separate messages.
For instance, one can envisage the user being able to quickly inquire about the current status of unfinished e-commerce transactions or check the outcome of recent job interviews. Some process steps could be automated, such as automatically sending reminders for earlier requests. Similarly, the email client could remind the user when their input is required in some activity, or prominently notify them when an expected document arrives in a high-priority task.

Previous work in this area has mainly focused on two distinct problems: finding related messages and semantic message analysis. The goal of finding related messages is to group messages into tasks, i.e., sets of messages reflecting the user's participation in some process or workflow, and, possibly, to establish conversational links between messages within a task. Note that tasks need not correspond to threads: a task can have multiple conversation threads, and a thread can be related to several tasks. Specifically, In-Reply-To headers may not reflect the correct relations between messages in a task, since users often use the Reply button to start a new conversation, or neglect to use it when continuing an existing conversation. Likewise, organising emails into folders can be orthogonal to tasks. For instance, users may have a single folder for urgent messages (often the Inbox), while these messages can be from different tasks.

Semantic message analysis involves generating metadata for individual messages in a task that provides a link between the messages and the changes in the status of the underlying process, or the actions of the user in the underlying workflow. For example, a message can be described as an e-commerce order confirmation, a meeting proposal, or a delivery of information. Of course, such semantic analysis requires making assumptions regarding what actions are available to the user in an activity, what state transitions are possible in a process, etc. Notice, however, that such assumptions would not actually limit the user: they would exist only in the email client program to provide the advanced task-oriented functions.

Our work is based on the idea that related messages in a task provide a valuable context that can be used for semantic message analysis. Similarly, the activity-related metadata in separate messages can provide relational clues that can be used to establish links between messages and subsequently group them into tasks. (See Sec. 3 for more detailed motivational examples.) Instead of treating these two problems separately, we propose a synergetic iterative approach where identifying related messages is used to assist semantic message analysis and vice versa. Our key innovation compared to related work is that we exploit the relational structure of these two tasks.

There has been substantial research recently on automatically finding related messages [Rhodes and Maes, 2000], sorting messages into folders [Crawford et al., 2002], and extracting a set of related messages (a task) from email given a seed message [Dredze et al., 2004]. Also, several tools have been proposed that try to utilise this information to help users manage their email [Winograd, 1987; Bellotti et al., 2003]. For semantic message analysis, [Cohen et al., 2004] proposed machine learning methods to classify emails according to the intent of the sender, expressed in a verb-noun ontology of speech acts. Examples of such speech acts are Propose meeting, Deliver information, etc. A similar problem has also been tackled in [Horvitz, 1999], where a calendar management system is described based on classifying and extracting information from emails having to do with meeting requests.

We make the following contributions: (1) We investigate several methods for identifying relations between messages and grouping emails into tasks. We use pair-wise message similarity to find potentially related messages, and hierarchical agglomerative clustering [Willett, 1988] is used to group messages into tasks.
We extend the message similarity function to take into account not only the textual similarity between messages, but also the structured information available in email, such as send dates and message subjects. (2) We propose a relational learning approach [Neville and Jensen, 2000] to task management that uses relations identification for semantic message analysis and vice versa. In particular, we investigate how (a) features of related messages in the same task can assist with the classification of speech acts, and how (b) information about message speech acts can assist with finding related messages and grouping them into tasks. Combining these two methods yields an iterative relational algorithm for speech act classification and relations identification. (3) We evaluate our methods on real-life email.

The rest of the paper is organised as follows: Sec. 2 describes the corpora used in this study and introduces speech acts, the semantic metadata that we use; Sec. 3 presents the overall approach and formulates the specific problems that we address; Sec. 4, 5, and 6 describe our three main contributions; finally, Sec. 7 concludes the paper with a discussion and an outlook on future work.

2 Corpora

Although email is ubiquitous, large and realistic email corpora are rarely available for research purposes due to privacy concerns. We used two different real-world corpora for our study. Unfortunately, the need to manually annotate emails has limited both the number of corpora used in our study and the size of each corpus.

The first corpus contains messages that resulted from interaction between users and online e-commerce vendors [Kushmerick and Lau, 2005]. It contains messages from six vendors: half.com, eddiebauer.com, ebay.com, freshdirect.com, amazon.com and petfooddirect.com. These messages reflect the user's participation in online auctions and purchasing of goods in online shops. The corpus contains 111 messages from 39 business transactions. Examples of the messages include confirmations of online orders, shipment notifications, and auction bid status. Obviously, most of these messages were automatically generated by the vendor's software. The messages in this corpus were manually grouped into tasks corresponding to different transactions and annotated with the task labels.

The second corpus used in our study, the PWCALO corpus [Cohen et al., 2004], was generated during a four-day exercise conducted at SRI specifically to generate an email corpus. During this time, a group of six people assumed different work roles (e.g. project leader, finance manager, researcher) and performed a number of activities. Each email has been manually annotated with two types of labels. The first type of labelling groups messages into tasks corresponding to separate conversations and establishes links between messages similarly to the In-Reply-To header, except that in our case the links are semantic. That is, even if two messages do not have an In-Reply-To header that links them (in fact, none of the messages in our corpus had such headers) or they have different Subject headers, but are semantically related, they are linked together. One message becomes the cause of the other, and the latter message is a response to the former.

The second type of annotation for this corpus represents the messages' semantics. In particular, each message has been annotated according to the intent of the sender, expressed in a verb-noun ontology of speech acts. The ontology consists of verbs, such as Propose, Request, Deliver, and nouns, such as Information, Meeting, Data, Opinion. The details of the annotation ontology for this corpus are presented in [Cohen et al., 2004]. Each speech act is composed of a verb and a noun. Examples of such speech acts are Propose meeting, Deliver information, etc. A message may contain multiple acts. For the purposes of this study, we use only the 5 most frequent verbs, giving us 5 different speech acts: Propose, Request, Deliver, Commit, and Amend.

To perform experiments in relational learning, we need to ensure that our training and testing sets are unrelated. So, we generated two independent sets from the PWCALO corpus. The first corpus, called User 1, contains all emails sent or received by one participant (160 emails in total). The second corpus, called User 2, contains all emails sent or received by another participant but not received by the first one (33 emails in total).

3 Problem Decomposition

The idea behind our approach is that related messages in a task provide a valuable context that can be used for semantic message analysis. Similarly, the activity-related metadata in separate messages can provide relational clues that can be used to establish links between emails and group them into tasks. Instead of treating these two problems separately, we propose an iterative synergetic approach based on relational learning, where task identification is used to assist semantic message analysis and vice versa.

1: Identify initial relations (Problem 1)
2: Generate initial speech acts (Problem 2)
loop
  3: Use related emails in the same task to clarify speech acts (Problem 3)
  4: Use speech acts to clarify relations between messages (Problem 4)
end loop

Figure 1: Iterative relational learning approach to task management.

In a non-relational approach, we would use the content of a message to assign speech act labels to that message. Similarly, we would use some content similarity between messages to decide whether these messages are related (and, hence, belong to the same task). In a relational approach to speech act classification, we can use both the message content and features of the related messages from the same task.
More specifically, we propose to use the message content and the speech acts of related messages when deciding on the speech acts of a given message. For example, if a message is a response to a meeting proposal, then it is more likely to be a meeting confirmation or refusal, and this additional evidence can be combined with information obtained from the message content. Similarly, we can use messages' speech acts to improve relations identification. For example, if the suspected cause message is classified as a meeting proposal and the suspected response message is classified as a meeting confirmation, they are more likely to be related than a pair where a request follows a confirmation. Therefore, we can identify the following four problems:

Problem 1: Identify relations between emails using content similarity only (i.e. without using the messages' speech acts);
Problem 2: Classify messages into speech acts (semantic message analysis) using the messages' content features only (i.e. without using information about related messages in the same task);
Problem 3: Use the identified related messages to improve the quality of speech act classification for a given message;
Problem 4: Use the messages' speech acts to improve the identification of relations (links) between emails.

The outputs from these four problems can then be combined into a synergetic approach to task management based on the iterative relational classification algorithm illustrated in Figure 1. The presented algorithm can be viewed as a process of iterative re-assignment of speech act and relation labels and is similar to the collective classification procedure described in [Neville and Jensen, 2000].
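To make this decomposition concrete, the following skeleton shows how the four problems could be wired together in code. It is only an illustrative sketch under assumed interfaces: the Message container and the stub function names are hypothetical, not the authors' implementation.

```python
from dataclasses import dataclass, field

@dataclass
class Message:
    """Minimal email representation assumed by this sketch."""
    text: str
    subject: str = ""
    send_time: float = 0.0                          # e.g. seconds since the epoch
    parent: "Message | None" = None                 # cause-response link (Problems 1 and 4)
    speech_acts: set = field(default_factory=set)   # labels from Problems 2 and 3

def identify_initial_relations(messages): ...       # Problem 1: content similarity only
def classify_speech_acts(messages): ...             # Problem 2: message content only
def refine_speech_acts(messages): ...               # Problem 3: use related messages' acts
def refine_relations(messages): ...                 # Problem 4: use speech acts as evidence

def manage_tasks(messages, iterations=3):
    """The loop of Figure 1: alternate between the two refinement steps."""
    identify_initial_relations(messages)
    classify_speech_acts(messages)
    for _ in range(iterations):
        refine_speech_acts(messages)
        refine_relations(messages)
    return messages
```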

In this paper, we investigate possible approaches to solving Problems 1, 3, and 4. For solving Problem 2, we use standard text classification methods with bag-of-words document representations, which have already been evaluated in [Cohen et al., 2004]. Finally, we combine the proposed methods into the iterative classification algorithm and evaluate its performance.

4 Finding Related Messages (Problems 1, 4)

Grouping messages into tasks. Our approach to grouping messages into tasks is based on a hierarchical agglomerative clustering (HAC) algorithm [Willett, 1988]. A HAC algorithm requires a similarity function between messages. The algorithm starts with each message assigned to a separate cluster and then keeps merging pairs of clusters until some threshold condition is met. The two clusters merged at each step are selected as the two most similar clusters. We use the complete link method, where the similarity of two clusters is given by the smallest pair-wise similarity of messages from each cluster. To derive the stopping condition for clustering, we calculate the average pairwise similarity for all messages in the corpus. The clustering stops when the similarity between the two most similar clusters becomes less than the corpus-wide average similarity.

The clustering performance depends crucially on the similarity function. For text clustering, it is usually defined using the TF/IDF-based cosine similarity measure. In the case of email, however, we have additional structured information, such as the message subject and send date. We propose the following modifications to the similarity function and to the clustering method itself.

Cos, S: The similarity is defined as the TF/IDF cosine similarity between message texts. However, the terms appearing in the subject are given a higher weight. The idea is that people usually try to summarise the content in the subject line and, hence, subject terms are more important.

Cos, T: The similarity is defined as the cosine text similarity with time decay. The idea is that messages from the same task tend to be sent around the same time. Therefore, two messages with a large send time difference are less likely to be in the same task. We use the following formula: Sim(m1, m2) = CosineSim(m1, m2) · exp(-α · NormTimeDiff(m1, m2)), where CosineSim is the cosine message similarity, NormTimeDiff(m1, m2) is the time difference between the messages divided by the maximum time difference, and α is a decay parameter.

Cos, S+T: This method combines the subject-based message similarity with time decay (hopefully combining the benefits of the two previous methods).

SC: The average similarity between messages in large corpora is usually very small. This causes the clustering algorithm to produce a small number of large clusters with several tasks per cluster, resulting in a higher task identification recall but lower precision (see below for the definitions of precision and recall in this case). To address this problem, we run the HAC algorithm recursively on the clusters obtained after the top-level clustering, i.e. we sub-cluster them. Note that the stopping criterion for each such sub-clustering is calculated within the given cluster, not corpus-wide.

SC, TRW: Because different tasks may use different terminology, it may be possible to improve sub-clustering if we adjust the term weights to reflect their importance within the given sub-cluster.
To do this, we calculate IDFs within the given top-level cluster and then multiply the corpus-wide IDFs by the cluster-specific IDFs. We call this modification term re-weighting.

SC+Thr, TRW: Sub-clustering all top-level clusters indiscriminately may result in some single tasks being split across multiple clusters. This would result in a drop in task identification recall (see below for the definitions of precision and recall in this case). We propose to use a sub-clustering threshold to prevent this. For each top-level cluster, we calculate its density as the average pair-wise similarity between messages in that cluster. We then calculate the average density and only sub-cluster the top-level clusters with a smaller than average density.
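To make the procedure concrete, here is a minimal sketch of the Cos, S+T similarity and the complete-link HAC step with the corpus-average stopping rule. It is a simplification, not the authors' implementation: messages are assumed to be dicts with text, subject and time fields, plain term counts stand in for the TF/IDF weighting, and the SC and TRW refinements are omitted.

```python
import math
from collections import Counter

def vectorize(msg, subject_boost=2.0):
    """Bag-of-words term weights; subject terms get extra weight (assumed boost)."""
    vec = Counter(msg["text"].lower().split())
    for term in msg["subject"].lower().split():
        vec[term] += subject_boost
    return vec

def cosine(v1, v2):
    dot = sum(w * v2[t] for t, w in v1.items() if t in v2)
    n1 = math.sqrt(sum(w * w for w in v1.values()))
    n2 = math.sqrt(sum(w * w for w in v2.values()))
    return dot / (n1 * n2) if n1 and n2 else 0.0

def similarity(m1, m2, max_time_diff, alpha=1.0):
    """Cos, S+T: subject-weighted cosine similarity with exponential time decay."""
    base = cosine(vectorize(m1), vectorize(m2))
    norm_diff = abs(m1["time"] - m2["time"]) / max_time_diff if max_time_diff else 0.0
    return base * math.exp(-alpha * norm_diff)

def cluster_tasks(messages, alpha=1.0):
    """Complete-link HAC: keep merging the two most similar clusters until the best
    merge falls below the corpus-wide average pairwise similarity."""
    n = len(messages)
    max_time_diff = max((abs(a["time"] - b["time"])
                         for i, a in enumerate(messages) for b in messages[i + 1:]),
                        default=0.0)
    sim = {(i, j): similarity(messages[i], messages[j], max_time_diff, alpha)
           for i in range(n) for j in range(i + 1, n)}
    stop = sum(sim.values()) / len(sim) if sim else 0.0   # corpus-wide average

    clusters = [{i} for i in range(n)]
    def link(c1, c2):                 # complete link: smallest pairwise similarity
        return min(sim[(min(i, j), max(i, j))] for i in c1 for j in c2)

    while len(clusters) > 1:
        a, b = max(((x, y) for x in range(len(clusters))
                    for y in range(x + 1, len(clusters))),
                   key=lambda p: link(clusters[p[0]], clusters[p[1]]))
        if link(clusters[a], clusters[b]) < stop:
            break
        clusters = ([c for k, c in enumerate(clusters) if k not in (a, b)]
                    + [clusters[a] | clusters[b]])
    return clusters
```

The SC variant would call cluster_tasks again inside each resulting cluster, with the stopping threshold (and, for TRW, the term weights) recomputed from that cluster alone.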

We compared these various methods on both the e-commerce corpus and the free-text PWCALO email. To evaluate our algorithm, we must compare the extracted tasks with a set of human-annotated tasks. Because there may be a different number of tasks in each case, we cannot simply use traditional classification performance measures evaluating the agreement between two task labellings for a given data set. Instead, we must compare two clusterings: one is generated by our clustering algorithm; the other one (the correct clustering) is defined by a human annotator who partitioned the emails into tasks. To compare two clusterings, we use precision and recall measures as defined in [Kushmerick and Heß, 2003]. Consider all possible pair-wise combinations of messages in a given corpus. Let a be the number of pairs of messages that are correctly assigned to the same task (where the correct assignments are given by the human-generated clustering); let b be the number of message pairs that are incorrectly assigned to the same task; and let c be the number of message pairs that are incorrectly assigned to different tasks. The clustering precision is defined as a/(a + b), recall as a/(a + c), and F1 is their harmonic mean. Tables 1 and 2 present the corresponding results for the different modifications of the algorithm described above.

[Table 1: Task identification, e-commerce corpus: precision, recall and F1 for the methods Cos; Cos, S; Cos, T; Cos, S+T; SC; SC, TRW; SC+Thr, TRW]

[Table 2: Task identification, free-text corpus (User 1 subset): precision, recall and F1 for the methods Cos; Cos, S; Cos, T; Cos, S+T; SC; SC, TRW; SC+Thr, TRW]

The results show the importance of the additional features used in the message similarity function. In particular, we can see that taking into account the time difference between messages has a significant effect on the task identification quality on the e-commerce corpus. As we expected, using sub-clustering improves precision on both corpora, since it results in smaller, more precise clusters. However, it also hurts recall, because some single tasks become split across multiple clusters. Using the sub-clustering threshold helps to fix this problem on both corpora, resulting in a much better recall with only a slight loss in precision. Finally, term re-weighting proves very effective for the e-commerce corpus, but does not work so well for the free-text email. The reason for this is that e-commerce emails contain a lot of standard automatically generated text, which is not very helpful in distinguishing between different transactions. Term re-weighting helps to make the unique transaction data (such as transaction IDs or details of purchased goods) more prominent and, thus, more useful.

Overall, our approach achieves better performance on the e-commerce corpus than on the free-text email. While this is an indication that free text is more difficult to work with, we believe it also shows the inherent difference between the two corpora. The tasks in e-commerce email are well-structured: there is a clear understanding of how to group messages into tasks by mapping them onto the underlying business transactions. In free-text email, the definition of a task is fuzzier, and even human annotators may disagree on the correct task partitioning.

Finding links between messages without using speech acts (Problem 1). In our PWCALO corpus, each message is a response to at most one other message (i.e. all conversations have a tree structure rather than a more general DAG). Therefore, our algorithm for finding related messages proceeds as follows. For each email in the corpus, we find the most similar preceding (in time) email using the pair-wise message similarity function proposed in Sec. 4. We then test a pruning condition for these two messages (as described below) and, if the pruning test fails, we establish a link between the messages.

There may be multiple pairs of messages with non-zero similarity in a corpus. Clearly, not all of these messages are actually related. Therefore, we would like to be able to prune the links suggested by the similarity function. One way is to use a threshold value: if the similarity between two messages is below the threshold, then we assume that the messages are not related. Table 3 presents the quality of relations identification for different modifications of the similarity function, tested on the User 1 subset of the PWCALO corpus.

[Table 3: Finding relations without using speech acts (User 1 free-text subset): precision, recall and F1 for Cos; Cos, T; Cos, S; Cos, S+T]
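A minimal sketch of this link-finding step, assuming messages carry a send time and a pairwise similarity(m1, m2) callable such as the Cos, S+T function sketched in Sec. 4:

```python
def find_links(messages, similarity, threshold):
    """Link each message to its most similar earlier message,
    unless the best similarity falls below the pruning threshold."""
    ordered = sorted(messages, key=lambda m: m["time"])
    links = []                      # (parent, child) pairs
    for i in range(1, len(ordered)):
        child = ordered[i]
        parent = max(ordered[:i], key=lambda m: similarity(m, child))
        if similarity(parent, child) >= threshold:
            links.append((parent, child))
    return links
```

The fixed threshold plays the role of the pruning condition evaluated in Table 3; Problem 4 below replaces it with a learned classifier.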
Finding links between messages using speech acts (Problem 4). We treat the problem of finding links between messages using speech acts as a supervised learning task. We assume that we have access to a training set which provides the correct labels for both speech acts and message relations. The goal is to use this information to improve our performance on an unseen corpus. There are multiple ways in which the speech act information can be taken into account when establishing links between messages. For example, we could calculate the empirical probabilities of one speech act following another in the training set, and use this information to modify the similarity function used on a test set. Since there can be numerous such modifications, we decided to rely instead on a classification algorithm to identify useful speech act patterns.

From the given labelled corpus, we produce a set of training instances as follows. For each message in the corpus (the child), we identify the most similar preceding message (the parent) using the previously defined similarity function. For each such pair of messages, we create one training instance with one numeric feature for the similarity between the messages and two subsets of binary features, one feature per possible speech act in each subset (10 features in total). The first binary subset is filled with the speech acts of the parent message: 1 if the message has this speech act, 0 otherwise. The second binary subset is filled with the speech acts of the child message. The class label for the instance is positive if the corresponding messages are related and negative otherwise. The resulting classifier can then be used to identify links between messages in an unseen corpus.
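A sketch of how such training instances could be assembled and a classifier fitted. The SPEECH_ACTS list, the pair format and scikit-learn's SVC (standing in here for the paper's SMO-trained SVM) are assumptions of this sketch:

```python
from sklearn.svm import SVC

SPEECH_ACTS = ["Propose", "Request", "Deliver", "Commit", "Amend"]

def act_features(acts):
    """One binary feature per possible speech act (5 features)."""
    return [1.0 if a in acts else 0.0 for a in SPEECH_ACTS]

def link_instance(similarity, parent_acts, child_acts):
    """One similarity feature plus 10 binary speech-act features (parent block, then child block)."""
    return [similarity] + act_features(parent_acts) + act_features(child_acts)

def train_link_classifier(pairs):
    """pairs: iterable of (similarity, parent_acts, child_acts, is_related) tuples,
    one per most-similar-preceding-message pair in the training corpus."""
    X = [link_instance(s, p, c) for s, p, c, _ in pairs]
    y = [int(related) for _, _, _, related in pairs]
    return SVC(kernel="linear").fit(X, y)

# At test time the classifier replaces the fixed similarity threshold:
# keep_link = clf.predict([link_instance(sim, parent_acts, child_acts)])[0] == 1
```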

To evaluate the potential for improvement from using speech acts, we tried to train and test a classifier on the same User 1 subset of PWCALO. Obviously, if we cannot get any improvement even on the training set, it is unlikely we will do well on an unseen test set. We use the SMO implementation of support vector machines as our classification algorithm [Platt, 1999]. We obtain precision 0.91 and recall 0.80, so that F1 is 0.85. Compared to Table 3, using speech acts was a more effective pruning method, resulting in an increase in precision with only a marginal loss in recall.

5 Classifying Speech Acts (Problems 2, 3)

Classifying speech acts without using related messages (Problem 2). We treat the problem of speech act classification as a supervised learning task. We use standard text classification methods with bag-of-words document representations, similar to [Cohen et al., 2004]. Each message is represented by a set of features, one for each term appearing in the emails. The feature value is equal to the number of occurrences of the corresponding term in the message text. An important detail in speech act classification is that emails may contain quoted text from previous messages. While such text fragments are usually helpful in establishing links between messages (since they make messages more similar), they can confuse a speech act classifier. Thus, we used a simple heuristic approach to identify the quoted text so that it can be removed before speech act classification.

Classifying speech acts using related messages (Problem 3). We adopt here the relational learning terminology used in [Neville and Jensen, 2000]. Again, each message is converted to a data instance represented by a set of features. However, unlike in the previous case (i.e. Problem 2), these features can now be derived from the content of the given message (intrinsic features) as well as from the properties of related messages in the same task (extrinsic features). The goal is to learn how to assign speech acts to emails from a training set of data instances labelled with both the correct speech acts and relations. The question we are studying here is whether the extrinsic features, obtained using identification of related emails, can help in classifying messages into speech acts. That is, we want to know whether the speech acts of surrounding messages can help in classifying the speech acts of a given message.

To represent the intrinsic features of a message, we use the raw term frequencies as in Problem 2. To represent the extrinsic features of a message, we use the speech acts of related messages. As we explained earlier, in our PWCALO corpus each message can have at most one parent and possibly multiple children. Consequently, each message can be viewed as a response to its parent message and as a cause for its children messages. In addition to looking at the immediate ancestors and descendants of a message, we can also include features from several generations of ancestors and descendants (e.g. parents, grandparents, children, grandchildren). For each generation of related ancestor and descendant messages, we use a separate set of extrinsic features with one feature per possible speech act. The value of 1 is distributed evenly between all features corresponding to the speech acts that occur in the related messages from a given generation of ancestors or descendants. For example, if the children messages have Deliver and Commit speech acts, then the immediate-descendant features for these two acts are equal to 0.5 and the other features are equal to 0.
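A small sketch of this extrinsic-feature construction for the immediate parent and children; deeper generations of ancestors and descendants (discussed next) would each get one additional block of features built the same way. The message fields and the SPEECH_ACTS list are assumptions of the sketch:

```python
SPEECH_ACTS = ["Propose", "Request", "Deliver", "Commit", "Amend"]

def generation_features(related):
    """Distribute a total weight of 1 evenly over the speech acts occurring in one
    generation of related messages (all zeros if the generation is empty)."""
    acts = {a for m in related for a in m["speech_acts"] if a in SPEECH_ACTS}
    w = 1.0 / len(acts) if acts else 0.0
    return [w if a in acts else 0.0 for a in SPEECH_ACTS]

def extrinsic_features(msg):
    """One feature block for the immediate ancestor, one for the immediate descendants."""
    parents = [msg["parent"]] if msg.get("parent") else []
    return generation_features(parents) + generation_features(msg.get("children", []))

# Children carrying Deliver and Commit give [0, 0, 0.5, 0.5, 0]
# for the descendant block, matching the example in the text.
```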
Another way would be to use binary features for each speech act: 1 if the speech act is present, 0 otherwise. However, we found that the former method works better. The number of generations included in the extrinsic features is regulated by the lookup-depth parameters. There are two lookup-depth parameters, one for ancestor messages and one for descendant messages. For a lookup depth of 0, we use only intrinsic features. For a lookup depth of 1, we use extrinsic features for the immediate ancestor and descendants (parent and children). For a lookup depth of 2, we use extrinsic features for the immediate ancestors and descendants as well as for the ancestors of the immediate ancestors (grandparents) and the descendants of the immediate descendants (grandchildren), and so on.

Results. The experiments here are based on the User 1 corpus. We evaluate the performance of speech act classification using the human-annotated (correct) relations between messages and the correct speech acts for the related messages. Essentially, we are interested in the question: can we improve speech act classification if we know the correct links between emails and the correct speech acts for the related messages? Notice that using the correct speech acts for related messages does not mean that we use the class label of an instance among its features. Each message uses only the speech acts of the related messages, but not its own speech acts. Of course, in a practical iterative classification method both the relations and the speech acts will have to be initialised using automatic methods (Problems 1, 2). The experiments here are used simply to confirm that improvements are possible in principle.

For each speech act, we produce a separate binary classification problem where the goal is to identify whether the message has this act or not. The resulting data sets for classification can be highly imbalanced, with a significant proportion of messages belonging to the same class (especially for rare or very frequent speech acts). Classification accuracy is not a good measure of performance on such data sets, so (following [Cohen et al., 2004]) we use the Kappa statistic instead. We use the SMO implementation of support vector machines as our classification algorithm [Platt, 1999], and the results are obtained from 5-fold cross-validation repeated 10 times for statistical significance testing (paired t-test).

[Table 4: Classifying speech acts using related emails (User 1 free-text subset): Kappa for each speech act (Amend, Commit, Deliver, Propose, Request) at ancestor/descendant lookup depths 0/0, 0/1, 1/0 and 1/1]

Table 4 shows the resulting Kappa values for different combinations of the lookup depth for ancestor/descendant messages. The statistically significant improvements over the baseline (both lookup depths equal to zero) are marked with v (at the 0.05 confidence level). Lookup depths above 1 are not shown due to lack of space. The results show statistically significant improvements for some speech acts (Request, Commit) and overall increases in the Kappa values. Of course, in a practical iterative classification method both the relations and the speech acts will have to be initialised using automatic methods (Problems 1, 2), and this is likely to affect the performance.

6 Iterative Algorithm for Task Management

The results in the previous sections demonstrated that the proposed methods for solving Problems 1, 2, 3 and 4, as identified in Sec. 3, show some promise for improving the performance of both relations identification and speech act classification. However, the numbers were obtained assuming knowledge of the correct (annotated) relations and/or speech acts in the test corpus. In this section, we put these separate methods together into an iterative relational algorithm and evaluate its performance in a realistic setting, where the test corpus is unlabelled. The resulting algorithm is shown in Figure 2. The SMO algorithm was used for all classifiers. To obtain confidence scores from SMO, we used the distance from the hyperplane normalised over all test instances. We use the similarity function with time decay (method Cos, S+T in Sec. 4) and threshold-based pruning to identify the initial links between messages (Problem 1).

for all speech acts a do
  Train classifier C_a on the training set to classify speech act a using only content (intrinsic features)
  Train classifier R_a on the training set to classify speech act a using content and related emails (intrinsic + extrinsic features)
Train classifier L on the training set to classify links
/* Problem 1 */ Set relations in the test set using the similarity function
/* Problem 2 */ Use classifiers C_a to set speech acts in the test set
/* Iterative classification */
for Iteration = 1 ... I do
  /* Problem 3 */ Threshold = 1
  for Subiteration = 1 ... K do
    for all messages m in the test set do
      for all speech acts a do
        Obtain confidence for "m has a" using R_a
        Obtain confidence for "m has no a" using R_a
        For all cases where the confidence for "m has a" / "m has no a" is greater than Threshold, update the speech acts of m accordingly
    Threshold = Threshold / 2
  Evaluate performance for speech acts
  /* Problem 4 */ Use L to find links between emails in the test set, update the links accordingly
  Evaluate performance for relations

Figure 2: Iterative algorithm for task management.
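A compact sketch of the Problem 3 step inside the loop of Figure 2, the re-labelling sub-iterations with a halving confidence threshold. The per-act relational classifiers are assumed to expose a confidence(message, label) score (for instance the normalised distance from the SVM hyperplane mentioned above); that interface, and the handling of the case where both labels clear the threshold, are assumptions of this sketch:

```python
def relabel_speech_acts(messages, relational_clf, speech_acts, K=10):
    """One main iteration of Figure 2 (Problem 3): commit only to confident updates,
    halving the confidence threshold after each sub-iteration."""
    threshold = 1.0
    for _ in range(K):
        for m in messages:
            for act in speech_acts:
                conf_has = relational_clf[act].confidence(m, True)
                conf_not = relational_clf[act].confidence(m, False)
                # Update only when the more confident prediction clears the threshold.
                if conf_has > threshold and conf_has >= conf_not:
                    m["speech_acts"].add(act)
                elif conf_not > threshold:
                    m["speech_acts"].discard(act)
        threshold /= 2.0
    return messages
```

After these K sub-iterations, the outer loop of Figure 2 applies the link classifier L (Problem 4) to update the relations and then repeats.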
Results. In our experiments with this algorithm, we used the User 1 subset of PWCALO as the training set and the User 2 subset as the test set. In each iteration, we repeated the inner speech act classification loop 10 times (K = 10). Figure 3 shows how the speech act classification performance changed during the first iteration (note the difference between the main iterations and the sub-iterations of speech act classification!). The performance for almost all speech acts improved considerably (except for Amend). The initial link identification resulted in precision = recall = F1 = 0.95. It improved after the first iteration to precision = 1.0, recall = 0.95, F1 = 0.98, and remained the same after the second and subsequent iterations (which is understandable, since speech acts mainly improve the link precision, and the precision after the first iteration was already 1). The performance of the speech act classification in the second and subsequent iterations remained unchanged as well. That is, the iterative classification process converged very rapidly in this case.

[Figure 3: Speech act classification performance (Kappa) for Deliver, Propose, Request, Commit and Amend over the sub-iterations of Iteration 1]

7 Discussion

In addition to mere communication, email has become a ubiquitous channel for distributing information among collaborators, workflow participants, or even for sending to-do items to oneself. Unfortunately, today's email clients offer very little high-level support for such complex behaviours. Developing such technologies is both a useful end in itself and a major challenge for artificial intelligence. We have described a variety of machine learning methods for identifying relations between emails and for analysis of message semantics.

The key idea is that we synergetically exploit the relational structure of these two tasks. We proposed an iterative relational classification algorithm for task management that uses relations identification for semantic message analysis and vice versa. The proposed approach depends crucially on whether we can improve semantic message analysis by using knowledge of the link structure, and whether we can improve link identification by using the semantic metadata in messages. We proposed methods for solving each of these problems and evaluated their potential effectiveness. Finally, we combined these methods into a single iterative algorithm and tested it on a real-world corpus.

Our experiments demonstrate that: (1) structured features in email, such as message subjects and send dates, can be very useful for the identification of related messages; (2) the properties of related messages in the same task can be used to improve semantic message analysis; in particular, the features of related messages in a task can improve the performance of speech act classification; (3) the semantic metadata in messages can be used to improve the quality of relations identification; in particular, taking into account the speech acts of messages helps to improve the identification of links between emails. Finally, our combined iterative classification algorithm was able to simultaneously improve performance on both speech acts and message relations. These results provide empirical evidence in favour of the proposed synergetic approach to task management.

There are numerous directions for future work. The presented experiments are obviously quite limited and small-scale. Therefore, experiments with different and larger corpora can be used to better evaluate quantitatively the achievable improvements in performance. Such experiments could also allow us to study the transferability of the classifiers between corpora and the degree to which speech act and link patterns are corpus- and person-independent. However, this work requires more annotated data, and this is one of the key challenges for the research community at present.

Acknowledgements: This research was funded by Science Foundation Ireland and the US Office of Naval Research.

References

[Bellotti et al., 2003] Bellotti, V., Ducheneaut, N., Howard, M., and Smith, I. (2003). Taking email to task: the design and evaluation of a task management centered email tool. In Proc. Conf. Human Factors in Computing Systems.

[Cohen et al., 2004] Cohen, W., Carvalho, V., and Mitchell, T. (2004). Learning to classify email into speech acts. In Empirical Methods in Natural Language Processing.

[Crawford et al., 2002] Crawford, E., Kay, J., and McCreath, E. (2002). An intelligent interface for sorting electronic mail. In Proc. Intelligent User Interfaces.

[Dredze et al., 2004] Dredze, M., Stylos, J., Lau, T., Kellogg, W., Danis, C., and Kushmerick, N. (2004). Taxie: Automatically identifying tasks in email. Unpublished manuscript.

[Ducheneaut and Bellotti, 2001] Ducheneaut, N. and Bellotti, V. (2001). Email as habitat: An exploration of embedded personal information management. ACM Interactions, 8(1).

[Horvitz, 1999] Horvitz, E. (1999). Principles of mixed-initiative user interfaces. In Proc. Conf. Human Factors in Computing Systems.

[Kushmerick and Heß, 2003] Kushmerick, N. and Heß, A. (2003). Learning to attach semantic metadata to web services. In Proc. Int. Semantic Web Conf.
[Kushmerick and Lau, 2005] Kushmerick, N. and Lau, T. (2005). Automated email activity management: An unsupervised learning approach. In Proc. Int. Conf. Intelligent User Interfaces.

[Neville and Jensen, 2000] Neville, J. and Jensen, D. (2000). Iterative classification in relational data. In AAAI Workshop on Learning Statistical Models from Relational Data.

[Platt, 1999] Platt, J. C. (1999). Fast Training of Support Vector Machines using Sequential Minimal Optimization, chapter 12. MIT Press.

[Rhodes and Maes, 2000] Rhodes, B. and Maes, P. (2000). Just-in-time information retrieval agents. IBM Systems Journal, 39(3).

[Whittaker and Sidner, 1996] Whittaker, S. and Sidner, C. (1996). Email overload: Exploring personal information management of email. In Proc. Conf. Human Factors in Computing Systems.

[Willett, 1988] Willett, P. (1988). Recent trends in hierarchical document clustering. Information Processing and Management, 24.

[Winograd, 1987] Winograd, T. (1987). A language/action perspective on the design of cooperative work. Human-Computer Interaction, 3(1).


More information

A Content based Spam Filtering Using Optical Back Propagation Technique

A Content based Spam Filtering Using Optical Back Propagation Technique A Content based Spam Filtering Using Optical Back Propagation Technique Sarab M. Hameed 1, Noor Alhuda J. Mohammed 2 Department of Computer Science, College of Science, University of Baghdad - Iraq ABSTRACT

More information

!"!!"#$$%&'()*+$(,%!"#$%$&'()*""%(+,'-*&./#-$&'(-&(0*".$#-$1"(2&."3$'45"

!!!#$$%&'()*+$(,%!#$%$&'()*%(+,'-*&./#-$&'(-&(0*.$#-$1(2&.3$'45 !"!!"#$$%&'()*+$(,%!"#$%$&'()*""%(+,'-*&./#-$&'(-&(0*".$#-$1"(2&."3$'45"!"#"$%&#'()*+',$$-.&#',/"-0%.12'32./4'5,5'6/%&)$).2&'7./&)8'5,5'9/2%.%3%&8':")08';:

More information

Data Mining in Personal Email Management

Data Mining in Personal Email Management Data Mining in Personal Email Management Gunjan Soni E-mail is still a popular mode of Internet communication and contains a large percentage of every-day information. Hence, email overload has grown over

More information

A semi-supervised Spam mail detector

A semi-supervised Spam mail detector A semi-supervised Spam mail detector Bernhard Pfahringer Department of Computer Science, University of Waikato, Hamilton, New Zealand Abstract. This document describes a novel semi-supervised approach

More information

KPN SMS mail. Send SMS as fast as e-mail!

KPN SMS mail. Send SMS as fast as e-mail! KPN SMS mail Send SMS as fast as e-mail! Quick start Start using KPN SMS mail in 5 steps If you want to install and use KPN SMS mail quickly, without reading the user guide, follow the next five steps.

More information

Reputation Network Analysis for Email Filtering

Reputation Network Analysis for Email Filtering Reputation Network Analysis for Email Filtering Jennifer Golbeck, James Hendler University of Maryland, College Park MINDSWAP 8400 Baltimore Avenue College Park, MD 20742 {golbeck, hendler}@cs.umd.edu

More information

PDF Primer PDF. White Paper

PDF Primer PDF. White Paper White Paper PDF Primer PDF What is PDF and what is it good for? How does PDF manage content? How is a PDF file structured? What are its capabilities? What are its limitations? Version: 1.0 Date: October

More information

KS3 Computing Group 1 Programme of Study 2015 2016 2 hours per week

KS3 Computing Group 1 Programme of Study 2015 2016 2 hours per week 1 07/09/15 2 14/09/15 3 21/09/15 4 28/09/15 Communication and Networks esafety Obtains content from the World Wide Web using a web browser. Understands the importance of communicating safely and respectfully

More information

Big Data Text Mining and Visualization. Anton Heijs

Big Data Text Mining and Visualization. Anton Heijs Copyright 2007 by Treparel Information Solutions BV. This report nor any part of it may be copied, circulated, quoted without prior written approval from Treparel7 Treparel Information Solutions BV Delftechpark

More information

Clustering Connectionist and Statistical Language Processing

Clustering Connectionist and Statistical Language Processing Clustering Connectionist and Statistical Language Processing Frank Keller keller@coli.uni-sb.de Computerlinguistik Universität des Saarlandes Clustering p.1/21 Overview clustering vs. classification supervised

More information

A Proposed Algorithm for Spam Filtering Emails by Hash Table Approach

A Proposed Algorithm for Spam Filtering Emails by Hash Table Approach International Research Journal of Applied and Basic Sciences 2013 Available online at www.irjabs.com ISSN 2251-838X / Vol, 4 (9): 2436-2441 Science Explorer Publications A Proposed Algorithm for Spam Filtering

More information

Collated Food Requirements. Received orders. Resolved orders. 4 Check for discrepancies * Unmatched orders

Collated Food Requirements. Received orders. Resolved orders. 4 Check for discrepancies * Unmatched orders Introduction to Data Flow Diagrams What are Data Flow Diagrams? Data Flow Diagrams (DFDs) model that perspective of the system that is most readily understood by users the flow of information around the

More information

ENHANCING INTELLIGENCE SUCCESS: DATA CHARACTERIZATION Francine Forney, Senior Management Consultant, Fuel Consulting, LLC May 2013

ENHANCING INTELLIGENCE SUCCESS: DATA CHARACTERIZATION Francine Forney, Senior Management Consultant, Fuel Consulting, LLC May 2013 ENHANCING INTELLIGENCE SUCCESS: DATA CHARACTERIZATION, Fuel Consulting, LLC May 2013 DATA AND ANALYSIS INTERACTION Understanding the content, accuracy, source, and completeness of data is critical to the

More information

Text Mining in JMP with R Andrew T. Karl, Senior Management Consultant, Adsurgo LLC Heath Rushing, Principal Consultant and Co-Founder, Adsurgo LLC

Text Mining in JMP with R Andrew T. Karl, Senior Management Consultant, Adsurgo LLC Heath Rushing, Principal Consultant and Co-Founder, Adsurgo LLC Text Mining in JMP with R Andrew T. Karl, Senior Management Consultant, Adsurgo LLC Heath Rushing, Principal Consultant and Co-Founder, Adsurgo LLC 1. Introduction A popular rule of thumb suggests that

More information

The Second International Timetabling Competition (ITC-2007): Curriculum-based Course Timetabling (Track 3)

The Second International Timetabling Competition (ITC-2007): Curriculum-based Course Timetabling (Track 3) The Second International Timetabling Competition (ITC-2007): Curriculum-based Course Timetabling (Track 3) preliminary presentation Luca Di Gaspero and Andrea Schaerf DIEGM, University of Udine via delle

More information

Semi-Supervised Learning for Blog Classification

Semi-Supervised Learning for Blog Classification Proceedings of the Twenty-Third AAAI Conference on Artificial Intelligence (2008) Semi-Supervised Learning for Blog Classification Daisuke Ikeda Department of Computational Intelligence and Systems Science,

More information

Comparing Methods to Identify Defect Reports in a Change Management Database

Comparing Methods to Identify Defect Reports in a Change Management Database Comparing Methods to Identify Defect Reports in a Change Management Database Elaine J. Weyuker, Thomas J. Ostrand AT&T Labs - Research 180 Park Avenue Florham Park, NJ 07932 (weyuker,ostrand)@research.att.com

More information

A Study to Predict No Show Probability for a Scheduled Appointment at Free Health Clinic

A Study to Predict No Show Probability for a Scheduled Appointment at Free Health Clinic A Study to Predict No Show Probability for a Scheduled Appointment at Free Health Clinic Report prepared for Brandon Slama Department of Health Management and Informatics University of Missouri, Columbia

More information

Classifying and Identifying of Threats in E-mails Using Data Mining Techniques

Classifying and Identifying of Threats in E-mails Using Data Mining Techniques Classifying and Identifying of Threats in E-mails Using Data Mining Techniques D.V. Chandra Shekar and S.Sagar Imambi Abstract E-mail has become one of the most ubiquitous methods of communication. The

More information

DATA MINING TECHNIQUES AND APPLICATIONS

DATA MINING TECHNIQUES AND APPLICATIONS DATA MINING TECHNIQUES AND APPLICATIONS Mrs. Bharati M. Ramageri, Lecturer Modern Institute of Information Technology and Research, Department of Computer Application, Yamunanagar, Nigdi Pune, Maharashtra,

More information

Information Retrieval and Web Search Engines

Information Retrieval and Web Search Engines Information Retrieval and Web Search Engines Lecture 7: Document Clustering December 10 th, 2013 Wolf-Tilo Balke and Kinda El Maarry Institut für Informationssysteme Technische Universität Braunschweig

More information

SVM Based Learning System For Information Extraction

SVM Based Learning System For Information Extraction SVM Based Learning System For Information Extraction Yaoyong Li, Kalina Bontcheva, and Hamish Cunningham Department of Computer Science, The University of Sheffield, Sheffield, S1 4DP, UK {yaoyong,kalina,hamish}@dcs.shef.ac.uk

More information

1.1 Difficulty in Fault Localization in Large-Scale Computing Systems

1.1 Difficulty in Fault Localization in Large-Scale Computing Systems Chapter 1 Introduction System failures have been one of the biggest obstacles in operating today s largescale computing systems. Fault localization, i.e., identifying direct or indirect causes of failures,

More information

Knowledge Discovery from patents using KMX Text Analytics

Knowledge Discovery from patents using KMX Text Analytics Knowledge Discovery from patents using KMX Text Analytics Dr. Anton Heijs anton.heijs@treparel.com Treparel Abstract In this white paper we discuss how the KMX technology of Treparel can help searchers

More information

OUTLIER ANALYSIS. Data Mining 1

OUTLIER ANALYSIS. Data Mining 1 OUTLIER ANALYSIS Data Mining 1 What Are Outliers? Outlier: A data object that deviates significantly from the normal objects as if it were generated by a different mechanism Ex.: Unusual credit card purchase,

More information

Data Mining for Knowledge Management. Classification

Data Mining for Knowledge Management. Classification 1 Data Mining for Knowledge Management Classification Themis Palpanas University of Trento http://disi.unitn.eu/~themis Data Mining for Knowledge Management 1 Thanks for slides to: Jiawei Han Eamonn Keogh

More information

Environmental Remote Sensing GEOG 2021

Environmental Remote Sensing GEOG 2021 Environmental Remote Sensing GEOG 2021 Lecture 4 Image classification 2 Purpose categorising data data abstraction / simplification data interpretation mapping for land cover mapping use land cover class

More information

Associate Professor, Department of CSE, Shri Vishnu Engineering College for Women, Andhra Pradesh, India 2

Associate Professor, Department of CSE, Shri Vishnu Engineering College for Women, Andhra Pradesh, India 2 Volume 6, Issue 3, March 2016 ISSN: 2277 128X International Journal of Advanced Research in Computer Science and Software Engineering Research Paper Available online at: www.ijarcsse.com Special Issue

More information

Personalization of Web Search With Protected Privacy

Personalization of Web Search With Protected Privacy Personalization of Web Search With Protected Privacy S.S DIVYA, R.RUBINI,P.EZHIL Final year, Information Technology,KarpagaVinayaga College Engineering and Technology, Kanchipuram [D.t] Final year, Information

More information

Domain Classification of Technical Terms Using the Web

Domain Classification of Technical Terms Using the Web Systems and Computers in Japan, Vol. 38, No. 14, 2007 Translated from Denshi Joho Tsushin Gakkai Ronbunshi, Vol. J89-D, No. 11, November 2006, pp. 2470 2482 Domain Classification of Technical Terms Using

More information

Model Assignment Issued July 2013 Level 4 Diploma in Business and Administration

Model Assignment Issued July 2013 Level 4 Diploma in Business and Administration Model Assignment Issued July 2013 Level 4 Diploma in Business and Administration Unit 1 Supporting Business Activities Please note: This OCR model assignment may be used to provide evidence for the unit

More information

Reconciliation Best Practice

Reconciliation Best Practice INTRODUCTION This paper provides an outline statement of what we consider to be best practice with respect to the use of reconciliation software used by asset managers. It is not a description of any particular

More information

Random Forest Based Imbalanced Data Cleaning and Classification

Random Forest Based Imbalanced Data Cleaning and Classification Random Forest Based Imbalanced Data Cleaning and Classification Jie Gu Software School of Tsinghua University, China Abstract. The given task of PAKDD 2007 data mining competition is a typical problem

More information

Fuzzy Duplicate Detection on XML Data

Fuzzy Duplicate Detection on XML Data Fuzzy Duplicate Detection on XML Data Melanie Weis Humboldt-Universität zu Berlin Unter den Linden 6, Berlin, Germany mweis@informatik.hu-berlin.de Abstract XML is popular for data exchange and data publishing

More information

How To Cluster Of Complex Systems

How To Cluster Of Complex Systems Entropy based Graph Clustering: Application to Biological and Social Networks Edward C Kenley Young-Rae Cho Department of Computer Science Baylor University Complex Systems Definition Dynamically evolving

More information

Introducing diversity among the models of multi-label classification ensemble

Introducing diversity among the models of multi-label classification ensemble Introducing diversity among the models of multi-label classification ensemble Lena Chekina, Lior Rokach and Bracha Shapira Ben-Gurion University of the Negev Dept. of Information Systems Engineering and

More information

Learning Example. Machine learning and our focus. Another Example. An example: data (loan application) The data and the goal

Learning Example. Machine learning and our focus. Another Example. An example: data (loan application) The data and the goal Learning Example Chapter 18: Learning from Examples 22c:145 An emergency room in a hospital measures 17 variables (e.g., blood pressure, age, etc) of newly admitted patients. A decision is needed: whether

More information

Evaluation & Validation: Credibility: Evaluating what has been learned

Evaluation & Validation: Credibility: Evaluating what has been learned Evaluation & Validation: Credibility: Evaluating what has been learned How predictive is a learned model? How can we evaluate a model Test the model Statistical tests Considerations in evaluating a Model

More information

DATA MINING CLUSTER ANALYSIS: BASIC CONCEPTS

DATA MINING CLUSTER ANALYSIS: BASIC CONCEPTS DATA MINING CLUSTER ANALYSIS: BASIC CONCEPTS 1 AND ALGORITHMS Chiara Renso KDD-LAB ISTI- CNR, Pisa, Italy WHAT IS CLUSTER ANALYSIS? Finding groups of objects such that the objects in a group will be similar

More information

Why is Internal Audit so Hard?

Why is Internal Audit so Hard? Why is Internal Audit so Hard? 2 2014 Why is Internal Audit so Hard? 3 2014 Why is Internal Audit so Hard? Waste Abuse Fraud 4 2014 Waves of Change 1 st Wave Personal Computers Electronic Spreadsheets

More information

INVENTORY MANAGEMENT, SERVICE LEVEL AND SAFETY STOCK

INVENTORY MANAGEMENT, SERVICE LEVEL AND SAFETY STOCK INVENTORY MANAGEMENT, SERVICE LEVEL AND SAFETY STOCK Alin Constantin RĂDĂŞANU Alexandru Ioan Cuza University, Iaşi, Romania, alin.radasanu@ropharma.ro Abstract: There are many studies that emphasize as

More information