SmartDispatch: Enabling Efficient Ticket Dispatch in an IT Service Environment
Shivali Agarwal, IBM Research, Bengaluru, India
Renuka Sindhgatta, IBM Research, Bengaluru, India
Bikram Sengupta, IBM Research, Bengaluru, India

ABSTRACT

In an IT service delivery environment, the speedy dispatch of a ticket to the correct resolution group is the crucial first step in the problem resolution process. The size and complexity of such environments make the dispatch decision challenging, and incorrect routing by a human dispatcher can lead to significant delays that degrade customer satisfaction and have adverse financial implications for both the customer and the IT vendor. In this paper, we present SmartDispatch, a learning-based tool that seeks to automate the process of ticket dispatch while maintaining high accuracy levels. SmartDispatch comes with two classification approaches - the well-known SVM method, and a discriminative term-based approach that we designed to address some of the issues we empirically observed in SVM classification. Using a combination of these approaches, SmartDispatch is able to automate the dispatch of a large share of tickets to the correct resolution group, while for the rest, it is able to suggest a short list of 3-5 groups that contains the correct resolution group with high probability. Empirical evaluation of SmartDispatch on data from 3 large service engagement projects in IBM demonstrates the efficacy and practical utility of the approach.
Categories and Subject Descriptors: H.4.0 [Information Systems Applications]: General; I.5.4 [Pattern Recognition]: Applications - Text Processing; I.5.2 [Pattern Recognition]: Design Methodology - Feature Evaluation and Selection

General Terms: Design, Experimentation

Keywords: Ticket resolution group, SVM classification, Discriminative term weighting, Automated and advisory mode dispatch

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. KDD '12, August 12-16, 2012, Beijing, China. Copyright 2012 ACM /12/08...$

1. INTRODUCTION

Information technology (IT) is now a major enabler for most businesses. The size, diversity and complexity of such systems have increased manifold over the years, prompting organizations to outsource the maintenance of their IT systems to specialized service providers (IT vendors). These vendors employ skilled practitioners and organize them into teams with the responsibility of maintaining different parts of the IT system. For example, a team may be responsible for a specific process area of a large packaged application, a set of custom applications that leverage a common technology, or the like. For systems in production use, customers submit service requests to the vendor in the form of tickets that represent a specific IT problem or need experienced by the end users (e.g. a failed transaction, user authentication expiry, data formatting issues etc.), and are generally small, atomic tasks that can be handled by a single practitioner within a short duration (e.g. a few minutes to a few days). The tickets are usually assigned to a practitioner in a two-step process.
First, incoming tickets are received by a dispatcher, who reviews the problem description text (entered by the customer or end user) to understand which service team (henceforth called resolution group) is responsible for addressing the ticket, and then dispatches it to that group of practitioners. Next, within the group, the assignment of the ticket to a specific practitioner may be done by the group lead, or a practitioner may volunteer for the same; this is usually based on criteria such as the complexity of the ticket, the expertise of the practitioner, the practitioner's availability and current workload etc. In this paper, we focus on the first step of the assignment process - namely the dispatching of a ticket to an appropriate resolution group based on the problem text description. Using text-based classification techniques applied on large-scale real-life data from IBM's services engagements, we study the extent to which this process may be automated and how practical tool support may be provided to aid the dispatcher when complete automation is not feasible. The dispatch of a ticket - as soon as it arrives - to the correct group of practitioners is a critical step in the speedy resolution of a ticket. If the dispatcher misinterprets the problem and routes the ticket to an incorrect group, then significant time may be wasted before the group reviews the problem, determines it is not in their area of responsibility, and transfers it back to the dispatcher, to begin the process anew. This calls for timely and intelligent handling of tickets by the dispatcher. However, a number of factors make the dispatcher's job challenging. First, s/he needs to have a
reasonably broad knowledge of the entire IT portfolio being managed, along with the roles and responsibilities of the individual groups. Second, s/he needs to be able to quickly parse the ticket text describing the problem and map it to the right group, which is often not straightforward given the heterogeneous and informal nature of the problem descriptions reported by human users. While a senior practitioner with a good overview of the customer's IT system may be able to discharge these responsibilities well, such practitioners are usually preoccupied with helping their colleagues resolve the more complex tickets, and the routine dispatch task (which is also labor intensive when the ticket volume is high) is often assigned to one of the less experienced practitioners. High attrition in service delivery teams may further compound the problem, as a dispatcher who has developed a certain amount of expertise over a period of time may leave the organization, and a new practitioner filling in will take time to grow into the role. Incorrect dispatch decisions made in such situations can significantly increase the total turnaround time for ticket resolution. For example, we observed in a study of an actual production support system that the average turnaround time for a ticket jumps by as much as 100% as the number of transfers increases from 2 to 4. When such delays occur, the customer's business may be severely impacted, customer satisfaction degrades, and the vendor may also need to compensate the customer by paying a penalty in case of a breach of a Service Level Agreement (SLA). Inefficiencies in dispatch may thus have serious business consequences that an IT vendor can ill afford.

1.1 SmartDispatch

The practical challenges associated with manual ticket dispatch motivated us to investigate how, and to what extent, we can automate the process of resolution group selection based on the ticket text description.
This has led to the development of a tool called SmartDispatch, which we describe and evaluate in this paper. Our goal is to have a performance baseline that is comparable to that of an expert human dispatcher, so the error rate (incorrect routings) should be low (within 10% of tickets). As our subjects of study, we selected 3 large production systems maintained by IBM (from 3 different domains and customers), where 2 of the systems had 25 resolution groups each, while the largest one had 79 resolution groups. The total number of tickets across these 3 data sets was more than 82,000. To train the SmartDispatch tool to take intelligent routing decisions, we decided to use a supervised learning approach, with 60% of the tickets in each set being used to train a classification engine, while the remaining 40% were used to test the accuracy of classification. We applied text processing on the ticket text to transform it into a weighted vector of terms, and then used the well-known Support Vector Machine (SVM) method to build the classification engine. Our experiments revealed that the accuracy of SmartDispatch with a fully automated approach ranged from 69% to 81% in the three systems, thereby implying an incorrect dispatch of 19% to 31% of the tickets. Given the business consequences of incorrect routing as noted above, we concluded that blanket automation may not be a practical solution to the dispatch problem, as it may lead to unacceptably high error rates. Next, we sought to determine the percentage of tickets we could automatically dispatch with a low error rate, by considering the confidence probabilities of classification as returned by the engine. The idea here is that if a reasonably high percentage of tickets can be correctly and automatically dispatched this way through the SmartDispatch system, it would significantly reduce the burden on an expert human dispatcher who can process the remaining tickets.
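This kind of pipeline - tf*idf features over ticket text fed to an SVM that also returns per-class confidence probabilities - can be sketched as follows. This is a minimal illustration using scikit-learn (the tool itself used the SVM implementation in SPSS, as described later); the tickets, resolution groups and vocabulary are invented examples.

```python
# Minimal sketch of tf*idf + SVM ticket classification with confidence
# probabilities. scikit-learn stands in for the SPSS SVM implementation
# used by the actual tool; all tickets and group names are invented.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import SVC

tickets = [
    "user password expired cannot login",
    "reset password for locked user account",
    "login failure after password change",
    "unlock account and reset user password",
    "password reset request for new user",
    "invoice report shows wrong totals",
    "monthly billing report fails to generate",
    "report export missing invoice columns",
    "sales report totals do not match",
    "error while generating billing report",
]
groups = ["AccessMgmt"] * 5 + ["Reporting"] * 5

# tf*idf vector representation of each ticket description.
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(tickets)

# Linear SVM; probability=True yields per-group confidence estimates
# that can later gate automatic dispatch (e.g. a 0.9 threshold).
clf = SVC(kernel="linear", probability=True, random_state=0)
clf.fit(X, groups)

new_ticket = vectorizer.transform(["password reset needed for user"])
predicted_group = clf.predict(new_ticket)[0]
confidences = dict(zip(clf.classes_, clf.predict_proba(new_ticket)[0]))
```

With real ticket volumes the confidence estimates become meaningful enough to threshold; on a toy corpus like this they only illustrate the shape of the classifier's output.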
As a threshold, we considered a confidence probability of 90%, i.e., as long as the probability of a ticket belonging to a particular group was perceived to be >=90% by the classification engine, we would route the ticket to that group. We obtained mixed results using this approach. On the positive side, we found that when the confidence probability was >=90%, the error rate of SmartDispatch was only around 4%-6%. Unfortunately, we also found that while the percentage of tickets where the engine displayed such high classification confidence was reasonably high for two data sets (55% to 62%), it was quite low (25%) for the other data set. We concluded that this was unlikely to be a useful generic approach when dispatching real-life service tickets, since the need for human involvement and judgement may continue to be very high. Given this, our next goal was to see if we could devise a classification approach that is able to consistently classify a significant percentage of tickets (at least >50%) with high confidence probability and a low error rate. The reason why we felt this may still be feasible was that, on closer review of ticket text and analysis of how SVM was handling the terms in the text, we observed that discriminative terms - words or phrases that seemed to characterize one or a small subset of resolution groups - appeared to be not particularly well-leveraged by SVM, even when we experimented with the weights assigned to these terms. This motivated us to design a classification approach - the discriminative term approach (DTA) - that assigns weights to terms using a function called inverse group frequency, which is inspired by the notion of inverse document frequency (idf) used by search engines to score and rank a document's relevance given a user query.
In this approach, we assumed the classification engine to display a high confidence probability in a resolution group if it gave it a higher score than any other group, and in such cases, the ticket is automatically dispatched to that group. Using this new approach, we obtained strong results - across the 3 data sets, the percentage of tickets that could be automatically dispatched ranged from 59% to as high as 72%, with a 100% precision rate in all cases. The final question to consider was how SmartDispatch could best handle tickets that did not have the necessary clarity for automatic dispatch to a specific group. Here again, the confidence scores returned by the classification engines came in handy. We decided to adopt a dual mode dispatch strategy wherein SmartDispatch, when unable to automatically dispatch a ticket to a group, switches to an advisory mode and forwards the ticket to a human dispatcher, with suggestions on the top N groups the ticket may likely belong to (the groups having the top N confidence scores). Alternatively, if we still desire a fully automated system, SmartDispatch could offer the tickets through a limited broadcast to the top N groups, with the expectation that the correct group will identify the ticket as its own and take ownership. Here, interestingly, the results derived from SVM-based classification outperformed those derived from our discriminative term approach. In fact, very effective dispatch performance could be achieved by leveraging the
complementary strengths of the two approaches - using the discriminative term approach for fully automated dispatch of a large percentage of tickets to the correct resolution group, and using SVM to advise on potential group(s) in case the decision was not unambiguous, with a combined error rate of well within 10%. Overall, our experience with SmartDispatch suggests that a combination of learning techniques, along with a dual mode dispatch strategy, can provide a satisfactory, near-automated solution to the problem of fast, accurate dispatch of service tickets to resolution groups. The rest of the paper is structured as follows: Section 2 describes the SmartDispatch tool design, architecture and the learning methods used in the tool; Section 3 presents evaluation results; related work is discussed in Section 4, while Section 5 concludes the paper.

2. SmartDispatch TOOL DESIGN

The basic principle underlying the SmartDispatch tool is to build a prediction model for ticket dispatch using supervised learning on historical ticket data, and then use the model to guide subsequent dispatch decisions. Past ticket data is persisted in repositories, usually within the ticketing system itself. Two key fields recorded for each ticket are Description, which is a text field that describes the problem/need that needs to be addressed, and the Resolution Group, which is the team to which the practitioner who resolved the ticket belongs. While a ticket has many other fields as well (e.g. severity, open/close timestamps, practitioner name etc.), the (Description, Resolution Group) attributes are what we use to build a prediction model for dispatch based on ticket text.
The prototypical text classification problem can be posed as done in [2]: Given a set of labelled text documents L = {(x_i, c_i)}, i = 1, ..., |L|, where c_i ∈ C = {1, 2, ..., |C|} denotes the category of document x_i, and |C| and |L| are the total numbers of predefined categories and labelled documents, learn a classifier that assigns a category label from 1 to |C| to each document in the fresh set U = {x_i}, i = 1, ..., |U|. This is a supervised learning approach, wherein it is assumed that the joint probability distribution of documents and categories is identical in sets U and L (although this may not be guaranteed in practice). Figure 1 depicts the architecture of SmartDispatch, and we outline the different components below.

2.1 Text processing

Text processing is necessary to generate a numeric form of the description that can be consumed by classification methods like Support Vector Machine (SVM). The ticket description text is transformed into a vector space model. This includes extracting nouns and verbs as terms from the ticket description and reducing the terms to their morphological root. The Stanford POS Tagger [11] is used to identify the nouns and verbs in the ticket description. Porter's stemmer [7] is further applied to the terms. Once the terms are extracted, any appropriate term weighting scheme can be applied to the ticket description to obtain a vector representation. The text processing component of our tool is generic enough to handle data from different kinds of maintenance projects. The results presented in Section 3 have been arrived at by using this generic scheme of stop words and stemming. While account-specific dictionaries may help process ticket text more intelligently, such dictionaries may not always be available or up-to-date, hence the generic nature of text processing in SmartDispatch ensures wide applicability of the tool.
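The extraction step just described can be illustrated with a small sketch. A toy part-of-speech lookup table and a crude suffix stripper stand in for the Stanford POS Tagger and the Porter stemmer that the tool actually uses; the vocabulary below is invented.

```python
# Illustrative sketch of the text-processing step: keep nouns and verbs,
# drop stop words, and reduce terms to a rough morphological root. The toy
# POS table and suffix stripper below are stand-ins for the Stanford POS
# Tagger and Porter stemmer used by the actual tool.
POS = {
    "user": "NN", "password": "NN", "report": "NN",
    "expired": "VBD", "failing": "VBG", "is": "VBZ", "has": "VBZ",
    "the": "DT",
}
STOP_WORDS = {"is", "has", "the"}

def stem(term):
    # Crude stand-in for Porter stemming: strip a few common suffixes.
    for suffix in ("ing", "ed", "s"):
        if term.endswith(suffix) and len(term) > len(suffix) + 2:
            return term[:-len(suffix)]
    return term

def extract_terms(description):
    tokens = description.lower().split()
    # Keep only nouns (NN*) and verbs (VB*) that are not stop words.
    kept = [t for t in tokens
            if POS.get(t, "").startswith(("NN", "VB")) and t not in STOP_WORDS]
    return [stem(t) for t in kept]

terms = extract_terms("the user password has expired")
```

For instance, "the user password has expired" reduces to the term list ["user", "password", "expir"]; a real stemmer would produce comparable (if better-formed) roots.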
2.2 Generating the Supervised Learning Model

There exist a number of supervised learning techniques that may be used to learn a classification model that uses the ticket text description to predict the resolution group. The SmartDispatch tool comes with the well-known classification technique of Support Vector Machine, and also employs another learning technique based on discriminative terms that has been developed in house. Both of these approaches are described in detail below with appropriate insights and motivation.

2.2.1 Standard Classifier Approach

We have chosen Support Vector Machine (SVM) as the standard classification engine in SmartDispatch based on existing literature [1] that puts it above the rest of the techniques for unstructured text. The choice was made after verifying this result with the sample data that we had. The tf*idf term weighting scheme was applied to each ticket description to obtain a vector representation for SVM. The SVM approach was found to be more robust than other techniques such as decision trees, Bayesian networks, K-means etc., as it performed reasonably well for all data samples and produced the best results in most cases. We also found the notion of confidence intervals reliable in the case of SVMs, and such intervals are the basis for designing the advisor module of our tool. The term standard classifier approach will be used interchangeably with SVM-based approach in the remainder of the paper. We have used the SVM implementation in SPSS [10], an industry standard statistical analysis package, within SmartDispatch. However, one of the issues with the SVM approach that we observed was a tendency toward biased prediction in favor of groups having a large volume of tickets. For example, if groups A and B have a significantly higher number of tickets as compared to the rest of the groups (which also have a sufficient number of tickets to learn the model), then the mis-classified tickets of the rest of the groups were often found to be labelled as A or B.
We tried using different forms of term weight functions, but with little improvement. Upon case-based inspection of mis-classified tickets, we figured out that SVM was, in many cases, not able to effectively exploit the discriminatory terms present in those tickets. This led to the design of a new approach based on discriminative terms, described below. Such an approach was envisioned to complement the SVM approach.

2.2.2 Discriminative Term Approach

While trying to use ticket descriptions for group prediction by exploiting the discriminative content that they possess, we developed a learning approach based on discriminative terms or keywords. The motivation for this approach stems from the fact that a resolution group typically has a set of discriminative terms related to the IT sub-system they manage, which are likely to appear in the tickets that come to them for resolution. If we identify the terms that correspond to each group by mining the descriptions associated with that group, then we can assign appropriate weights to all the terms in the entire data set to signify their extent of uniqueness with respect to groups. Using this as a model, the terms in a new ticket can be analyzed to arrive at a
[Figure 1: SmartDispatch Architecture Diagram]

group closeness score, thus predicting the group to which the ticket should be dispatched. For example, a term like error that, say, occurs in 8 out of 10 groups is not discriminative as compared to a term like password that may occur in only 2 out of 10 groups. Any ticket that contains the term password will have a high closeness score with respect to those groups. The steps involved in this approach consist of discriminative term weighting, term selection and association, and classifier function definition. We outline these below. Discriminative term weighting: We have defined a weight function called inverse group frequency (igf). This function abstracts away the role of the number of documents (tickets) involved in arriving at the weight for a term and uses only the group cardinality. This ensures that the function performs equally well on small as well as large training datasets, as long as the distribution of the sample is the same as the actual data. The formal definition is given below. Let T be the set of terms in the dataset. Let w_t denote the discriminative weight of the term t ∈ T. Let N denote the set of all groups in the data and let N_t denote the set of groups in which t occurs. The discriminative weight is given by the formula: w_t = |N| / |N_t|. The value of w_t is what we refer to as the igf of the term, as it is analogous in concept to idf. The higher the igf value of a term, the more discriminative the term. The value of igf is computed for all the terms in the document space. If needed, a threshold can be used to restrict the number of terms in the space by allowing only those terms whose frequency of occurrence is above the threshold. Term selection and association: Each resolution group needs to be associated with a set of terms that best describe it.
To do this, we first find all the terms associated with a group through text processing of the ticket descriptions of all tickets belonging to this group. Next, we perform term selection for each group by discarding all terms that have an igf value below a certain predefined threshold value. The threshold value can be easily arrived at by setting some lenient value upfront and then refining it over a period of time. Each group is thereby associated with a set of terms that meet the threshold criteria. Formally, we use G to denote the set of groups and T_g to denote the terms that have been selected for the group g ∈ G. Classifier function: This is the step that pools all information about groups and their terms and uses it for prediction. The igf value of each term belonging to a group gives us an idea of how discriminating that term is for the group. This information is pooled to obtain a closeness score of a ticket with respect to each group. For each ticket, the terms in the ticket are used to compute a closeness score with respect to each group, using the well-defined linear function given below. The group(s) with the highest score is(are) the group(s) to which the ticket is predicted to belong. More formally, let x denote a ticket description consisting of a set of terms T_x, and let Val_x(g) denote the score of ticket x with respect to group g ∈ G, computed using the linear discriminant function as below: Val_x(g) = Σ_{t ∈ T_g ∩ T_x} w_t. The use of the discriminant function reduces the |T|-dimensional term space to an N = |G|-dimensional feature space, where Val_x(g), g ∈ G, gives the coordinates. The classifier function for the ticket description x can be simply written as follows: ĝ(x) = arg max_{g ∈ G} Val_x(g), such that ĝ(x) denotes the group that is obtained by solving the straightforward optimization problem posed above using a simple iterative technique. Note that there can be multiple groups with the same score.
So, it is possible that for certain tickets, there is no total order on the group ranks based on score.
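The three steps just described - igf weighting, term selection against a threshold, and the closeness-score classifier - can be sketched end to end. The groups, tickets, threshold value and the use of plain Python sets are illustrative assumptions, not the tool's actual implementation.

```python
# Sketch of the discriminative term approach (DTA): igf weighting, term
# selection against a threshold, and the closeness-score classifier.
# Tickets are assumed to be pre-processed into term sets; groups, terms
# and the threshold value below are invented examples.
from collections import defaultdict

group_tickets = {
    "AccessMgmt": [{"password", "login"}, {"password", "reset"}],
    "Reporting":  [{"report", "error"}, {"report", "total"}],
    "Network":    [{"vpn", "error"}, {"vpn", "timeout"}],
}

# igf(t) = |N| / |N_t|: total number of groups over the number of
# groups in which term t occurs.
n_groups = len(group_tickets)
groups_with_term = defaultdict(set)
for group, tickets in group_tickets.items():
    for terms in tickets:
        for t in terms:
            groups_with_term[t].add(group)
igf = {t: n_groups / len(gs) for t, gs in groups_with_term.items()}

# Term selection: per group, keep only terms whose igf meets the
# threshold (1 is the most liberal possible setting).
THRESHOLD = 1.0
selected = {g: {t for ts in tks for t in ts if igf[t] >= THRESHOLD}
            for g, tks in group_tickets.items()}

def classify(ticket_terms):
    # Val_x(g) = sum of igf weights of terms shared with group g;
    # return every group tied at the top score (auto-dispatch only
    # happens when this list has exactly one entry).
    scores = {g: sum(igf[t] for t in terms & ticket_terms)
              for g, terms in selected.items()}
    top = max(scores.values())
    return [g for g, s in scores.items() if s == top]
```

Here classify({"password", "reset"}) yields a single clear winner, while a ticket containing only the non-discriminative term "error" ties two groups and would fall through to advisory mode.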
This approach works best if descriptions have a discriminative nature for distinct groups. The prediction model is very simple in this case, and once the discriminative terms enter the system, the model stabilizes very fast.

2.3 Dispatch Decision Maker

An incoming ticket for which a dispatch decision has to be made is first classified (mapped to resolution group(s)) using one of the supervised learning techniques described above. If there is sufficient confidence in the classification result, then the Dispatch Decision Maker will dispatch the ticket to that group. For SVM-based classification, we exercise this Auto Dispatch Policy if a ticket is classified to a group by SVM with a confidence probability of 0.9 or above. We chose this threshold based on the observation that the error rate of prediction by SVM is very low when the confidence probability is 0.9 or above, as we discuss in more detail during the evaluation of SmartDispatch in the next section. For the discriminative term approach, where multiple groups may hold the top position based on having the same score through the discriminant function, we exercise the auto dispatch policy if and only if there is a clear winner, i.e., there is only a single group in the top position. There will be cases, however, when neither classification approach returns a clear winner. In such situations, SmartDispatch resorts to an Advisory Dispatch Policy. Here, the tool selects a small subset of groups that may be expected to contain the correct resolution group with reasonably high confidence. If an expert human dispatcher is available, the tool forwards the ticket to him/her with an advisory that contains the identities of these selected groups. The expectation here is that the expert dispatcher will be able to take a well-judged dispatch decision, and if the advisory of the tool contains the correct group, it would have done its job well.
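Under these policies, the decision logic reduces to a small routing function. The following is a sketch of the behaviour described above, assuming the per-group confidence probabilities come from the SVM engine; the function and constant names are our own.

```python
# Sketch of the dual-mode Dispatch Decision Maker: Auto Dispatch Policy
# when the top confidence reaches the threshold, Advisory Dispatch
# Policy (a short ranked shortlist) otherwise. Names are illustrative.
CONFIDENCE_THRESHOLD = 0.9   # auto-dispatch cutoff described above
ADVISORY_LIST_SIZE = 5       # advisory lists hold 1 to 5 groups

def decide(group_probabilities):
    """Map {resolution group: confidence} to a dispatch decision."""
    ranked = sorted(group_probabilities.items(),
                    key=lambda kv: kv[1], reverse=True)
    best_group, best_prob = ranked[0]
    if best_prob >= CONFIDENCE_THRESHOLD:
        # Auto Dispatch Policy: route directly to the top group.
        return ("auto", [best_group])
    # Advisory Dispatch Policy: suggest up to N groups, in decreasing
    # order of confidence, for a human dispatcher or limited broadcast.
    return ("advisory", [g for g, _ in ranked[:ADVISORY_LIST_SIZE]])
```

For example, decide({"A": 0.95, "B": 0.03, "C": 0.02}) routes automatically to group A, while a less confident distribution falls through to a ranked advisory shortlist.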
If an expert dispatcher is not available, and/or if full automation is desired, then SmartDispatch may forward the ticket to each of the selected groups through a limited broadcast, with the assumption that the correct group (if present in the selected list) will take ownership of the ticket and resolve it, while the others will ignore the ticket. If the correct group is not included in the shortlist compiled by SmartDispatch, we consider it an error in its prediction (thus we slightly relax the notion of correct prediction in advisory mode by allowing a small number of other groups to appear in the shortlist, along with the correct resolution group). In the case of SVM-based classification, we exercise the advisory dispatch policy for predictions with confidence probability less than 0.9, and generate a small ranked list of between 1 and 5 groups, in decreasing order of confidence probabilities. For the discriminative term based approach, when there are multiple groups having the highest score, all the groups that share the same highest score are suggested as possible groups in the advisory mode.

2.4 Feedback Mechanism

The new tickets that have been dispatched by the system may have their resolution group updated in case of an incorrect dispatch. Other details (e.g. closure timestamps, resolving practitioner etc.) would also get updated, and upon closure, the final ticket details are persisted within the ticket repository. The learning model should be regenerated at frequent intervals in order to leverage the new data available and improve the prediction capabilities.

3. EVALUATION OF THE TOOL

We have evaluated the SmartDispatch tool on three ticket data sets, derived from three different ongoing services engagements involving IBM. To preserve client confidentiality, these data sets will henceforth be denoted by Dataset A, Dataset B and Dataset C.
In order to test the wide applicability of the tool, the datasets were derived from three different domains: Media and Entertainment (A), Automotive (B) and Consumer Goods (C). The number of tickets in each data set, the number of resolution groups, and the duration of the dataset (derived from the timestamps of the constituent tickets) are shown in Figure 2. As can be seen, each data set covers a reasonably wide time window (with a minimum of two months) and has a significantly large number of tickets. Dataset C, of course, is by far the largest, although it spans the shortest duration, since it comes from an account with very high ticket inflow. To build and test the prediction models, each data set was split into a training set and a testing set using a 60:40 ratio.

[Figure 2: Dataset Overview]

We will now discuss the evaluation results. The results will be presented in the sequence in which the tool was incrementally developed, as outlined in the Introduction. We will thus begin with SVM as the sole classification engine used for automated dispatch of all tickets (Section 3.1); next, we will consider automated dispatch of tickets having high prediction confidence, and incorporate the discriminative term approach to see if the volume of such tickets may be increased (Section 3.2); we will then evaluate the tool in the advisory dispatch mode for tickets that could not be automatically dispatched to a single group (Section 3.3); finally, we will consider a heterogeneous dispatch strategy leveraging DTA for automatic dispatch, and SVM for the advisory or limited broadcast mode (Section 3.4).

3.1 Fully automated approach using SVM

The very first set of experiments was carried out to evaluate the performance of fully automated dispatch of all tickets based on supervised learning using SVM. The results of running the SVM model on the test data are shown in Table 1.
Table 1: Full Auto-dispatch with SVM
Dataset | SVM precision (%) | % Misrouted
A | 81% | 19%
B | 69% | 31%
C | 73% | 27%

As can be seen, the percentage of misrouted tickets is quite high in
the 3 data sets, particularly for Dataset B and Dataset C. The error rates are considerably higher than our target performance baseline of 10% (as mentioned in Section 1.1), and this suggested to us that full automation may not be practical. This led us to consider dispatch based on confidence scores, as we discuss next.

3.2 Auto-dispatch based on confidence scores

The ticket classification results from SVM contain confidence probabilities associated with each resolution group. We wanted to test if we could improve the prediction accuracy of automated dispatch by only selecting tickets where there was a high confidence probability in a specific group, and if so, whether we could dispatch a significantly large share of tickets this way, so as to reduce the burden on a human dispatcher. Our experiments with SVM suggested that as the confidence probability increased, error rates went down significantly. This is depicted for all three data sets in Figures 3, 4 and 5. The left chart in each figure depicts the percentage of tickets that fall in the confidence intervals of 0-0.1, 0.1-0.2, ..., 0.9-1.0, where each bar depicts an interval. The chart on the right side shows the split of the percentage of correct and incorrect predictions for each of these confidence intervals. For example, Figure 3 for Dataset A shows that about 55% of tickets occur in the range of 0.9 and above, out of which 4% are incorrect and the rest are correct predictions. This implies that 55% of tickets can be dispatched automatically with 96% precision. The results of the auto dispatch policy for a confidence level of 0.9 and above using SVM are shown in Table 2, columns two and three. We see that the error rates are low for all three data sets, ranging from only 4% to 6%. However, the percentage of tickets that can be routed this way with high confidence probability is not consistently high, and is, in fact, as low as 25% for Dataset B.
This suggested to us that confidence-based routing using SVM, with thresholds that lead to low error rates, carries the risk of passing on a significant share of tickets to a human dispatcher for manual routing. As mentioned in Section 2.2.1, we also observed that SVM was often unable to effectively exploit discriminatory terms in ticket text, and we hypothesized that a classification approach that leverages discriminatory terms well may also be able to classify a higher number of tickets with sufficient confidence. This led us to design and incorporate the Discriminative Term Approach (DTA) within SmartDispatch.

Table 2: Auto-dispatch mode based on confidence probability
Dataset | %tickets SVM | SVM precision | %tickets DTA | DTA precision
A | 55% | 96% | 70% | 100%
B | 25% | 94% | 72% | 100%
C | 62% | 96% | 59% | 100%

The results for DTA are presented in Table 2. The tickets for which only one group obtained the highest score were the ones eligible for auto-dispatch, and these percentage values are shown in column four of Table 2 for the three data sets. It is noteworthy that the accuracy of auto-dispatch was 100% for all datasets. The percentage of tickets that could be auto-dispatched ranged from a low of 59% to a high of 72%. The results are significantly better than SVM for Dataset A and Dataset B, and only marginally worse for Dataset C (though still better in terms of error rates). The threshold value for igf was set to 1 for all datasets. This is the most liberal setting, and thus the results are obtained for the most generic case. The evaluation indicates that DTA is a very good choice for auto dispatch of tickets. The relatively poor performance on Dataset C was due to a large number of tickets having terms with low igf values. We believe that this problem can possibly be resolved by introducing a notion of a discriminating phrase (consisting of a sequence of terms) in the learning method, which we plan to do in future work.
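The Top-N style of measurement used in the advisory-mode evaluation of Section 3.3 can be made precise: it is the fraction of tickets whose correct group appears among the first N suggested groups. A minimal sketch, with invented ranked lists and labels:

```python
# Sketch of the Top-N inclusion rate used to evaluate advisory mode:
# the share of tickets whose correct resolution group appears among
# the first N suggested groups. All data below is invented.
def top_n_rate(ranked_lists, correct_groups, n):
    hits = sum(1 for ranked, correct in zip(ranked_lists, correct_groups)
               if correct in ranked[:n])
    return hits / len(correct_groups)

ranked = [["A", "B", "C"], ["B", "A", "D"], ["C", "D", "A"]]
correct = ["A", "D", "B"]
top3 = top_n_rate(ranked, correct, 3)  # correct group in top 3 for 2 of 3
```

In the paper's tables, these rates are reported per dataset at N = 3 and N = 5, on the subset of tickets that were ineligible for auto-dispatch.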
Next, we consider the handling of tickets that are ineligible for auto-dispatch. The advisory mode designed for this set is evaluated next.

3.3 Advisory mode evaluation

Table 3 presents evaluation results when the tool operates in advisory mode using the SVM approach. These are obtained on the tickets that did not meet the criterion for auto-dispatch. The entry Top 3 in this table shows the percentage of times the correct group occurred in the top 3 predicted groups, ranked by confidence probability. Similarly, Top 5 indicates the percentage of tickets for which the correct group did not figure in the top 3 but was fourth or fifth. The other columns are defined analogously. Thus, for Dataset A, the correct resolution group occurred in the top 3 of the ranked predicted groups 86.8% of the time (split as 58.2% at rank one and 28.6% at ranks two or three), and occurred in the top 5 (86.8 + 5.63)%, or 92.43%, of the time. This mode performs very well for Dataset B as well, although for Dataset C (which has 79 resolution groups) the results are less satisfactory in absolute terms, with the correct group occurring in the top 5 in 81% of cases. Overall, it can be concluded that the SVM approach is reasonably reliable for advisory mode. Note that SVM can predict the correct group at rank one even with a low absolute confidence score. As far as the advisory mode for DTA is concerned, all tickets for which the top score is awarded to two or more groups are dispatched in advisory mode. Table 4 presents the results of this approach in advisory mode. Here again, only tickets that were not eligible for auto-dispatch are considered. The label Top 3 denotes the percentage of tickets in this set for which 2 or 3 groups had the highest score; similarly, Top 5 denotes those for which 4 or 5 groups had the top score. The rest of the labels are analogous. It can be seen that the precision falls off very quickly.
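The Top-k bucketing used in the advisory-mode tables can be sketched as a small helper: rank the groups by confidence and report the first bucket that contains the correct group's rank (so "Top 5" means not in the top 3 but fourth or fifth, matching the column definitions above). This is an illustrative evaluation helper, not the tool's code.

```python
def rank_bucket(probs, correct_group):
    """probs: {group: confidence}. Returns the disjoint Top-k bucket
    in which the correct group's rank falls."""
    ranked = sorted(probs, key=probs.get, reverse=True)
    pos = ranked.index(correct_group) + 1  # 1-based rank
    for label, cutoff in (("Top 3", 3), ("Top 5", 5), ("Top 10", 10),
                          ("Top 15", 15), ("Top 20", 20), ("Top 25", 25)):
        if pos <= cutoff:
            return label
    return ">30"

probs = {"A": 0.5, "B": 0.3, "C": 0.15, "D": 0.05}
print(rank_bucket(probs, "B"))  # Top 3
print(rank_bucket(probs, "D"))  # Top 5
```

Tallying these buckets over the held-out tickets that failed the auto-dispatch criterion yields rows like those in Tables 3 and 4.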
For example, for Dataset C, the correct resolution group appears in the top 3 in only 31.7% of cases, limiting the usefulness of the advisory mode in the DTA approach.

3.4 Combining SVM and DTA approaches

The results in the preceding sections suggest that the SVM and DTA approaches have complementary strengths. On the percentage of tickets that can be automatically dispatched to the correct group with high confidence (low error rates), DTA generally outperforms SVM, on the strength of distinguishing such groups better using discriminative terms, with a perfect precision record. On the other hand, for tickets where such a distinction cannot
Figure 3: Distribution of confidence probabilities for Dataset A and corresponding true vs. false predictions
Figure 4: Distribution of confidence probabilities for Dataset B and corresponding true vs. false predictions
Figure 5: Distribution of confidence probabilities for Dataset C and corresponding true vs. false predictions
Table 3: Performance of SVM approach in advisory mode on the respective ticket set that it found ineligible for auto-dispatch

Dataset   Top 3    Top 5    Top 10   Top 15   Top 20   Top 25   >30
A         86.8%    5.63%    4.76%    1.5%     1.09%    0.25%    0%
B         88%      5.26%    3.98%    1.72%    0.74%    0.3%     0%
C         76.4%    4.68%    6.45%    3.6%     2.36%    1.48%    5.03%

Table 4: Performance of DTA in advisory mode on the respective ticket set that it found ineligible for auto-dispatch

Dataset   Top 3    Top 5    Top 10   Top 15   Top 20   Top 25   >30
A         50%      26%      20%      2%       2%       0%       0%
B         39.2%    21.4%    25%      3.6%     3.6%     7.2%     0%
C         31.7%    12.2%    17.1%    9.7%     7.4%     9.7%     12.2%

be made unambiguously, and the dispatch has to switch to an advisory mode, SVM's performance is clearly much more robust. This motivated us to experiment with a heterogeneous approach that taps the respective strengths of both techniques within a single dispatch decision: for a new ticket, we use DTA for auto-dispatch when a single resolution group receives the highest score; otherwise, we handle it using SVM's advisory dispatch mode. The results for this combined approach are presented in Table 5. The percentage of tickets that are automatically dispatched through DTA is shown in column two, and is derived from the corresponding results presented earlier in Table 2. For each data set, the remaining tickets are dispatched in advisory (or limited broadcast) mode using SVM. Even when the advice (or broadcast) is limited to a single group (the one with the highest confidence probability), the error rate is very low, varying between 4% and 9%, and it comes down further, to between 2% and 6%, when the top 3 resolution groups (by confidence score) returned by SVM are used. Compared to Table 1, the percentage of misrouted tickets has come down substantially, which shows the effectiveness of the dual-mode heterogeneous SmartDispatch approach over blanket automation using a single learning technique such as SVM.
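The combined policy just described can be sketched as a single dispatch function: try DTA first, and fall back to SVM's ranked advisory list when DTA does not produce a unique top-scoring group. The two model callbacks (`dta_top_group`, `svm_probabilities`) and the stub data below are hypothetical stand-ins for the trained classifiers.

```python
def smart_dispatch(ticket, dta_top_group, svm_probabilities, k=3):
    """Combined policy: DTA auto-dispatch if it yields a unique winner,
    otherwise SVM advisory mode returning the top-k groups."""
    group = dta_top_group(ticket)          # None unless a unique winner exists
    if group is not None:
        return ("auto", [group])
    probs = svm_probabilities(ticket)
    ranked = sorted(probs, key=probs.get, reverse=True)
    return ("advisory", ranked[:k])

# Stub models for illustration only.
dta = lambda t: "DB-Support" if "oracle" in t else None
svm = lambda t: {"Network": 0.6, "DB-Support": 0.3, "Storage": 0.1}

print(smart_dispatch("oracle down", dta, svm))
# ('auto', ['DB-Support'])
print(smart_dispatch("link flapping", dta, svm))
# ('advisory', ['Network', 'DB-Support', 'Storage'])
```

The design keeps the two models decoupled: DTA decides only the auto-versus-advisory split, and SVM's ranking is consulted only on the residual tickets, which is exactly the split evaluated in Table 5.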
The complementary capabilities of SVM and DTA thus make them a good package to have in the dispatch tool. While they can be used together in the manner outlined above, we can also run both techniques on historical data from an engagement and compare their precision on test data; the technique that automates a higher percentage of ticket dispatch with low error rates may then be chosen. This provides more flexibility in handling the differing nature of ticket descriptions across engagements: for example, while it makes more sense to adopt the DTA approach for deciding auto-dispatch on data sets A and B, the SVM-based approach may also be adopted for Dataset C (where the overall performance of the two approaches is nearly at par). Besides the superior dispatch performance that results from this combination, the DTA option also provides scalability in learning, since it can build a model very quickly even on a huge training set. At the same time, it can handle learning with limited training data under the assumption of identical distributions between training and actual data; this makes it very useful when a new resolution group is added but the ticket volume is not yet high enough to train an SVM-based model.

4. RELATED WORK

Several researchers have studied different aspects of the problem of routing tickets to practitioners [8], [9], [6]. The work in [9] approaches the problem by mining resolution sequence data and does not access the ticket description at all, making it completely different from our approach; its objective is to produce ticket transfer recommendations given the initial assignment information. The work in [8] mines historical ticket data and develops a probabilistic model of an enterprise social network that represents the functional relationships among various expert groups in ticket routing. Based on this network, the system then provides routing recommendations for new tickets.
This work also focuses on ticket transfers between groups (given an initial assignment), like [9], without looking at the ticket text content. The work in [6] is different, approaching the problem from a queueing perspective; this relates more to the issue of service times and becomes particularly relevant when a ticket that has been dispatched to a group needs to be assigned to an agent. Some papers apply text classification techniques to handle tickets. [4] is close to our work: its objective is to automatically classify tickets based on their descriptions in order to route them to the right group. However, that work was applied to a small ticket set with only 8 groups, and its best-case accuracy of 84% was not acceptable for the kind of reliable automation we sought in our tool. The work in [3] attempts to classify incoming change requests into one of the fine-grained activities in a catalog by leveraging aggregated information associated with the change, such as past failure reasons or best implementation practices. They use information retrieval and machine learning techniques to match change tickets to the most suitable activity, and suggest the top 5 activity groups to the user as output without automating the assignment. Similar to our work, [2] shows the limitations of SVM-like techniques in terms of scalability and proposes a notion of a discriminative keyword approach. However, we differ substantially in the definition of our discriminative term weighting function and in our ability to handle a higher-dimensional feature space, as well as in the application domain, with [2] being more focused on commonly used text classification data sets such as personalized spam filtering and movie reviews, rather than service tickets.

Table 5: Combined SVM and DTA evaluation

Dataset   % Automated (DTA)   % Top 1 Advisory (SVM)   % Top 1 Advisory misrouted   % Top 3 Advisory (SVM)   % Top 3 Advisory misrouted
A         70%                 26%                      4%                           28%                      2%
B         72%                 21%                      7%                           22%                      6%
C         59%                 32%                      9%                           37%                      6%

There has also been a large body of work on comparative analysis of various machine learning algorithms such as SVM and decision trees. We used the conclusions of [1] in choosing SVM as the algorithm for our analysis of ticket descriptions. We also carried out a preliminary comparison of the precision and recall values of SVM, BayesNet and C&RT, and found SVM to be more robust and precise. We further referred to [5] to help decide what may be considered a quality score for descriptions.

5. CONCLUSIONS

Ticket dispatch plays an important role in determining the turnaround time of a ticket, because misrouting can introduce significant delays. We have proposed a tool called SmartDispatch for the efficient dispatch of tickets in an IT service environment, using supervised learning techniques to review ticket descriptions and predict the most appropriate resolution group. The tool uses a combination of the standard SVM classification algorithm and a new discriminative-term-based heuristic to carry out the dispatch, and offers both automated and advisory dispatch capabilities. Empirical evaluation of the tool on large ticket data sets from real-life services engagements at IBM demonstrates the efficacy of the approach. As part of future work, we plan to incorporate the notion of discriminative phrases, to handle descriptions in which no single term is discriminative but a combination of terms, that is, a phrase, is unique to the group.

6.
REFERENCES

[1] Shantanu Godbole and Shourya Roy. Text classification, business intelligence, and interactivity: automating C-Sat analysis for services industry. In Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD '08, New York, NY, USA. ACM.
[2] K. N. Junejo and A. Karim. A robust discriminative term weighting based linear discriminant method for text classification. In Proceedings of the Eighth IEEE International Conference on Data Mining, ICDM '08. IEEE.
[3] Cristina Kadar, Dorothea Wiesmann, Jose Iria, Dirk Husemann, and Mario Lucic. Automatic classification of change requests for improved IT service quality. In Proceedings of the 2011 Annual SRII Global Conference, SRII '11, Washington, DC, USA. IEEE Computer Society.
[4] G. di Lucca. An approach to classify software maintenance requests. In Proceedings of the International Conference on Software Maintenance, ICSM '02, Washington, DC, USA. IEEE Computer Society.
[5] Debapriyo Majumdar, Rose Catherine, Shajith Ikbal, and Karthik Visweswariah. Privacy protected knowledge management in services with emphasis on quality data. In Proceedings of the 20th ACM International Conference on Information and Knowledge Management, CIKM '11, New York, NY, USA. ACM.
[6] Hoda Parvin, Abhijit Bose, and Mark P. Van Oyen. Priority-based routing with strict deadlines and server flexibility under uncertainty. In Winter Simulation Conference, WSC '09. Winter Simulation Conference.
[7] M. F. Porter. An algorithm for suffix stripping. In Readings in Information Retrieval. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA.
[8] Qihong Shao, Yi Chen, Shu Tao, et al. EasyTicket: a ticket routing recommendation engine for enterprise problem resolution. Proc. VLDB Endow., 1, August.
[9] Qihong Shao, Yi Chen, Shu Tao, Xifeng Yan, and Nikos Anerousis. Efficient ticket routing by resolution sequence mining.
In Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD '08, New York, NY, USA. ACM.
[10] IBM SPSS.
[11] Kristina Toutanova, Dan Klein, Christopher D. Manning, and Yoram Singer. Feature-rich part-of-speech tagging with a cyclic dependency network. In Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology - Volume 1, NAACL '03, Stroudsburg, PA, USA. Association for Computational Linguistics.
International Journal of Research Studies in Computer Science and Engineering (IJRSCSE) Volume 1, Issue 8, December 2014, PP 35-40 ISSN 2349-4840 (Print) & ISSN 2349-4859 (Online) www.arcjournals.org Email
The Data Mining Process
Sequence for Determining Necessary Data. Wrong: Catalog everything you have, and decide what data is important. Right: Work backward from the solution, define the problem explicitly, and map out the data
Gerard Mc Nulty Systems Optimisation Ltd [email protected]/0876697867 BA.,B.A.I.,C.Eng.,F.I.E.I
Gerard Mc Nulty Systems Optimisation Ltd [email protected]/0876697867 BA.,B.A.I.,C.Eng.,F.I.E.I Data is Important because it: Helps in Corporate Aims Basis of Business Decisions Engineering Decisions Energy
Towards better accuracy for Spam predictions
Towards better accuracy for Spam predictions Chengyan Zhao Department of Computer Science University of Toronto Toronto, Ontario, Canada M5S 2E4 [email protected] Abstract Spam identification is crucial
Detecting Email Spam. MGS 8040, Data Mining. Audrey Gies Matt Labbe Tatiana Restrepo
Detecting Email Spam MGS 8040, Data Mining Audrey Gies Matt Labbe Tatiana Restrepo 5 December 2011 INTRODUCTION This report describes a model that may be used to improve likelihood of recognizing undesirable
Financial Trading System using Combination of Textual and Numerical Data
Financial Trading System using Combination of Textual and Numerical Data Shital N. Dange Computer Science Department, Walchand Institute of Rajesh V. Argiddi Assistant Prof. Computer Science Department,
Predictive Coding Defensibility and the Transparent Predictive Coding Workflow
WHITE PAPER: PREDICTIVE CODING DEFENSIBILITY........................................ Predictive Coding Defensibility and the Transparent Predictive Coding Workflow Who should read this paper Predictive
Distance Metric Learning in Data Mining (Part I) Fei Wang and Jimeng Sun IBM TJ Watson Research Center
Distance Metric Learning in Data Mining (Part I) Fei Wang and Jimeng Sun IBM TJ Watson Research Center 1 Outline Part I - Applications Motivation and Introduction Patient similarity application Part II
Semi-Supervised Learning for Blog Classification
Proceedings of the Twenty-Third AAAI Conference on Artificial Intelligence (2008) Semi-Supervised Learning for Blog Classification Daisuke Ikeda Department of Computational Intelligence and Systems Science,
Comparison of Non-linear Dimensionality Reduction Techniques for Classification with Gene Expression Microarray Data
CMPE 59H Comparison of Non-linear Dimensionality Reduction Techniques for Classification with Gene Expression Microarray Data Term Project Report Fatma Güney, Kübra Kalkan 1/15/2013 Keywords: Non-linear
Predictive Modeling for Collections of Accounts Receivable Sai Zeng IBM T.J. Watson Research Center Hawthorne, NY, 10523. Abstract
Paper Submission for ACM SIGKDD Workshop on Domain Driven Data Mining (DDDM2007) Predictive Modeling for Collections of Accounts Receivable Sai Zeng [email protected] Prem Melville Yorktown Heights, NY,
CHAPTER 1 INTRODUCTION
CHAPTER 1 INTRODUCTION 1.1 Background The command over cloud computing infrastructure is increasing with the growing demands of IT infrastructure during the changed business scenario of the 21 st Century.
Active Learning SVM for Blogs recommendation
Active Learning SVM for Blogs recommendation Xin Guan Computer Science, George Mason University Ⅰ.Introduction In the DH Now website, they try to review a big amount of blogs and articles and find the
Predictive Coding Defensibility and the Transparent Predictive Coding Workflow
Predictive Coding Defensibility and the Transparent Predictive Coding Workflow Who should read this paper Predictive coding is one of the most promising technologies to reduce the high cost of review by
A General Approach to Incorporate Data Quality Matrices into Data Mining Algorithms
A General Approach to Incorporate Data Quality Matrices into Data Mining Algorithms Ian Davidson 1st author's affiliation 1st line of address 2nd line of address Telephone number, incl country code 1st
Classification and Prediction
Classification and Prediction Slides for Data Mining: Concepts and Techniques Chapter 7 Jiawei Han and Micheline Kamber Intelligent Database Systems Research Lab School of Computing Science Simon Fraser
Creating Synthetic Temporal Document Collections for Web Archive Benchmarking
Creating Synthetic Temporal Document Collections for Web Archive Benchmarking Kjetil Nørvåg and Albert Overskeid Nybø Norwegian University of Science and Technology 7491 Trondheim, Norway Abstract. In
Effective Data Retrieval Mechanism Using AML within the Web Based Join Framework
Effective Data Retrieval Mechanism Using AML within the Web Based Join Framework Usha Nandini D 1, Anish Gracias J 2 1 [email protected] 2 [email protected] Abstract A vast amount of assorted
