Search Engine Based Intelligent Help Desk System: iassist




Sahil K. Shah, Prof. Sheetal A. Takale
Information Technology Department, VPCOE, Baramati, Maharashtra, India
sahilshahwnr@gmail.com, sheetaltakale@gmail.com

Abstract: An intelligent help desk system is a need of every individual. Many organizations use case-based help desk systems, but maintaining an up-to-date case history for each and every problem is difficult and costly. Search engines already perform the task of intelligent help for all Internet users; however, for a given keyword query, current web search engines return a list of individual web pages, and the information relevant to the query is often spread across multiple pages, which degrades the quality of the search results. To address these challenges, an intelligent help desk system, iassist, has been developed. It uses search engine results as the case history for the user query. The semantic relevance of the search results to the user query is computed using NEC SENNA and WordNet, and the proposed system ranks the search results by this semantic relevance. The relevant results are grouped into clusters using the MDL principle and symmetric matrix factorization, and each cluster is summarized to generate recommended solutions. For performance analysis the system was tested through a user survey. The experiments demonstrate the effectiveness of iassist in semantic text understanding, document clustering, and summarization; its performance benefits from sentence-level semantic analysis, clustering with the MDL principle, and SNMF.

Keywords: Intelligent Helpdesk, Semantic Similarity, Web Search Result Summarization, Document Summarization

I. INTRODUCTION

An intelligent help desk system is a need of every individual, and many organizations use case-based help desk systems to improve the quality of customer service. For a given customer request, an intelligent helpdesk system tries to find earlier similar requests and the case history associated with them. Helpdesk systems usually use databases to store past interactions between customers and the company; an interaction may be a description of a problem together with its recommended solutions. The major challenge faced by these help desk systems is maintaining an up-to-date case history: doing so for each and every problem is difficult and costly.

Search engines already perform the task of intelligent help for all Internet users. However, the content on the web and on enterprise intranets is increasing day by day. The web is a vast collection of completely uncontrolled, heterogeneous documents: huge, diverse, and dynamic. For a user keyword query, current web search engines return a list of pages matching the query. However, the information for a topic, especially for multi-topic queries in which the individual query keywords occur relatively frequently in the document collection but rarely occur together in the same document, is often distributed across multiple physical pages. In this sense the search engines are drowning in information but starving for knowledge.

To address the challenges faced by present help desk systems and web search engines, we have developed an online helpdesk system, iassist. It automatically finds problem-solution patterns on the web using search engines such as Google and Yahoo. For a given user query, iassist interacts with the search engine to retrieve relevant solutions, and these retrieved solutions are ranked by their semantic similarity with the user query. Semantic similarity is based on semantic roles and semantic meanings.

II. LITERATURE SURVEY

Case-based systems have been developed to interactively search the solution space by suggesting the most informative questions to ask [2], [5]. These systems use the initial information to retrieve a first candidate set and then ask the user questions to narrow it down until few cases remain or the most suitable items are found. When the description of cases or items becomes complicated, these case-based systems suffer from the curse of dimensionality, and the similarity/distance between cases or items becomes difficult to measure. Furthermore, the similarity measurements used in these systems are usually based on keyword matching, which lacks semantic analysis of customer requests and existing cases.

Help desk systems based on database search and ranking have also been developed, and many methods have been proposed to perform similarity search and rank the results of a query [6]. However, as in the case-based systems, similarity is measured by keyword matching, which has difficulty understanding the semantics and context of text deeply.

Existing search engines often return a long list of search results, and clustering technologies are often used to organize them [7]. However, the existing document-clustering algorithms do not consider the impact of the general and common information contained in the documents. In our work, by filtering out this common information, the clustering quality can be improved and better context organization can be obtained.

III. SYSTEM ARCHITECTURE

Figure 1 shows the system architecture of iassist. The system works in five modules: Preprocessing, Case Ranking, Document Clustering, Sentence Clustering, and Sentence Cluster Summarization. As shown in the figure, the input to the system is a user query in the form of a question. The system retrieves relevant solutions, i.e., past cases, from the search engine. Preprocessing of the user query and the past cases involves removal of non-words; each retrieved document is then split into sentences and passed through a semantic role parser for semantic role labeling. The Case Ranking module ranks the retrieved documents by their sentence-level semantic similarity with the user query. The semantically ranked documents then need to be grouped according to context: the top-ranking documents are clustered using the Minimum Description Length (MDL) principle [1]. The Sentence Clustering module groups sentences with similar meaning into clusters using Symmetric Non-negative Matrix Factorization (SNMF) [3]. Finally, the Sentence Cluster Summarization module selects the most relevant sentences from each cluster to form a concise summary, which is presented to the user as the recommended solution. A sketch of how these modules fit together is given below.

Figure 1: System Architecture
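To make the data flow between the five modules explicit, the following minimal Python skeleton composes them in the order described above. All function names and signatures here are hypothetical; the paper's implementation is in Java and its internal APIs are not published, so this is only an architectural sketch.

```python
# Hypothetical composition of iassist's five modules (Section III).
from typing import Dict, List


def preprocess(query: str, documents: List[str]) -> Dict:
    """Remove non-words, split documents into sentences, run semantic role labeling (Section IV)."""
    raise NotImplementedError


def rank_cases(parsed: Dict) -> List[str]:
    """Rank retrieved documents by sentence-level semantic similarity to the query (Section V)."""
    raise NotImplementedError


def cluster_documents(ranked_docs: List[str]) -> List[List[str]]:
    """Group top-ranking documents into contexts using the MDL principle (Section VI)."""
    raise NotImplementedError


def cluster_sentences(doc_cluster: List[str]) -> List[List[str]]:
    """Group sentences with similar meaning using SNMF (Section VII)."""
    raise NotImplementedError


def summarize(sentence_clusters: List[List[str]], query: str) -> str:
    """Select the most relevant sentences from each cluster as the recommended solution (Section VIII)."""
    raise NotImplementedError


def iassist(query: str, search_results: List[str]) -> List[str]:
    """One recommended solution is produced per document cluster (context)."""
    parsed = preprocess(query, search_results)
    ranked = rank_cases(parsed)
    solutions = []
    for doc_cluster in cluster_documents(ranked):
        sentence_clusters = cluster_sentences(doc_cluster)
        solutions.append(summarize(sentence_clusters, query))
    return solutions
```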

IV. PREPROCESSING

It is essential to consider only meaningful words, to remove redundancy in the documents, and to reduce document size. Preprocessing of the problem-solution pattern therefore removes non-words from both the user query and the documents retrieved from the search engine. Each sentence in a retrieved document is then passed to a semantic role parser to find the semantic meaning of the sentence based on the frames (verbs) it contains.

Semantic role labeling

Semantic role labeling, sometimes also called shallow semantic parsing, is a task in natural language processing that detects the semantic arguments associated with the predicate (verb) of a sentence and classifies them into their specific roles. A semantic role describes the relationship that a constituent plays with respect to the verb in the sentence. For example, given a sentence like "Riya sold the book to Abbas", the task is to recognize the verb "to sell" as the predicate, "Riya" as the seller (agent), "the book" as the goods (theme), and "Abbas" as the recipient. This is an important step towards making sense of the meaning of a sentence. A semantic representation of this sort is at a higher level of abstraction than a syntax tree; for instance, the sentence "The book was sold by Riya to Abbas" has a different syntactic form but the same semantic roles.

To analyze the user query and the documents, the semantic roles of each sentence are computed by passing the sentences through the semantic role parser. This helps categorize the documents by their semantic importance to the user query. In iassist, NEC SENNA is used as the semantic role labeler; it is based on the PropBank [4] semantic annotation. The labeler labels each verb in a sentence with its propositional arguments, and the labeling for each particular verb is called a frame. Therefore, for each sentence, the number of frames generated by the parser equals the number of verbs in the sentence. A set of abstract arguments given by the labeler indicates the semantic role of each term in a frame: in general, Arg[m] represents the role of a term in the sentence, where m is the argument number; for example, Arg0 is the actor and Arg-NEG indicates negation. A simplified example of such a frame is shown below.
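As an illustration of the frame representation just described, the snippet below shows one frame for the example sentence. The dictionary layout is our own shorthand; SENNA actually emits per-token, column-oriented role tags, so this is only a simplified view of its output.

```python
# Illustrative frame for "Riya sold the book to Abbas" (Section IV).
frame = {
    "verb": "sold",      # the predicate that anchors the frame
    "Arg0": "Riya",      # agent / seller
    "Arg1": "the book",  # theme / goods
    "Arg2": "to Abbas",  # recipient
}

# A sentence yields one such frame per verb. "The book was sold by Riya to
# Abbas" produces a frame with the same role fillers despite its different
# syntactic form, which is exactly what the similarity computation in
# Section V exploits.
```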
V. SENTENCE-LEVEL SEMANTIC SIMILARITY COMPUTATION AND TOP RELEVANT DOCUMENT RANKING

To assist users in finding answers relevant to their query, the documents retrieved from the search engine must be ranked by their semantic importance to the input user query. To rank these documents, the similarity scores between the retrieved documents and the input user query are computed. Simple keyword-based similarity measures, such as cosine similarity, cannot capture semantic similarity. The system therefore calculates the semantic similarity between the sentences of the retrieved documents and the user query based on semantic role analysis, and the similarity computation uses WordNet to better capture semantically related words (a code sketch follows Table 1).

Table 1: Sentence-Level Semantic Similarity Calculation and Top Document Ranking

Input: sentences S_i and S_j

Algorithm:
1. S_i and S_j are parsed into frames by the semantic role labeler.
2. For terms in frames having the same semantic role, semantic similarity is computed using WordNet. This is done as follows:
   a. If two words in the query and the sentence are exactly equal and have the same semantic role, set the term similarity to 1.
   b. If the two words are not equal, check for a semantic relation such as synonymy using the WordNet hierarchy; if a similar semantic meaning is found, set the term similarity to 1.
   c. If both of the above cases fail, set the term similarity to 0.
   d. Mathematically: TermSim(t1, t2) = 1 if t1 = t2 or t1 and t2 are semantically related in WordNet, and 0 otherwise.
3. Let {r_1, r_2, ..., r_K} be the set of K common semantic roles between frames f_1 and f_2, and let T_1(r_i) and T_2(r_i) be the sets of terms filling role r_i in f_1 and f_2, respectively. The role similarity between the two term sets is
   RoleSim(r_i) = ( sum over t1 in T_1(r_i), t2 in T_2(r_i) of TermSim(t1, t2) ) / ( |T_1(r_i)| * |T_2(r_i)| ),
   where TermSim(t1, t2) is the similarity between two terms in the same role r_i.
4. The role similarities over the K common roles are combined into a frame similarity, FrameSim(f_1, f_2) = (1/K) * sum over i = 1..K of RoleSim(r_i), which in turn leads to the sentence similarity.
5. The maximum of the frame similarities between two sentences is taken as the similarity between the two sentences; this value lies in the interval [0, 1].
6. The documents returned by the search engine for the given query are ranked by a document score that aggregates the sentence-to-query similarity values over the sentences of the document, where d_n denotes the n-th retrieved document. The list of ranked documents is returned to the user as the search results.
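To make Table 1 concrete, here is a minimal Python sketch of the term, role, frame, and sentence similarities, assuming each sentence has already been parsed into frames of the form {role: [terms]}. Treating shared WordNet synsets as the "semantically related" test, averaging role similarities per frame, and averaging sentence scores per document are our reading of the paper rather than its exact formulas. The sketch requires NLTK with the WordNet corpus installed.

```python
# Sketch of the sentence-level semantic similarity of Table 1 (illustrative).
# Requires: pip install nltk; then nltk.download('wordnet') once.
from itertools import product
from nltk.corpus import wordnet as wn


def term_sim(t1: str, t2: str) -> float:
    """1 if the terms are identical or share a WordNet synset, else 0 (step 2)."""
    if t1 == t2:
        return 1.0
    synsets1 = {s.name() for s in wn.synsets(t1)}
    synsets2 = {s.name() for s in wn.synsets(t2)}
    return 1.0 if synsets1 & synsets2 else 0.0


def role_sim(terms1, terms2) -> float:
    """Normalised sum of pairwise term similarities for one shared role (step 3)."""
    if not terms1 or not terms2:
        return 0.0
    total = sum(term_sim(a, b) for a, b in product(terms1, terms2))
    return total / (len(terms1) * len(terms2))


def frame_sim(f1: dict, f2: dict) -> float:
    """Average role similarity over the roles the two frames share (step 4)."""
    common = set(f1) & set(f2)
    if not common:
        return 0.0
    return sum(role_sim(f1[r], f2[r]) for r in common) / len(common)


def sentence_sim(frames1, frames2) -> float:
    """Maximum frame-pair similarity, a value in [0, 1] (step 5)."""
    return max((frame_sim(a, b) for a, b in product(frames1, frames2)), default=0.0)


def document_score(doc_sentences, query_frames) -> float:
    """Aggregate sentence-to-query similarities into a document score (step 6)."""
    sims = [sentence_sim(frames, query_frames) for frames in doc_sentences]
    return sum(sims) / len(sims) if sims else 0.0
```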

VI. DOCUMENT CLUSTERING USING MDL PRINCIPLE

The identified top-ranking cases are all relevant to the user query, but they may actually belong to different categories. For example, if the user query is "Give information about Taj Mahal", the relevant cases may involve Taj Mahal as a tea brand, Taj as a five-star hotel, or Taj Mahal as the white marble mausoleum. It is therefore necessary to further group these cases into different contexts. The proposed system uses the Minimum Description Length (MDL) principle to cluster documents with similar meaning into one group. The MDL principle states that the best model inferred from given data is the one that minimizes the length of the model, in bits, plus the length of the encoding of the data given the model, in bits.

Table 2: Document Clustering Algorithm

1. Generate the set of distinct keywords for each document.
2. Calculate a support value for each keyword in the distinct keyword set. Each document k is represented by a vector whose entries are the support values of its keywords.
3. Decide a support threshold value for all documents in the document set.
4. Represent the document set and the keywords in matrix form; let M_TD be the term-document matrix.
5. Let C be the set of clusters for document set D. The clustering information is represented by a pair of matrices: M_TC, the term-cluster matrix, and M_DC, which records each cluster with its member documents. The term-document matrix M_TD is then represented using M_TC and M_DC together with a difference matrix whose entries take the values 0, 1, or -1.
6. Initially each document is assumed to form its own cluster, and an agglomerative clustering algorithm is applied for document clustering.

Table 3: Procedures

Algorithm AggloMDL(D)
1. Let C = {c_1, c_2, ..., c_n}, with c_i = ({d_i}).
2. Select the best cluster pair (c_i, c_j) from C for merging, forming a new cluster c_k:
3. (c_i, c_j, c_k) := GetBestPair(C)
4. while (c_i, c_j, c_k) is not empty do {
5.     C := C - {c_i, c_j} U {c_k}
6.     (c_i, c_j, c_k) := GetBestPair(C) }
7. return C
End

Procedure GetBestPair(C)
1. MDLcost_min := infinity
2. for each pair (c_i, c_j) of clusters in C do {
3.     (MDLcost, c_k) := GetMDLCost(c_i, c_j, C)   /* GetMDLCost returns the optimal MDL cost when c_k is made by merging c_i and c_j */
4.     if MDLcost < MDLcost_min then {
5.         MDLcost_min := MDLcost; (c_i_best, c_j_best, c_k_best) := (c_i, c_j, c_k) } }
6. return (c_i_best, c_j_best, c_k_best)
End

Procedure GetMDLCost(c_i, c_j, C)
1. D_k := D_i U D_j
2. c_k := (D_k)
3. C' := C - {c_i, c_j} U {c_k}
4. MDL := approximate MDL cost of C', computed with the MDL cost equation
5. return (MDL, c_k)
End

The MDL cost of a clustering C is the number of bits needed to encode the model, i.e., the matrices M_TC and M_DC, plus the number of bits needed to encode the difference matrix; the probabilities used in this encoding are computed from the M_TD matrix (see [1] for the exact encoding). A simplified sketch of the agglomerative procedure is given below.
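The following is a heavily simplified sketch of the AggloMDL procedure above. Documents are 0/1 keyword vectors, a cluster signature keeps the keywords shared by at least half of its members, and the cost function simply counts model entries plus corrections; it stands in for, but is not, the exact bit-level MDL encoding of [1].

```python
# Simplified AggloMDL sketch (Tables 2-3); the cost is an illustrative proxy.
from typing import List
import numpy as np


def cluster_signature(docs: np.ndarray) -> np.ndarray:
    """0/1 term vector: keywords present in at least half of the cluster's documents."""
    return (docs.mean(axis=0) >= 0.5).astype(int)


def mdl_cost(clusters: List[np.ndarray]) -> float:
    """Model cost (cluster signatures) plus data cost (0/1/-1 corrections per document)."""
    cost = 0.0
    for docs in clusters:
        sig = cluster_signature(docs)
        cost += sig.sum()                 # entries of the cluster's term list
        cost += np.abs(docs - sig).sum()  # difference-matrix corrections
    return cost


def agglo_mdl(documents: np.ndarray) -> List[np.ndarray]:
    """Greedily merge the cluster pair that lowers the cost most, until no merge helps."""
    clusters = [documents[i:i + 1] for i in range(len(documents))]
    while len(clusters) > 1:
        best, best_cost = None, mdl_cost(clusters)
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                merged = np.vstack([clusters[i], clusters[j]])
                trial = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
                cost = mdl_cost(trial)
                if cost < best_cost:
                    best, best_cost = trial, cost
        if best is None:  # no merge reduces the description length
            break
        clusters = best
    return clusters
```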

VII. CLUSTERING USING SYMMETRIC NON-NEGATIVE MATRIX FACTORIZATION (SNMF) ALGORITHM

Let W be the sentence similarity matrix, whose element W_ij is the similarity value between the sentence pair S_i and S_j, and let H be the sentence-cluster matrix. The factorization problem can be stated as: given the matrix W, find a non-negative matrix H that minimizes

    F(W, H) = || W - H H^T ||_F^2,

where ||.||_F is the Frobenius (squared error) norm. To derive the update rule for H with H >= 0, the Karush-Kuhn-Tucker (KKT) conditions lead to the fixed-point relation

    ( -4 (W H) + 4 (H H^T H) )_ij * H_ij = 0.

While this condition does not yet hold, H is updated using

    H_ij <- (1/2) * H_ij * ( 1 + (W H)_ij / (H H^T H)_ij ).

Hence the procedure for solving SNMF is: given an initial guess of H (here an identity-like matrix sized to match W), iteratively update H with the above equation until convergence, i.e., until the KKT condition is satisfied. Through this repeated updating, the matrix H finally encodes the clusters of sentences. Because SNMF maintains near-orthogonality of the columns of H, it is well suited to data (sentence) clustering, and it yields a soft clustering in which an object can belong to more than one cluster.

VIII. SUMMARIZATION OF EACH SENTENCE CLUSTER

Table 4: Within-cluster sentence selection

After grouping the sentences into clusters with the SNMF algorithm:
1. Remove the noisy clusters, i.e., clusters containing fewer than three sentences.
2. In each remaining sentence cluster, rank the sentences by the sentence score defined below. The score of a sentence measures how important it is to include that sentence in the final concise solution (summary).

Internal similarity measure:
    F_1(S_i) = ( 1 / (N - 1) ) * sum over S_j in C_k, j != i, of Sim(S_i, S_j)

External similarity measure:
    F_2(S_i) = Sim(S_i, Q)

where F_1(S_i) is the average similarity score between sentence S_i and all other sentences in cluster C_k, N is the number of sentences in C_k, and F_2(S_i) is the similarity between S_i and the input request Q. The overall sentence score combines the two measures with a weight parameter lambda, which is set to 0.7 by trial and error; a high value of lambda gives more weight to the internal similarity. A numpy sketch of the SNMF update and this within-cluster selection follows.
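Below is a compact numpy sketch of the SNMF update of Section VII followed by the within-cluster sentence selection of Section VIII. W is the precomputed sentence-similarity matrix from Section V; the random non-negative initialisation, the fixed iteration count, and the hard assignment via argmax are simplifications of ours, not choices prescribed by the paper.

```python
# Illustrative SNMF sentence clustering and within-cluster sentence selection.
import numpy as np


def snmf(W: np.ndarray, k: int, iters: int = 200, eps: float = 1e-9) -> np.ndarray:
    """Minimise ||W - H H^T||_F^2 with the multiplicative update of Section VII."""
    n = W.shape[0]
    H = np.random.rand(n, k) + eps  # non-negative initial guess (simplification)
    for _ in range(iters):
        H *= 0.5 + (W @ H) / (2.0 * (H @ H.T @ H) + eps)
    return H  # column j holds the (soft) membership of each sentence in cluster j


def select_sentences(W, H, query_sim, lam=0.7, per_cluster=1):
    """Per cluster, rank sentences by lam*F1 + (1-lam)*F2 and keep the best ones."""
    labels = H.argmax(axis=1)  # soft memberships -> hard assignment for selection
    chosen = []
    for c in range(H.shape[1]):
        idx = np.where(labels == c)[0]
        if len(idx) < 3:       # drop noisy clusters (fewer than three sentences)
            continue
        scores = []
        for i in idx:
            f1 = (W[i, idx].sum() - W[i, i]) / (len(idx) - 1)  # internal similarity F1
            f2 = query_sim[i]                                  # similarity to the request F2
            scores.append((lam * f1 + (1 - lam) * f2, i))
        scores.sort(reverse=True)
        chosen.extend(i for _, i in scores[:per_cluster])
    return chosen
```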
IX. RESULT ANALYSIS

All experiments reported here were performed on an Intel Core i3 processor with 4 GB of RAM, and all algorithms were implemented in Java. The implemented algorithms are:

1) Sentence-level semantic similarity calculation and top-ranking case selection, using NEC SENNA and WordNet;
2) Clustering of the top-ranking documents using the MDL principle;
3) Sentence clustering using SNMF;
4) Multi-document summarization by within-cluster sentence selection.

In this set of experiments, we randomly selected questions from different contexts together with the search results returned by the search engine. In the user survey, each user was asked to manually generate a solution for the selected queries; the sentences in that solution are taken as the relevant sentence set. We then compared the solution generated by iassist with that of a standard automated summarization tool. Table 6 shows a solution generated by the user and by iassist.

The performance of iassist is measured using the standard IR measures precision and recall:

    Precision = |S_man ∩ S_sys| / |S_sys|
    Recall    = |S_man ∩ S_sys| / |S_man|

where S_man is the set of sentences selected by manual evaluation and S_sys is the set of sentences selected by iassist (or by the automated summarization tool) in the final summary; a small computational sketch is given after this section. Table 5 shows the precision and recall values for sample user queries, and Figures 2 and 3 show the average precision and recall of the two techniques.

The higher precision of iassist compared with automated summarization tools demonstrates that the semantic similarity calculation better captures the meaning of the user requests and of the case documents returned by the search engine. A comparison of the proposed iassist system with current helpdesk systems is shown in Table 7. We observe that user satisfaction can be improved by capturing semantically related cases rather than only keyword-matched cases. From the recall and precision values obtained for the sample scenarios, we conclude that combining the MDL principle, which groups documents according to different contexts, with the SNMF clustering algorithm helps users easily find their desired solutions across multiple physical pages. The problem of maintaining an up-to-date history of past cases is solved by using the search engine as the database, and the user can pose problems from any domain.
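The two measures can be computed directly from the sentence sets, as in this small sketch (the sentence IDs in the example call are hypothetical):

```python
# Precision and recall over sentence sets (Section IX).
def precision_recall(s_man: set, s_sys: set):
    overlap = len(s_man & s_sys)
    precision = overlap / len(s_sys) if s_sys else 0.0
    recall = overlap / len(s_man) if s_man else 0.0
    return precision, recall


# Example with hypothetical sentence IDs:
# precision_recall({1, 2, 3, 5}, {2, 3, 4}) -> (0.666..., 0.5)
```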

X. CONCLUSION

The proposed iassist system gives its users a single point of access to their problems by providing solutions from different domains. The system automatically finds a problem-solution pattern for a new user request by making use of the results returned by the search engine. The use of semantic case ranking, MDL clustering, and SNMF with request-focused multi-document summarization helps improve the performance of iassist, and the proposed approach of semantic role labeling contributes to improving the overall quality of the summaries. Because the proposed system uses search engine results as the case history for the user query, the problem of maintaining an updated case history for each and every problem is automatically resolved.

Figure 2: Precision of Retrieved Cases
Figure 3: Recall of Retrieved Cases
Table 5: Performance Analysis
Table 7: Comparison of iassist with Current Helpdesk Systems
Table 6: Top-Ranking Summary Sample by Manual Evaluation and iassist for a Sample Scenario

REFERENCES

[1] C. Kim and K. Shim, "TEXT: Automatic Template Extraction from Heterogeneous Web Pages," IEEE Transactions, vol. 23, no. 4, April 2011.
[2] D. Wang, T. Li, S. Zhu, and Y. Gong, "ihelp: An Intelligent Online Helpdesk System," IEEE Transactions on Systems, Man, and Cybernetics, Part B: Cybernetics, vol. 41, no. 1, February 2011.
[3] D. Wang, S. Zhu, T. Li, and C. Ding, "Multi-document summarization via sentence-level semantic analysis and symmetric matrix factorization," in Proc. SIGIR, 2008, pp. 307-314.
[4] M. Palmer, P. Kingsbury, and D. Gildea, "The proposition bank: An annotated corpus of semantic roles," Computational Linguistics, vol. 31, no. 1, pp. 71-106, March 2005.
[5] D. Bridge, M. H. Goker, L. McGinty, and B. Smyth, "Case-based recommender systems," Knowledge Engineering Review, vol. 20, no. 3, pp. 315-320, September 2005.
[6] R. Agrawal, R. Rantzau, and E. Terzi, "Context-sensitive ranking," in Proc. SIGMOD, 2006, pp. 383-394.
[7] A. Leuski and J. Allan, "Improving interactive retrieval by combining ranked list and clustering," in Proc. RIAO, 2000, pp. 665-681.