Search Engine Based Intelligent Help Desk System: iassist

Sahil K. Shah, Prof. Sheetal A. Takale
Information Technology Department, VPCOE, Baramati, Maharashtra, India
sahilshahwnr@gmail.com, sheetaltakale@gmail.com

Abstract: An intelligent help desk system is the need of every individual, and many organizations use case-based help desk systems. However, maintaining an up-to-date case history for each and every problem is difficult and costly. Search engines already perform the task of intelligent help for all users of the Internet, but for a given keyword query, current web search engines return a list of individual web pages, while the information relevant to the query is often spread across multiple pages; this degrades the quality of the search results. To address these challenges, an intelligent help desk system, iassist, is developed. It uses search engine results as the case history for the user query. The semantic relevance of the search results to the user query is computed using NEC SENNA and WordNet, and the proposed system ranks the search results based on their semantic relevance to the request. The relevant results are grouped into clusters based on the MDL principle and symmetric matrix factorization, and each cluster is summarized to generate recommended solutions. For performance analysis, the system is evaluated through a user survey. The experiments conducted demonstrate the effectiveness of iassist in semantic text understanding, document clustering, and summarization. The better performance of iassist derives from sentence-level semantic analysis and from clustering using the MDL principle and SNMF.

Keywords: Intelligent Helpdesk, Semantic Similarity, Web Search Result Summarization, Document Summarization

I. INTRODUCTION

An intelligent help desk system is the need of every individual. Many organizations use case-based help desk systems to improve the quality of customer service.
For a given customer request, an intelligent help desk system tries to find earlier similar requests and the case history associated with them. Help desk systems usually use databases to store past interactions between customers and companies; an interaction may be the description of a problem together with its recommended solutions. A major challenge faced by these help desk systems is the maintenance of an up-to-date case history: maintaining one for each and every problem is difficult and costly.

Search engines already perform the task of intelligent help for all users of the Internet. However, content on the web and on enterprise intranets is increasing day by day. The web is a vast collection of completely uncontrolled, heterogeneous documents; it is huge, diverse, and dynamic. For a user keyword query, current web search engines return a list of pages relevant to the query. However, the information for a topic, especially for multi-topic queries in which individual query keywords occur relatively frequently in the document collection but rarely occur together in the same document, is often distributed among multiple physical pages. Thus the search engines are drowning in information, but starving for knowledge.

To address the challenges faced by present help desk systems and web search engines, we have developed an online help desk system: iassist. It automatically finds problem-solution patterns from the web using search engines such as Google and Yahoo. For a given user query, iassist interacts with the search engine to retrieve relevant solutions. The retrieved solutions are ranked based on their semantic similarity to the user query, where semantic similarity is based on semantic roles and semantic meanings.

II. LITERATURE SURVEY

Case-based systems have been developed to interactively search the solution space by suggesting the most informative questions to ask [2], [5].
These systems use the initial information to retrieve a first candidate set and then ask the user questions to narrow it down until only a few cases remain or the most suitable items are found. When the description of cases or items becomes complicated, these case-based systems suffer from the curse of dimensionality, and the similarity/distance between cases or items becomes difficult to measure. Furthermore, the similarity measurements used in these systems are usually based on keyword matching, which lacks semantic analysis of customer requests and existing cases.

Help desk systems based on database search and ranking have also been developed, and many methods have been proposed to perform similarity search and rank the results of a query [6]. However, similar to the case-based systems, their similarity is measured by keyword matching, which has difficulty understanding the semantics and context of text deeply.

Existing search engines often return a long list of search results, and clustering technologies are often used to organize them [7]. However, the existing document-clustering algorithms do not consider the impact of the general and common information contained in the documents. In our work, by filtering out this common information, the clustering quality can be improved and better context organization can be obtained.

III. SYSTEM ARCHITECTURE

Figure 1 shows the system architecture of iassist. The system works in five modules: Preprocessing, Case Ranking, Document Clustering, Sentence Clustering, and Sentence Cluster Summarization. As shown in the figure, the input to the system is a user query in the form of a question. The
system retrieves relevant solutions, or past cases, from a search engine. Preprocessing of the user query and the past cases involves the removal of non-words; each retrieved document is then split into sentences and passed through a semantic role parser for semantic role labeling. The Case Ranking module ranks the retrieved documents based on their sentence-level semantic similarity to the user query. The semantically ranked documents then need to be grouped according to context, so the top-ranking documents are clustered using the Minimum Description Length (MDL) principle [1]. The Sentence Clustering module groups sentences having similar meanings into clusters using Symmetric Non-negative Matrix Factorization (SNMF) [3]. Finally, the Sentence Cluster Summarization module selects the most relevant sentences from each cluster to form a concise summary, which is presented as the reference solution to the user.

IV. PREPROCESSING

It is essential to consider only meaningful words, to remove redundancy in the documents, and to reduce the document size. Preprocessing of the problem-solution pattern therefore involves the removal of non-words from both the user query and the documents retrieved from the search engine. Further, each sentence in a retrieved document is passed to a semantic role parser to find the semantic meaning of the sentence based on the frames (verbs) it contains.

Semantic role labeling

Semantic role labeling, sometimes also called shallow semantic parsing, is a task in natural language processing consisting of the detection of the semantic arguments associated with the predicate or verb of a sentence and their classification into their specific roles. A semantic role is a description of the relationship that a constituent plays with respect to the verb in the sentence.
For example, given a sentence like "Riya sold the book to Abbas", the task would be to recognize the verb "to sell" as representing the predicate, "Riya" as representing the seller (agent), "the book" as representing the goods (theme), and "Abbas" as representing the recipient. This is an important step towards making sense of the meaning of a sentence. A semantic representation of this sort is at a higher level of abstraction than a syntax tree. For instance, the sentence "The book was sold by Riya to Abbas" has a different syntactic form but the same semantic roles.

In order to analyze the user query and the documents, the semantic roles of each sentence are computed by passing the sentences through a semantic role parser. This helps in categorizing the documents based on their semantic importance to the user query. In iassist, NEC SENNA is used as the semantic role labeler; it is based on the PropBank [4] semantic annotation. The labeler labels each verb in a sentence with its propositional arguments, and the labeling for each particular verb is called a frame. Therefore, for each sentence, the number of frames generated by the parser equals the number of verbs in the sentence. A set of abstract arguments given by the labeler indicates the semantic role of each term in a frame. In general, Arg[m] represents the role of a term in the given sentence, where m is the argument number within the sentence. For example, Arg0 is the actor, and Arg-NEG indicates negation.

Figure 1: System Architecture

V. SENTENCE-LEVEL SEMANTIC SIMILARITY COMPUTATION AND TOP RELEVANT DOCUMENT RANKING

To assist users in finding answers relevant to their query, the documents retrieved from the search engine need to be ranked based on their semantic importance to the input user query. In order to rank these documents, similarity scores between the retrieved documents and the input user query are computed.
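Once per-sentence similarity scores to the query are available, the ranking step itself is straightforward. The following is a minimal sketch, assuming the score of a document aggregates its sentence-query similarities by averaging; the averaging aggregation is an illustrative choice, not necessarily the system's exact scoring formula.

```java
import java.util.*;

public class CaseRanking {
    // Score a document as the average of its sentences' similarity to the
    // query (averaging is an illustrative assumption).
    static double documentScore(double[] sentenceSimilarities) {
        if (sentenceSimilarities.length == 0) return 0.0;
        double sum = 0.0;
        for (double s : sentenceSimilarities) sum += s;
        return sum / sentenceSimilarities.length;
    }

    // Rank document indices by descending document score.
    // docSentenceSims[n] holds the query similarities of document d_n's sentences.
    static List<Integer> rank(double[][] docSentenceSims) {
        Integer[] ids = new Integer[docSentenceSims.length];
        double[] scores = new double[docSentenceSims.length];
        for (int i = 0; i < ids.length; i++) {
            ids[i] = i;
            scores[i] = documentScore(docSentenceSims[i]);
        }
        Arrays.sort(ids, (a, b) -> Double.compare(scores[b], scores[a]));
        return Arrays.asList(ids);
    }
}
```

The ranked list of indices corresponds to the ordered search results returned to the user.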
Simple keyword-based similarity measurements, such as the cosine similarity, cannot capture semantic similarity. Thus, this system calculates the semantic similarity between the user query and the sentences of the documents retrieved from the search engine based on semantic role analysis. In addition, the similarity computation uses WordNet in order to better capture semantically related words.

Table 1: Sentence-Level Semantic Similarity Calculation and Top Document Ranking

Input: sentences S_i and S_j

Algorithm:
1. S_i and S_j are parsed into frames by the semantic role labeler.
2. For terms in frames having the same semantic role, semantic similarity is computed using WordNet, as follows:
   a. If two words in the query and the sentence are exactly equal and have the same semantic role, set the term similarity to 1.
   b. If the two words are not equal, check for a semantic relation such as synonymy using the WordNet hierarchy; if a similar semantic meaning is found, set the term similarity to 1.
   c. If both of the above cases fail, the term similarity is set to 0.
   d. Mathematically: TermSim(t_1, t_2) = 1 if t_1 and t_2 share the same semantic role and are identical or semantically related in WordNet, and TermSim(t_1, t_2) = 0 otherwise.
3. Let {r_1, r_2, ..., r_K} be the set of K common semantic roles between frames f_1 and f_2, and let T_1(r_i) and T_2(r_i) be the sets of terms associated with role r_i in f_1 and f_2, respectively. The role similarity between the two term sets in the two sentences is computed from the term similarities TermSim(t_1, t_2) of the terms sharing the role r_i.
4. The role similarities over the K common roles are combined to compute the similarity between the two frames, which in turn yields the sentence similarity.
5. The maximum of the frame similarities between the two sentences is taken as the similarity of the two sentences; this value lies in the interval [0, 1].
6. The documents returned by the search engine for the given query are ranked by a document score that aggregates the similarities between the document's sentences and the query, where d_n denotes the n-th retrieved document. The list of ranked documents is returned to the user as the search results.

VI. DOCUMENT CLUSTERING USING THE MDL PRINCIPLE

The identified top-ranking cases are all relevant to the user query, but these relevant cases may actually belong to different categories. For example, if the user query is "Give information about Taj Mahal", the relevant cases may involve Taj Mahal as a tea brand, Taj as a five-star hotel, or Taj Mahal as a white marble mausoleum. Therefore, it is necessary to further group these cases into different contexts. The proposed system makes use of the Minimum Description Length (MDL) principle in order to cluster documents with similar meanings into one group. The MDL principle states that the best model inferred from given data is the one that minimizes the length of the model, in bits, plus the length of the encoding of the data under the model, in bits.

Table 2: Document Clustering Algorithm

1. Generate the set of distinct keywords for each document.
2. Calculate the support value of each keyword in the distinct keyword set. Each document k is represented by a vector whose components are the support values of its keywords.
3. Decide a support threshold value for all documents in the document set.
4. Represent the document set and the keywords in matrix form; let M_TD be the term-document matrix.
5. Let C be the set of clusters for the document set D. The clustering information is represented by a pair of matrices: M_TC, the term-cluster matrix, and M_DC, the document-cluster matrix, which associates each cluster with its member documents. The term-document matrix M_TD is then represented using M_TC and M_DC together with a difference matrix with 0/1/-1 values.
6. Initially, each document is assumed to represent one cluster, and an agglomerative clustering algorithm is applied for document clustering.

Table 3: Procedures

Algorithm AggloMDL(D)
1. Let C = {c_1, c_2, ..., c_n}, with c_i = ({d_i}).
2. Select the best cluster pair (c_i, c_j) from C for merging, forming a new cluster c_k:
3. (c_i, c_j, c_k) := GetBestPair(C)
4. while (c_i, c_j, c_k) is not empty do {
5.   C := C - {c_i, c_j} ∪ {c_k}
6.   (c_i, c_j, c_k) := GetBestPair(C) }
7. return C
End

Procedure GetBestPair(C)
1. MDLcost_min := ∞
2. for each pair (c_i, c_j) of clusters in C do {
3.   (MDLcost, c_k) := GetMDLCost(c_i, c_j, C)  /* GetMDLCost returns the optimal MDL cost when c_k is made by merging c_i and c_j */
4.   if MDLcost < MDLcost_min then {
5.     MDLcost_min := MDLcost; (c_i*, c_j*, c_k*) := (c_i, c_j, c_k) } }
6. return (c_i*, c_j*, c_k*)
End

Procedure GetMDLCost(c_i, c_j, C)
1. D_k := D_i ∪ D_j
2. c_k := (D_k)
3. C' := C - {c_i, c_j} ∪ {c_k}
4. MDL := approximate MDL cost of C' by the MDL cost equation
5. return (MDL, c_k)
End
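The agglomerative control flow of AggloMDL can be sketched as follows. This is a minimal illustration of the merge loop only: the `cost` function is a pluggable stand-in for the MDL cost equation (the real system computes it from the M_TD matrix), and stopping when no merge lowers the cost is our interpretation of GetBestPair returning empty.

```java
import java.util.*;
import java.util.function.Function;

public class AggloMDLSketch {
    // Greedy agglomerative merging driven by a pluggable clustering cost.
    // 'cost' maps a clustering (list of clusters, each a set of document ids)
    // to a total cost; the paper's system would plug in the MDL cost here.
    static List<Set<Integer>> cluster(int nDocs,
            Function<List<Set<Integer>>, Double> cost) {
        List<Set<Integer>> C = new ArrayList<>();
        for (int d = 0; d < nDocs; d++)
            C.add(new HashSet<>(Set.of(d)));        // one document per cluster
        while (true) {
            double bestCost = cost.apply(C);        // a merge must improve on this
            int bi = -1, bj = -1;
            for (int i = 0; i < C.size(); i++) {
                for (int j = i + 1; j < C.size(); j++) {
                    double c = cost.apply(mergePair(C, i, j));
                    if (c < bestCost) { bestCost = c; bi = i; bj = j; }
                }
            }
            if (bi < 0) return C;                   // no pair lowers the cost: stop
            C = mergePair(C, bi, bj);
        }
    }

    // Return a copy of C with clusters i and j replaced by their union.
    static List<Set<Integer>> mergePair(List<Set<Integer>> C, int i, int j) {
        List<Set<Integer>> out = new ArrayList<>();
        Set<Integer> merged = new HashSet<>(C.get(i));
        merged.addAll(C.get(j));
        for (int k = 0; k < C.size(); k++)
            if (k != i && k != j) out.add(C.get(k));
        out.add(merged);
        return out;
    }
}
```

Because the cost is evaluated on the whole clustering, any encoding-length measure over (M_TC, M_DC, difference matrix) can be supplied without changing the merge loop.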
The quantities in the MDL cost equation are computed using the M_TD matrix: the cost approximates the total description length, in bits, of the clustering model (M_TC and M_DC) together with the difference matrix needed to reconstruct M_TD.

VII. CLUSTERING USING THE SYMMETRIC NON-NEGATIVE MATRIX FACTORIZATION (SNMF) ALGORITHM

Let W be the sentence similarity matrix, whose element W_ij is the similarity value of the sentence pair S_i and S_j, and let H be the sentence-cluster matrix, of the same size as W. The factorization problem can be stated as follows: given the matrix W, find a nonnegative matrix H that minimizes the function

F(W, H) = ||W - H H^T||_F^2,

where ||.||_F is the Frobenius (squared error) norm. To derive the rule for updating H subject to H >= 0, the Karush-Kuhn-Tucker (KKT) conditions lead to the fixed-point relation

(-4 W H + 4 H H^T H)_ij · H_ij = 0.

Until this condition holds, the value of H is updated using the equation

H_ij <- (1/2) H_ij (1 + (W H)_ij / (H H^T H)_ij).

Hence, the procedure for solving SNMF is: given an initial guess of H (the identity matrix in this case), iteratively update H using the above equation until convergence, i.e., until it satisfies the KKT condition. Through this repeated updating, the matrix H finally encodes the clusters of sentences. As SNMF maintains near-orthogonality of the columns of H, it is useful for data (sentence) clustering. The result is a soft clustering, in which an object can belong to more than one cluster.

VIII. SUMMARIZATION OF EACH SENTENCE CLUSTER

Table 4: Within-Cluster Sentence Selection

After grouping the sentences into clusters by the SNMF algorithm:
1. Remove the noisy clusters, i.e., clusters containing fewer than three sentences.
2. Then, in each sentence cluster, rank the sentences based on the sentence score defined by the following equations. The score of a sentence measures the importance of including the sentence in the final concise solution (summary).

Internal similarity measure:

F_1(S_i) = (1 / (N - 1)) Σ_{S_j ∈ C_k, j ≠ i} sim(S_i, S_j),

where F_1(S_i) measures the average similarity score between sentence S_i and all the other sentences in cluster C_k, and N is the number of sentences in C_k.
The external similarity measure F_2(S_i) is the similarity between sentence S_i and the input request. The two measures are combined into the overall sentence score

Score(S_i) = λ F_1(S_i) + (1 - λ) F_2(S_i),

where the weight parameter λ is set to 0.7 by trial and error; a high value of λ gives more weight to the internal similarity.

IX. RESULT ANALYSIS

All experiments reported here were performed on an Intel Core i3 processor with 4 GB of RAM, and all algorithms were implemented in Java. The implemented algorithms are:

1) Sentence-level semantic similarity calculation and top-ranking case selection, using NEC SENNA and WordNet.
2) Clustering of the top-ranking documents using the MDL principle.
3) Sentence clustering using SNMF.
4) Multi-document summarization: within-cluster sentence selection.

In this set of experiments, we randomly selected questions from different contexts along with the search results returned by the search engine. During the user survey, each user was asked to manually generate a solution for the selected queries; the sentences in that solution are considered the relevant sentence set. We then compare the solution generated by iassist with that of a standard automated summarization tool. Table 6 shows the solutions generated by a user and by iassist. The performance of iassist is measured using the standard IR measures precision and recall:

Precision = |S_man ∩ S_sys| / |S_sys|,  Recall = |S_man ∩ S_sys| / |S_man|,

where S_man is the set of sentences selected by manual evaluation, and S_sys is the set of sentences selected by iassist or by the automated summarization tool in the final summary.

Table 5 shows the precision and recall values for sample user queries, and Figures 2 and 3 show the average precision and recall of the two techniques. The higher precision of iassist compared to the automated summarization tool demonstrates that the semantic similarity calculation better captures the meaning of the user requests and of the case documents returned by the search engine. A comparison of the proposed iassist system with current help desk systems is shown in Table 7. We observe that user satisfaction can be improved by capturing semantically related cases rather than only keyword-matched cases.
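The precision and recall measures over the sentence sets S_man and S_sys amount to simple set arithmetic; a minimal sketch, with sentences represented as strings:

```java
import java.util.*;

public class IRMeasures {
    // Precision = |S_man ∩ S_sys| / |S_sys|
    static double precision(Set<String> sMan, Set<String> sSys) {
        return sSys.isEmpty() ? 0.0
                : (double) intersectionSize(sMan, sSys) / sSys.size();
    }

    // Recall = |S_man ∩ S_sys| / |S_man|
    static double recall(Set<String> sMan, Set<String> sSys) {
        return sMan.isEmpty() ? 0.0
                : (double) intersectionSize(sMan, sSys) / sMan.size();
    }

    // Size of the intersection of the manually selected and
    // system-selected sentence sets.
    static int intersectionSize(Set<String> a, Set<String> b) {
        Set<String> common = new HashSet<>(a);
        common.retainAll(b);
        return common.size();
    }
}
```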
From the recall and precision values obtained for the sample scenarios, we conclude that combining the MDL principle, which groups documents according to their different contexts, with the SNMF clustering algorithm helps users easily find their desired solutions from multiple physical pages. The problem of maintaining an up-to-date history of past cases is solved by using the search engine as the database, and the user can pose a problem from any domain.

X. CONCLUSION

The proposed iassist system provides its users with a single point of access to solutions for their problems across different domains. The system automatically finds a problem-solution pattern for a new user request by making use of the results returned by the search engine. The use of semantic case ranking, MDL clustering, and SNMF with request-focused multi-document summarization improves the performance of iassist, and the proposed approach of semantic role labeling contributes to improving the overall summarization result. As the proposed system uses search engine results as the case history for the user query, the problem of maintaining an
updated case history for each and every problem is automatically resolved.

Figure 2: Precision of Retrieved Cases
Figure 3: Recall of Retrieved Cases
Table 5: Performance Analysis
Table 7: Comparison of iassist with Current Helpdesk Systems
Table 6: Top-Ranking Summary Sample by Manual Evaluation and iassist for a Sample Scenario

REFERENCES
[1] C. Kim and K. Shim, "TEXT: Automatic Template Extraction from Heterogeneous Web Pages," IEEE Transactions on Knowledge and Data Engineering, vol. 23, no. 4, April 2011.
[2] D. Wang, T. Li, S. Zhu, and Y. Gong, "iHelp: An Intelligent Online Helpdesk System," IEEE Transactions on Systems, Man, and Cybernetics, Part B: Cybernetics, vol. 41, no. 1, February 2011.
[3] D. Wang, S. Zhu, T. Li, and C. Ding, "Multi-document summarization via sentence-level semantic analysis and symmetric matrix factorization," in Proc. SIGIR, 2008, pp. 307-314.
[4] M. Palmer, P. Kingsbury, and D. Gildea, "The proposition bank: An annotated corpus of semantic roles," Computational Linguistics, vol. 31, no. 1, pp. 71-106, Mar. 2005.
[5] D. Bridge, M. H. Goker, L. McGinty, and B. Smyth, "Case-based recommender systems," Knowledge Engineering Review, vol. 20, no. 3, pp. 315-320, Sep. 2005.
[6] R. Agrawal, R. Rantzau, and E. Terzi, "Context-sensitive ranking," in Proc. SIGMOD, 2006, pp. 383-394.
[7] A. Leuski and J. Allan, "Improving interactive retrieval by combining ranked list and clustering," in Proc. RIAO, 2000, pp. 665-681.