Interactive Chinese Question Answering System in Medicine Diagnosis




Xipeng Qiu, School of Computer Science, Fudan University, xpqiu@fudan.edu.cn
Jiatuo Xu, Shanghai University of Traditional Chinese Medicine, xjt@fudan.edu.cn

Abstract

In this paper, we propose a general framework for an interactive question answering system for medical diagnosis, which interacts briefly with the user to obtain a more refined question description and then returns answers. The system first collects FAQ pairs from cQA websites and builds a medical ontology with incremental methods. It then analyzes the question and asks the user for the missing information. After receiving the user's feedback, it performs question retrieval and extracts the answers. Experiments show that our system performs better with user feedback.

1. Introduction

Automatic question answering (QA) is an important research topic in the fields of information retrieval and natural language processing [39, 40]. It is an alternative to keyword-based information retrieval systems such as Google and Baidu. The input of a QA system is a question, and the output is the corresponding answers extracted from a large corpus or the web [20]. However, these systems cannot handle complicated questions that depend on domain knowledge, such as questions in the medical domain. Fig. 1 shows the general framework of question answering in the open domain.

To alleviate this problem, we can resort to large-scale online FAQ archives for specific domains. In recent years, community-based question answering (cQA) services, such as Baidu Zhidao, have become very popular. Instead of finding answers through forums or search engines, users can post their questions on cQA websites and wait for other people to answer them. While forums focus on discussion and communication between users, cQA services focus on answering users' questions. Therefore, users get faster responses on cQA websites.
These cQA websites also provide interfaces for retrieving already-answered questions, but these interfaces are mostly built on keyword search engines, which is still not enough to give the user exact information. The user also needs to think of appropriate keywords to represent his needs. Besides, good answers are often mingled with a large number of bad or wrong answers. Therefore, the major issue is to find the exact answer when the answers to many complicated questions already exist. There is related work on question suggestion, answer quality, and question-answer pair extraction [14, 18, 23, 26, 32, 9, 27].

In this paper, we propose a general framework for an interactive Chinese question answering system for medical diagnosis, which interacts briefly with the user to obtain refined question descriptions and returns the extracted answers. The system first collects FAQ pairs from cQA websites and builds the medical ontology with an incremental method. It then analyzes the question and asks the user about the missing information. After receiving the user's feedback, it performs question retrieval and extracts answers. In the rest of the paper, we first describe our system in Section 2, evaluate it experimentally in Section 3, and finally give conclusions in Section 4.

1 http://www.google.com
2 http://www.baidu.com
3 http://zhidao.baidu.com

2. System Framework

In this section, we introduce our interactive Chinese question answering system for medical diagnosis.

2.1. Topical Crawler

Topical crawlers play an important role in domain-specific search engines. A topical crawler starts with some seed keywords or URLs and gathers web pages whose content is similar to the seeds [35, 28, 5]. The context is one of the most useful features for guiding the crawler to highly relevant target pages. In our system, we collect medical web pages by analyzing the anchor text attached to hyperlinks.
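The anchor-text collection step can be sketched with Python's standard html.parser; this is only an illustrative sketch, and the page snippet and URLs below are invented:

```python
from html.parser import HTMLParser

class AnchorCollector(HTMLParser):
    """Collect (anchor text, href) pairs from an HTML page.

    A topical crawler can feed each fetched page through this parser and
    use the anchor texts to decide which links are worth following.
    """
    def __init__(self):
        super().__init__()
        self._href = None   # href of the <a> tag we are inside, if any
        self._text = []     # text fragments seen inside that tag
        self.anchors = []   # collected (text, href) pairs

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.anchors.append(("".join(self._text).strip(), self._href))
            self._href = None

# Invented snippet standing in for a crawled page.
page = '<p>See <a href="/diabetes">diabetes symptoms</a> and <a href="/sports">sports news</a>.</p>'
parser = AnchorCollector()
parser.feed(page)
print(parser.anchors)   # [('diabetes symptoms', '/diabetes'), ('sports news', '/sports')]
```

The anchor texts gathered this way are what the classifier of the next step labels as medical or non-medical.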
We first collect the anchor texts with their corresponding categories from two Chinese cQA websites that provide question categories. Then we select the anchor texts whose categories relate to medical keywords, such as 医疗/疾病 (medical care / disease). Finally, we build a two-class classifier that labels anchor texts as medical or non-medical. The classifier we use is naive Bayes with a multinomial distribution [17].

Figure 1. The flowchart of the open-domain question answering system (components: Query Generation, Classification, Semantic Web Retrieval, Ranking, Answer Extraction, WWW).

2.2. Medical Ontology Construction

To take advantage of medical domain knowledge, we need to establish an ontology of medical terms, concepts, entities, and their relations. Since collecting this knowledge manually is difficult, we use an automatic method. There is already work on extracting information automatically from collected corpora [7, 37]. The objective of information extraction (IE) is to extract from text certain pieces of information that relate to a prescribed set of concepts. We first collect some initial information manually, including names of drugs, symptoms, and diseases and the relations between them. Then we build the medical ontology with information extraction methods.

2.3. QA Pair Extraction

There are many methods for extracting the best answer to a question on cQA websites [26, 15]. Answer quality matters because many questions are duplicated or wrong, and their answers vary greatly in quality; it is therefore not enough to measure relevance alone, and the quality of the answers must be considered as well. To predict the best answer, we use the features described in [15]: Answerer's Acceptance Ratio, Answer Length, Answerer's Self Evaluation, Answerer's Activity Level, Answerer's Category Specialty, Users' Recommendation, and Number of Answers.

2.4. Question Analysis

In a general question analysis system, the first step is question classification [29, 24, 10, 44].
The categories are consistent with the entity extraction in the later steps. However, a Chinese medical QA system faces some difficulties. First, English and Chinese question sentences differ greatly. Second, most questions are not fact-based and are complex to categorize. In our system, we build a question analysis model with the medical ontology [13]. It first analyzes the focus words in the question and finds the related concepts in the medical ontology. Then it classifies the question into a category and decides what information the question is missing.

2.5. Interactive Feedback

A user often inputs a question with just the main symptom, which is often not enough to determine the cause of the disease. For example: 有什么方法能治疗头晕? (What methods can treat dizziness?) There are many possible causes of dizziness, and the corresponding treatments vary greatly with the cause and the patient's state of health. To get exact answers, the user is asked to provide some extra information, such as his age and other symptoms. Given the symptoms the user provides, the system first gets the related symptoms from the collected medical knowledge. Then it interacts with the user to confirm all the signs of his disease.
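A minimal sketch of this feedback loop, assuming a toy symptom ontology (the symptom names and relations below are invented for illustration):

```python
# Toy ontology: each symptom maps to related symptoms worth asking about.
ONTOLOGY = {
    "dizziness": ["headache", "nausea", "blurred vision"],
    "fever": ["cough", "sore throat"],
}

def follow_up_questions(reported):
    """Return the related symptoms the system should ask the user to confirm."""
    reported = set(reported)
    candidates = []
    for symptom in reported:
        for related in ONTOLOGY.get(symptom, []):
            if related not in reported and related not in candidates:
                candidates.append(related)
    return candidates

def refine(reported, answers):
    """Merge the user's yes/no answers back into the reported symptom set."""
    return sorted(set(reported) | {s for s, yes in answers.items() if yes})

asks = follow_up_questions(["dizziness"])
print(asks)                                               # ['headache', 'nausea', 'blurred vision']
refined = refine(["dizziness"], {"headache": True, "nausea": False})
print(refined)                                            # ['dizziness', 'headache']
```

The refined symptom set then serves as the query for FAQ retrieval in the next step.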

Figure 2. The flowchart of the interactive Chinese question answering system (components: Topical Web Crawler, cQA Websites, WWW, QA Pairs, Medical Ontology, Question Type and Focus, Lacking Information, Feedback, Answer Filtering, Candidate Extraction, Auxiliary Information, Ranking).

There is also some research on interactive question answering [11, 12, 25].

2.6. FAQ Retrieval

Given a FAQ corpus, there remains the problem of retrieving information useful for the user's question. There is much work on improving the performance of FAQ retrieval [41, 22, 2, 19, 4, 3, 14, 16, 43]. An important problem is how to calculate the similarity between the user's question and a FAQ pair, which requires some semantic analysis. Measuring the semantic similarity between questions is not trivial: two questions with the same meaning can use very different wording. For example, Q1: 糖尿病患者长期服用什么药比较有效，副作用比较小? (What drug is effective for diabetic patients to take long term, with relatively few side effects?) and Q2: 有什么能有效降低血糖并且对身体无害的药? (Is there any drug that effectively lowers blood sugar and is harmless to the body?) have almost identical meanings but are lexically very different. Similarity measures developed for document retrieval work poorly when there is little word overlap. Thus, if the QA pair for Q2 is in the FAQ corpus but the user asks Q1, he gets no answer, because under traditional information retrieval methods Q1 and Q2 hardly match at all. A solution to this issue is query expansion [31, 38, 42]. In our system, we expand the query with the domain ontology: for the name of a disease, we add some keywords about its corresponding symptoms.

2.7. Answer Extraction

On cQA websites, repliers often provide background or related information for a question, which helps the questioner find out the facts himself. But sometimes, especially for factoid and list questions, the user needs the exact answers rather than related pieces of information. For example: 请问糖尿病的症状有哪些? (What are the symptoms of diabetes?) So we need to extract the answers from the related information [8, 33, 34, 45].
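The ontology-based query expansion of Section 2.6 can be sketched as follows; the English terms and the tiny ontology are invented stand-ins for the paper's Chinese examples:

```python
# Toy domain ontology: disease name -> related symptom keywords.
ONTOLOGY = {
    "diabetes": ["blood sugar", "thirst", "frequent urination"],
}

def expand_query(terms):
    """Append ontology keywords for any disease name found in the query."""
    expanded = list(terms)
    for term in terms:
        expanded += ONTOLOGY.get(term, [])
    return expanded

def overlap(q1, q2):
    """Word-overlap (Jaccard) similarity between two term lists."""
    s1, s2 = set(q1), set(q2)
    return len(s1 & s2) / len(s1 | s2)

# Q1-like user query vs. Q2-like archived FAQ question.
q1 = ["diabetes", "effective", "drug"]
q2 = ["drug", "lower", "blood sugar"]
print(overlap(q1, q2))                  # little overlap before expansion
print(overlap(expand_query(q1), q2))    # expansion adds "blood sugar", raising the overlap
```

Expansion bridges the lexical gap: the disease name in Q1 pulls in symptom terms that Q2 actually uses.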
We first extract the entities from this related information and classify them into different entity categories, such as Person, Location, Organization, Duration, Quantity, and Date [1]. Then we score the entities and filter them with a threshold. Entity scores have two components. The first component is whether or not the entity's category matches the query's

category. The second component is based on the frequency and positions of the occurrences of a given entity within the retrieved passages [1]. In our system, we use conditional random fields [21, 30] to label the entities and their corresponding categories.

2.8. Re-ranking

Before returning answers to the user, the system needs to re-rank the answers to improve performance, for example by removing redundant answers [6]. We can also use further features [36] to score each answer candidate.

3. Experiments

We implemented our system and collected about 84,000 QA pairs in the medical domain from two cQA websites: Baidu Zhidao (http://zhidao.baidu.com) and WenWen (http://wenwen.soso.com/). We evaluate our results with mean precision at rank 1 (P@1), the percentage of questions whose correct answer is in the first position. As the baseline system we use keyword queries, where the keywords are simply the terms of the question. We randomly selected 100 questions and evaluated the quality of the answers manually.

Table 1. Results of different systems (P@1)

Systems      P@1
Baseline     79%
No feedback  82%
Feedback     87%

Table 1 shows the results of our system. User feedback improves the answer quality considerably.

4. Conclusion

In this paper, we propose a framework for an interactive question answering system in the medical domain. It integrates question analysis, query expansion, ontology construction, answer extraction, and answer ranking. We also discuss the difficulties in each part and our preliminary solutions. The proposed framework can also be applied to other domains, such as music and travel.

5. Acknowledgements

This work was supported by the National High Technology Research and Development Program of China (863 Program) (No. 2007AA02Z429) and the Natural Science Foundation of China (No. 30300443 and 60435020).

References

[1] S. Abney, M. Collins, and A. Singhal. Answer extraction. In Proceedings of the Sixth Conference on Applied Natural Language Processing, pages 296-301, 2000.
[2] R.
Baeza-Yates, B. Ribeiro-Neto, et al. Modern Information Retrieval. Addison-Wesley, Harlow, England, 1999.
[3] R. Burke, K. Hammond, and J. Kozlovsky. Knowledge-based information retrieval from semi-structured text. In AAAI Fall Symposium on AI Applications in Knowledge Navigation and Retrieval, pages 19-24, 1995.
[4] R. Burke, K. Hammond, V. Kulyukin, S. Lytinen, N. Tomuro, and S. Schoenberg. Question answering from frequently asked question files: Experiences with the FAQ Finder system. AI Magazine, 18(2):57-66, 1997.
[5] S. Chakrabarti, K. Punera, and M. Subramanyam. Accelerated focused crawling through online relevance feedback. In Proceedings of the 11th International Conference on World Wide Web, pages 148-159, 2002.
[6] C. Clarke, G. Cormack, and T. Lynam. Exploiting redundancy in question answering. In Proceedings of the 24th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 358-365, 2001.
[7] J. Cowie and W. Lehnert. Information extraction. Communications of the ACM, 39(1):80-91, 1996.
[8] D. Demner-Fushman and J. Lin. Knowledge extraction for clinical question answering: Preliminary results. In Proceedings of the AAAI-05 Workshop on Question Answering in Restricted Domains, pages 9-13, 2005.
[9] S. Ding, G. Cong, C.-Y. Lin, and X. Zhu. Using conditional random fields to extract contexts and answers of questions from online forums. In Proceedings of ACL-08: HLT, pages 710-718, Columbus, Ohio, June 2008. Association for Computational Linguistics.
[10] J. Ely, J. Osheroff, P. Gorman, M. Ebell, M. Chambliss, E. Pifer, and P. Stavri. A taxonomy of generic clinical questions: classification study, 2000.
[11] T. Hao, D. Hu, L. Wenyin, and Q. Zeng. Semantic patterns for user-interactive question answering. Concurrency and Computation, 20(7):783, 2008.
[12] S. Harabagiu, A. Hickl, J. Lehmann, and D. Moldovan. Experiments with interactive question-answering. Ann Arbor, 100, 2005.
[13] U. Hermjakob.
Parsing and question classification for question answering. In Proceedings of the Workshop on Question Answering at the Conference ACL-2001, 2001.
[14] J. Jeon, W. Croft, and J. Lee. Finding similar questions in large question and answer archives. In Proceedings of the 14th ACM International Conference on Information and Knowledge Management, pages 84-90, 2005.

[15] J. Jeon, W. Croft, J. Lee, and S. Park. A framework to predict the quality of answers with non-textual features. In Proceedings of the 29th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 228-235, 2006.
[16] V. Jijkoun and M. de Rijke. Retrieving answers from frequently asked questions pages on the web. In Proceedings of the 14th ACM International Conference on Information and Knowledge Management, pages 76-83, 2005.
[17] M. Jordan. Learning in Graphical Models. Kluwer Academic Publishers, 1998.
[18] P. Jurczyk and E. Agichtein. Discovering authorities in question answer communities by using link analysis. In Proceedings of the 16th ACM Conference on Information and Knowledge Management, pages 919-922, 2007.
[19] H. Kim and J. Seo. High-performance FAQ retrieval using an automatic clustering method of query logs. Information Processing and Management, 42(3):650-661, 2006.
[20] C. Kwok, O. Etzioni, and D. Weld. Scaling question answering to the web. In Proceedings of the 10th International Conference on World Wide Web, pages 150-161, 2001.
[21] J. D. Lafferty, A. McCallum, and F. C. N. Pereira. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In ICML '01: Proceedings of the Eighteenth International Conference on Machine Learning, pages 282-289, San Francisco, CA, USA, 2001. Morgan Kaufmann Publishers Inc.
[22] C. Lee. Intention Extraction and Semantic Matching for Internet FAQ Retrieval. Master's thesis, Department of Computer Science and Information Engineering, National Cheng Kung University, Taiwan, ROC, 2000.
[23] C. Lengeler, D. Savigny, H. Mshinda, C. Mayombana, S. Tayari, C. Hatz, A. Degrémont, and M. Tanner. Community-based questionnaires and health statistics as tools for the cost-efficient identification of communities at risk of urinary schistosomiasis. International Journal of Epidemiology, 20(3):796-807, 1991.
[24] X. Li and D. Roth.
Learning question classifiers. In Proceedings of the 19th International Conference on Computational Linguistics, pages 556-562, 2002.
[25] J. Lin, D. Quan, V. Sinha, K. Bakshi, D. Huynh, B. Katz, and D. Karger. What makes a good answer? The role of context in question answering. In Human-Computer Interaction, 2003.
[26] X. Liu, W. Croft, and M. Koll. Finding experts in community-based question-answering services. In Proceedings of the 14th ACM International Conference on Information and Knowledge Management, pages 315-316. ACM, New York, NY, USA, 2005.
[27] Y. Liu and E. Agichtein. You've got answers: Towards personalized models for predicting success in community question answering. In Proceedings of ACL-08: HLT, Short Papers, pages 97-100, Columbus, Ohio, June 2008. Association for Computational Linguistics.
[28] F. Menczer, G. Pant, P. Srinivasan, and M. Ruiz. Evaluating topic-driven web crawlers. In Proceedings of the 24th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 241-249, 2001.
[29] D. Metzler and W. Croft. Analysis of statistical question classification for fact-based questions. Information Retrieval, 8(3):481-504, 2005.
[30] F. Peng, F. Feng, and A. McCallum. Chinese segmentation and new word detection using conditional random fields. In Proceedings of the 20th International Conference on Computational Linguistics, 2004.
[31] Y. Qiu and H. Frei. Concept based query expansion. In Proceedings of the 16th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 160-169, 1993.
[32] B. Smyth, E. Balfe, J. Freyne, P. Briggs, M. Coyle, and O. Boydell. Exploiting query repetition and regularity in an adaptive community-based web search engine. User Modeling and User-Adapted Interaction, 14(5):383-423, 2004.
[33] R. Srihari and W. Li. A question answering system supported by information extraction.
In Proceedings of the Sixth Conference on Applied Natural Language Processing, pages 166-172, 2000.
[34] R. Srihari, W. Li, and N. Cymfony. Information extraction supported question answering. NIST Special Publication SP, pages 185-196, 2000.
[35] P. Srinivasan, F. Menczer, and G. Pant. A general evaluation framework for topical crawlers. Information Retrieval, 8(3):417-447, 2005.
[36] M. Surdeanu, M. Ciaramita, and H. Zaragoza. Learning to rank answers on large online QA collections. In Proceedings of ACL-08: HLT, pages 719-727, Columbus, Ohio, June 2008. Association for Computational Linguistics.
[37] J. Turmo, A. Ageno, and N. Català. Adaptive information extraction. ACM Computing Surveys (CSUR), 38(2), 2006.
[38] E. Voorhees. Query expansion using lexical-semantic relations. Springer-Verlag, New York, NY, USA, 1994.
[39] E. Voorhees. The TREC-8 question answering track report. NIST Special Publication SP, pages 77-82, 2000.
[40] E. Voorhees. Overview of the TREC 2003 question answering track. In Proceedings of the Twelfth Text REtrieval Conference (TREC 2003), 142, 2003.
[41] C. Wu, J. Yeh, and Y. Lai. Semantic segment extraction and matching for internet FAQ retrieval. IEEE Transactions on Knowledge and Data Engineering, pages 930-940, 2006.
[42] J. Xu and W. Croft. Query expansion using local and global document analysis. In Proceedings of the 19th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 4-11, 1996.
[43] S. Yang, F. Chuang, and C. Ho. Ontology-supported FAQ processing and ranking techniques. Journal of Intelligent Information Systems, 28(3):233-251, 2007.
[44] W. Zhang and T. Chen. Classification based on symmetric maximized minimal distance in subspace (SMMS). In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2003.
[45] Z. Zheng. AnswerBus question answering system. In Proceedings of the Second International Conference on Human Language Technology Research, pages 399-404, 2002.