Clinical Decision Support with the SPUD Language Model

Size: px
Start display at page:

Download "Clinical Decision Support with the SPUD Language Model"

Transcription

1 Clinical Decision Support with the SPUD Language Model Ronan Cummins The Computer Laboratory, University of Cambridge, UK Abstract. In this paper we present the systems and techniques used by the University of Cambridge for the CDS (Clinical Decision Support) track of the 24 th Text Retrieval Conference (TREC). The system was among the best automatic approaches for both CDS tasks and is based on a new language modelling approach implemented using Lucene. 1 1 Introduction We outline the main models and techniques used to participate in the CDS track of TREC The CDS track consisted of retrieving relevant biomedical articles for answering generic clinical questions about medical records. As the documents are full scientific articles they tend to be much longer than the average general Web document. Furthermore, the types of the queries issued for this task are also much longer than those used for typical web search. Therefore, we adopted the use of our new document language model [1] that has been shown to retrieve longer documents more fairly. The recently developed SPUD language model [1] treats document generation using the Pólya process and aims to model wordburstiness directly. It has been shown to incorporate a number of theoretically interesting properties. For example, it models the scope and verbosity hypothesis [3] separately, and reintroduces a measure closely related to inverse document frequency [4]. Therefore, we hypothesised that the SPUD language model would be well-suited to the retrieval of scientific texts. We also studied the performance of a newly developed query modelling technique which re-weights salient terms in longer queries. Furthermore, we investigated query expansion using two different models. 2 Models We now outline the main models used in our approaches. 1

2 2 Ronan Cummins 2.1 Document Models We model each document as a mixture of multivariate Pólya distributions. The model captures word burstiness by modelling the dependencies between recurrences of the same word-type. Each document is modelled as follows: α d = (1 ω) α τ + ω α c (1) where α d, α τ, and α c are the document model, topic model, 2 and background model respectively. The hyper-parameter ω controls the smoothing and is stable at approximately ω = 0.8. Each of these models are multivariate Pólya distributions with parameters estimated as follows: ˆα τ = {m d c(t, d) d : t d} ˆα c = {m c df t t df t : t C} (2) where m d is the number of word-types (distinct terms) in d, c(t, d) is the count of term t in document d, d is the number of word tokens in d, df t is the document frequency of term t in the collection C, and m c is a background mass parameter that can be estimated via numerical methods (see [1] for details). The scale parameters m d and m c can be interpreted as beliefs in the parameters c(t, d)/ d and df t / t df t respectively. For the document collection in the CDS track the parameter m c is estimated to be 775. The KL-divergence approach to ranking documents can be used with these document models whereby one takes a point-estimate of the distribution (i.e. E[α d ] is a multinomial) and one can rank documents according to the negative KL-divergence between the query distribution and the expected document multinomial. 2.2 Query Models When a user formulates a short keyword query (e.g. hypotension, hypoxia), it is usually assumed that they have already distilled the topical aspect of the information need. Consequently, one may assume that the probability that a particular query token is topical is 1.0 and this can be normalised accordingly to estimate the maximum likelihood query model (i.e. { 1 2, 1 2 }). This is the standard method of estimating query models for use with KL-Divergence. However, when dealing with natural language queries (e.g. A 44-year-old man with coffee-ground emesis, tachycardia, hypoxia, hypotension and cool, clammy extremities.) it is likely that some terms are generated by a background query language model. Therefore, we assume that long natural language queries are generated by drawing terms from a query model α q which consists of both a topical language model α qτ and a background query language model α qc. The topical query model describes the topical information that the user requires, 2 For the purposes of this paper, we refer to the unsmoothed model as the topic model of the document as it explains words not explained by the general background model.

3 Clinical Decision Support with the SPUD Language Model 3 while the background query model describes the syntactic glue of the general query language. Examples of fragments that can be explained by the background query language model are tokens such as I, am, looking, for,, and A, relevant, document, may, include, (a stereotypical TREC construct). Therefore, our new query model is defined as follows: α q = (1 λ q ) α qτ + (λ q ) α qc (3) where λ q is the probability mass of the background query language model. Although the background query language model is likely to contain some structural clues regarding relevance, we simple regard this model as generating noise tokens, and therefore aim to extract the topical part of each query. This can be achieved by determining using Bayes theorem the probability that a particular query term t was generated by the topical query model as follows: p(α qτ t) = (1 λ q ) p(t α qτ ) (1 λ q ) p(t α qτ ) + (λ q ) p(t α qc ) The final step involves determining the distribution of terms in the topical query model α qτ by normalising over the tokens in q as follows: c(t, q) p(α qτ t) p(t α qτ ) = t q (c(t, q) p(α qτ t )) which we call the discriminative query model (DQM). For the specific instantiation of the model using multivariate Pòlya distributions (α qτ ), the probability that a particular term t came from the topical part of the query model, when assuming that the background query model is the general collection, is as follows: p(α qτ t) = c(t, q) c(t, q) + (1 ωq) ω q df t m c q t df t (4) (5) m d (6) which can be plugged into Eq. 5 to yield the DQM using the multivariate Pòlya. The one free parameter in this specific query model is ω q which determines the belief in the background query model. We set this value to be the same as that from the document model such that ω = ω q. This model essentially introduces a type of idf into longer queries, as common terms occurring in the query will be more likely to come from the background model. As a result, these common terms are down-weighted in the query to varying degrees. 2.3 Psuedo-Relevance Feedback Models We adopt two types of pseudo-relevance feedback models. Firstly, we use the standard RM3 model [2] with our SPUD framework, where we assume the top F documents are relevant. Secondly, we use an approach that extracts the topical terms from the feedback documents in a similar manner to that outlined in the previous section. Given a set of feedback documents F, we then rank terms as follows:

4 4 Ronan Cummins p(θ Q t) = d F p(α τ t) p(q α d ) d F p(q α d ) (7) where p(q α d ) is the document score. Given the SPUD language model (Eq. 1) and its parameters estimates (Eq. 2), the probability that the term t was generated from the topical model α τ of a feedback-document can be calculated via Bayes theorem as follows: (1 ω) α τt p(α τ t) = (8) (1 ω) α τt + ω α ct where α τt and α ct are the parameters of t for the document topic model and background model respectively. A relatively simple intuition for this formula is that topical terms are those that are more likely generated from the topical part of a document than those that are generated by the background model. For all approaches we interpolate 30 feedback terms with the original query using linear-interpolation of System and Topics Our models were all implemented in the Lucene retrieval framework. We stemmed all text using Porter s stemmer and removed a small number of stopwords (the 26 stopwords from the Lucene EnglishAnalyzer). The topics in the CDS track are of three different types (diagnosis, test, and treatment) depending on what the specific task the clinician is involved in. An example topic is as follows: <topic number="1" type="diagnosis"> <description>a 44 yo male is brought to the emergency room after multiple bouts of vomiting that has a coffee ground appearance. His heart rate is 135 bpm and blood pressure is 70/40 mmhg. Physical exam findings include decreased mental status and cool extremities. He receives a rapid infusion of crystalloid solution followed by packed red blood cell transfusion and is admitted to the ICU for further care. </description> <summary>a 44-year-old man with coffee-ground emesis, tachycardia, hypoxia, hypotension and cool, clammy extremities. </summary> </topic> 4 Results This section presents the results of the six runs submitted to the CDS track (three for task A and three for task B). Table 1 shows details of the six runs submitted.

5 Clinical Decision Support with the SPUD Language Model 5 Table 1. Details of settings used for each run System Task Doc Model Query Model Feedback Model Topic Fields CAMspud1 A SPUD ω=0.9 DQM ω=0.9 RM3 λ=0.5, F = 5 type + summary CAMspud3 A SPUD ω=0.9 DQM ω=0.9 QTM λ=0.5, F = 5 summary CAMspud5 A SPUD ω=0.9 DQM ω=0.9 RM3 λ=0.5, F = 5 description CAMspud6 B SPUD ω=0.85 DQM ω=0.85 No Feedback diagnosis + summary CAMspud7 B SPUD ω=0.85 DQM ω=0.85 RM3 λ=0.5, F = 10 diagnosis + summary CAMspud8 B SPUD ω=0.85 DQM ω=0.85 QTM λ=0.5, F = 10 diagnosis + summary 4.1 Task A Table 2 shows the MAP, InfAP, and InfNDCG of the three runs submitted for task A. The rows labelled median and best show the median and best performance per topic averaged over all 30 topics for all of the runs in the track. It is worth noting that best does not represent one system, rather it indicates an upper-bound or oracle approach. All of our approaches performed above the median which is encouraging, with CAMspud1 being our best run for task A. Our best run is very close in performance to the best single run for task A (labelled top system). This approach used the type and summary fields with RM3 relevance feedback of 30 terms. Overall, our system was the third best for task A (out of 36 automatic systems). Table 2. MAP, InfAP, and InfNDCG over 30 Topics System MAP InfAP InfNDCG CAMspud CAMspud CAMspud top system median best Task B Table 3 shows the MAP, InfAP, and InfNDCG of the three runs submitted for task B (where a diagnosis field is included in the topics of type treatment and test). For task B we used both the diagnosis and summary field in all of our runs. Again all of our runs outperform the median run. CAMspud6 does not use pseudo-relevance query expansion and is the worst of the three runs. The only

6 6 Ronan Cummins difference between CAMspud7 and CAMspud8 is that the former uses RM3 pseudo-relevance expansion, while the latter uses the new query topic modelling approach. As these results are higher than in task A, it suggests that including the diagnosis field is useful. Overall, our system was the fifth best for task B (out of 26 automatic systems). Table 3. MAP, InfAP, and InfNDCG over 30 Topics System MAP InfAP InfNDCG CAMspud CAMspud CAMspud top system median best Summary We experimented with using the new SPUD language modelling approach in the CDS track. In general the approach is quite effective and is among the top five systems on both of the CDS tasks. In summary, we found query expansion to be beneficial, the summary field to be more effective than the description field, and we found that using the diagnosis field (when available) also leads to an improvement in the task. References 1. Ronan Cummins, Jiaul H. Paik, and Yuanhua Lv. A Pólya urn document language model for improved information retrieval. ACM Transactions of Informations Systems, 33(4):21, Victor Lavrenko and W. Bruce Croft. Relevance based language models. In Proceedings of the 24th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR 01, pages , New York, NY, USA, ACM. 3. S. E. Robertson and S. Walker. Some simple effective approximations to the 2- poisson model for probabilistic weighted retrieval. In Proceedings of the 17th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR 94, pages , New York, NY, USA, Springer-Verlag New York, Inc. 4. Karen Spärck-Jones. A statistical interpretation of term specificity and its application in retrieval. Journal of Documentation, 28:11 21, 1972.

TEMPER : A Temporal Relevance Feedback Method

TEMPER : A Temporal Relevance Feedback Method TEMPER : A Temporal Relevance Feedback Method Mostafa Keikha, Shima Gerani and Fabio Crestani {mostafa.keikha, shima.gerani, fabio.crestani}@usi.ch University of Lugano, Lugano, Switzerland Abstract. The

More information

UMass at TREC 2008 Blog Distillation Task

UMass at TREC 2008 Blog Distillation Task UMass at TREC 2008 Blog Distillation Task Jangwon Seo and W. Bruce Croft Center for Intelligent Information Retrieval University of Massachusetts, Amherst Abstract This paper presents the work done for

More information

How To Improve A Ccr Task

How To Improve A Ccr Task The University of Illinois Graduate School of Library and Information Science at TREC 2013 Miles Efron, Craig Willis, Peter Organisciak, Brian Balsamo, Ana Lucic Graduate School of Library and Information

More information

An Information Retrieval using weighted Index Terms in Natural Language document collections

An Information Retrieval using weighted Index Terms in Natural Language document collections Internet and Information Technology in Modern Organizations: Challenges & Answers 635 An Information Retrieval using weighted Index Terms in Natural Language document collections Ahmed A. A. Radwan, Minia

More information

Terrier: A High Performance and Scalable Information Retrieval Platform

Terrier: A High Performance and Scalable Information Retrieval Platform Terrier: A High Performance and Scalable Information Retrieval Platform Iadh Ounis, Gianni Amati, Vassilis Plachouras, Ben He, Craig Macdonald, Christina Lioma Department of Computing Science University

More information

Axiomatic Analysis and Optimization of Information Retrieval Models

Axiomatic Analysis and Optimization of Information Retrieval Models Axiomatic Analysis and Optimization of Information Retrieval Models ChengXiang Zhai Dept. of Computer Science University of Illinois at Urbana Champaign USA http://www.cs.illinois.edu/homes/czhai Hui Fang

More information

Relevance Feedback versus Local Context Analysis as Term Suggestion Devices: Rutgers TREC 8 Interactive Track Experience

Relevance Feedback versus Local Context Analysis as Term Suggestion Devices: Rutgers TREC 8 Interactive Track Experience Relevance Feedback versus Local Context Analysis as Term Suggestion Devices: Rutgers TREC 8 Interactive Track Experience Abstract N.J. Belkin, C. Cool*, J. Head, J. Jeng, D. Kelly, S. Lin, L. Lobash, S.Y.

More information

A survey on the use of relevance feedback for information access systems

A survey on the use of relevance feedback for information access systems A survey on the use of relevance feedback for information access systems Ian Ruthven Department of Computer and Information Sciences University of Strathclyde, Glasgow, G1 1XH. Ian.Ruthven@cis.strath.ac.uk

More information

Improving Web Page Retrieval using Search Context from Clicked Domain Names

Improving Web Page Retrieval using Search Context from Clicked Domain Names Improving Web Page Retrieval using Search Context from Clicked Domain Names Rongmei Li School of Electrical, Mathematics, and Computer Science University of Twente P.O.Box 217, 7500 AE, Enschede, the Netherlands

More information

Relevance Feedback between Web Search and the Semantic Web

Relevance Feedback between Web Search and the Semantic Web Proceedings of the Twenty-Second International Joint Conference on Artificial Intelligence Relevance Feedback between Web Search and the Semantic Web Harry Halpin and Victor Lavrenko School of Informatics

More information

UMass at TREC WEB 2014: Entity Query Feature Expansion using Knowledge Base Links

UMass at TREC WEB 2014: Entity Query Feature Expansion using Knowledge Base Links UMass at TREC WEB 2014: Entity Query Feature Expansion using Knowledge Base Links Laura Dietz and Patrick Verga University of Massachusetts Amherst, MA, U.S.A. {dietz,pat}@cs.umass.edu Abstract Entity

More information

Predicting Query Performance in Intranet Search

Predicting Query Performance in Intranet Search Predicting Query Performance in Intranet Search Craig Macdonald University of Glasgow Glasgow, G12 8QQ, U.K. craigm@dcs.gla.ac.uk Ben He University of Glasgow Glasgow, G12 8QQ, U.K. ben@dcs.gla.ac.uk Iadh

More information

Towards a Visually Enhanced Medical Search Engine

Towards a Visually Enhanced Medical Search Engine Towards a Visually Enhanced Medical Search Engine Lavish Lalwani 1,2, Guido Zuccon 1, Mohamed Sharaf 2, Anthony Nguyen 1 1 The Australian e-health Research Centre, Brisbane, Queensland, Australia; 2 The

More information

Incorporating Window-Based Passage-Level Evidence in Document Retrieval

Incorporating Window-Based Passage-Level Evidence in Document Retrieval Incorporating -Based Passage-Level Evidence in Document Retrieval Wensi Xi, Richard Xu-Rong, Christopher S.G. Khoo Center for Advanced Information Systems School of Applied Science Nanyang Technological

More information

Simple Language Models for Spam Detection

Simple Language Models for Spam Detection Simple Language Models for Spam Detection Egidio Terra Faculty of Informatics PUC/RS - Brazil Abstract For this year s Spam track we used classifiers based on language models. These models are used to

More information

The University of Lisbon at CLEF 2006 Ad-Hoc Task

The University of Lisbon at CLEF 2006 Ad-Hoc Task The University of Lisbon at CLEF 2006 Ad-Hoc Task Nuno Cardoso, Mário J. Silva and Bruno Martins Faculty of Sciences, University of Lisbon {ncardoso,mjs,bmartins}@xldb.di.fc.ul.pt Abstract This paper reports

More information

Search and Information Retrieval

Search and Information Retrieval Search and Information Retrieval Search on the Web 1 is a daily activity for many people throughout the world Search and communication are most popular uses of the computer Applications involving search

More information

University of Glasgow at TREC 2007: Experiments in Blog and Enterprise Tracks with Terrier

University of Glasgow at TREC 2007: Experiments in Blog and Enterprise Tracks with Terrier University of Glasgow at TREC 2007: Experiments in Blog and Enterprise Tracks with Terrier David Hannah, Craig Macdonald, Jie Peng, Ben He, Iadh Ounis Department of Computing Science University of Glasgow

More information

Blog feed search with a post index

Blog feed search with a post index DOI 10.1007/s10791-011-9165-9 Blog feed search with a post index Wouter Weerkamp Krisztian Balog Maarten de Rijke Received: 18 February 2010 / Accepted: 18 February 2011 Ó The Author(s) 2011. This article

More information

Can the Content of Public News be used to Forecast Abnormal Stock Market Behaviour?

Can the Content of Public News be used to Forecast Abnormal Stock Market Behaviour? Seventh IEEE International Conference on Data Mining Can the Content of Public News be used to Forecast Abnormal Stock Market Behaviour? Calum Robertson Information Research Group Queensland University

More information

Medical Information-Retrieval Systems. Dong Peng Medical Informatics Group

Medical Information-Retrieval Systems. Dong Peng Medical Informatics Group Medical Information-Retrieval Systems Dong Peng Medical Informatics Group Outline Evolution of medical Information-Retrieval (IR). The information retrieval process. The trend of medical information retrieval

More information

Approaches to Exploring Category Information for Question Retrieval in Community Question-Answer Archives

Approaches to Exploring Category Information for Question Retrieval in Community Question-Answer Archives Approaches to Exploring Category Information for Question Retrieval in Community Question-Answer Archives 7 XIN CAO and GAO CONG, Nanyang Technological University BIN CUI, Peking University CHRISTIAN S.

More information

Query difficulty, robustness and selective application of query expansion

Query difficulty, robustness and selective application of query expansion Query difficulty, robustness and selective application of query expansion Giambattista Amati, Claudio Carpineto, and Giovanni Romano Fondazione Ugo Bordoni Rome Italy gba, carpinet, romano@fub.it Abstract.

More information

Search Taxonomy. Web Search. Search Engine Optimization. Information Retrieval

Search Taxonomy. Web Search. Search Engine Optimization. Information Retrieval Information Retrieval INFO 4300 / CS 4300! Retrieval models Older models» Boolean retrieval» Vector Space model Probabilistic Models» BM25» Language models Web search» Learning to Rank Search Taxonomy!

More information

Homework 2. Page 154: Exercise 8.10. Page 145: Exercise 8.3 Page 150: Exercise 8.9

Homework 2. Page 154: Exercise 8.10. Page 145: Exercise 8.3 Page 150: Exercise 8.9 Homework 2 Page 110: Exercise 6.10; Exercise 6.12 Page 116: Exercise 6.15; Exercise 6.17 Page 121: Exercise 6.19 Page 122: Exercise 6.20; Exercise 6.23; Exercise 6.24 Page 131: Exercise 7.3; Exercise 7.5;

More information

Retrieving Medical Literature for Clinical Decision Support

Retrieving Medical Literature for Clinical Decision Support Retrieving Medical Literature for Clinical Decision Support Luca Soldaini, Arman Cohan, Andrew Yates, Nazli Goharian, and Ophir Frieder Information Retrieval Lab, Georgetown University {luca, arman, andrew,

More information

Search Engines. Stephen Shaw <stesh@netsoc.tcd.ie> 18th of February, 2014. Netsoc

Search Engines. Stephen Shaw <stesh@netsoc.tcd.ie> 18th of February, 2014. Netsoc Search Engines Stephen Shaw Netsoc 18th of February, 2014 Me M.Sc. Artificial Intelligence, University of Edinburgh Would recommend B.A. (Mod.) Computer Science, Linguistics, French,

More information

A Hybrid Method for Opinion Finding Task (KUNLP at TREC 2008 Blog Track)

A Hybrid Method for Opinion Finding Task (KUNLP at TREC 2008 Blog Track) A Hybrid Method for Opinion Finding Task (KUNLP at TREC 2008 Blog Track) Linh Hoang, Seung-Wook Lee, Gumwon Hong, Joo-Young Lee, Hae-Chang Rim Natural Language Processing Lab., Korea University (linh,swlee,gwhong,jylee,rim)@nlp.korea.ac.kr

More information

Comparison of Standard and Zipf-Based Document Retrieval Heuristics

Comparison of Standard and Zipf-Based Document Retrieval Heuristics Comparison of Standard and Zipf-Based Document Retrieval Heuristics Benjamin Hoffmann Universität Stuttgart, Institut für Formale Methoden der Informatik Universitätsstr. 38, D-70569 Stuttgart, Germany

More information

Dynamical Clustering of Personalized Web Search Results

Dynamical Clustering of Personalized Web Search Results Dynamical Clustering of Personalized Web Search Results Xuehua Shen CS Dept, UIUC xshen@cs.uiuc.edu Hong Cheng CS Dept, UIUC hcheng3@uiuc.edu Abstract Most current search engines present the user a ranked

More information

SIGIR 2004 Workshop: RIA and "Where can IR go from here?"

SIGIR 2004 Workshop: RIA and Where can IR go from here? SIGIR 2004 Workshop: RIA and "Where can IR go from here?" Donna Harman National Institute of Standards and Technology Gaithersburg, Maryland, 20899 donna.harman@nist.gov Chris Buckley Sabir Research, Inc.

More information

Basic Statistics and Data Analysis for Health Researchers from Foreign Countries

Basic Statistics and Data Analysis for Health Researchers from Foreign Countries Basic Statistics and Data Analysis for Health Researchers from Foreign Countries Volkert Siersma siersma@sund.ku.dk The Research Unit for General Practice in Copenhagen Dias 1 Content Quantifying association

More information

Supporting Keyword Search in Product Database: A Probabilistic Approach

Supporting Keyword Search in Product Database: A Probabilistic Approach Supporting Keyword Search in Product Database: A Probabilistic Approach Huizhong Duan 1, ChengXiang Zhai 2, Jinxing Cheng 3, Abhishek Gattani 4 University of Illinois at Urbana-Champaign 1,2 Walmart Labs

More information

Eng. Mohammed Abdualal

Eng. Mohammed Abdualal Islamic University of Gaza Faculty of Engineering Computer Engineering Department Information Storage and Retrieval (ECOM 5124) IR HW 5+6 Scoring, term weighting and the vector space model Exercise 6.2

More information

Data Mining Yelp Data - Predicting rating stars from review text

Data Mining Yelp Data - Predicting rating stars from review text Data Mining Yelp Data - Predicting rating stars from review text Rakesh Chada Stony Brook University rchada@cs.stonybrook.edu Chetan Naik Stony Brook University cnaik@cs.stonybrook.edu ABSTRACT The majority

More information

Modern Information Retrieval: A Brief Overview

Modern Information Retrieval: A Brief Overview Modern Information Retrieval: A Brief Overview Amit Singhal Google, Inc. singhal@google.com Abstract For thousands of years people have realized the importance of archiving and finding information. With

More information

Using Contextual Information to Improve Search in Email Archives

Using Contextual Information to Improve Search in Email Archives Using Contextual Information to Improve Search in Email Archives Wouter Weerkamp, Krisztian Balog, and Maarten de Rijke ISLA, University of Amsterdam, Kruislaan 43, 198 SJ Amsterdam, The Netherlands w.weerkamp@uva.nl,

More information

Early Warning Scores (EWS) Clinical Sessions 2011 By Bhavin Doshi

Early Warning Scores (EWS) Clinical Sessions 2011 By Bhavin Doshi Early Warning Scores (EWS) Clinical Sessions 2011 By Bhavin Doshi What is EWS? After qualifying, junior doctors are expected to distinguish between the moderately sick patients who can be managed in the

More information

Using Discharge Summaries to Improve Information Retrieval in Clinical Domain

Using Discharge Summaries to Improve Information Retrieval in Clinical Domain Using Discharge Summaries to Improve Information Retrieval in Clinical Domain Dongqing Zhu 1, Wu Stephen 2, Masanz James 2, Ben Carterette 1, and Hongfang Liu 2 1 University of Delaware, 101 Smith Hall,

More information

Bagged Ensemble Classifiers for Sentiment Classification of Movie Reviews

Bagged Ensemble Classifiers for Sentiment Classification of Movie Reviews www.ijecs.in International Journal Of Engineering And Computer Science ISSN:2319-7242 Volume 3 Issue 2 February, 2014 Page No. 3951-3961 Bagged Ensemble Classifiers for Sentiment Classification of Movie

More information

Gender-based Models of Location from Flickr

Gender-based Models of Location from Flickr Gender-based Models of Location from Flickr Neil O Hare Yahoo! Research, Barcelona, Spain nohare@yahoo-inc.com Vanessa Murdock Microsoft vanessa.murdock@yahoo.com ABSTRACT Geo-tagged content from social

More information

Atigeo at TREC 2014 Clinical Decision Support Task

Atigeo at TREC 2014 Clinical Decision Support Task Atigeo at TREC 2014 Clinical Decision Support Task Yishul Wei*, ChenChieh Hsu*, Alex Thomas, Joseph F. McCarthy Atigeo, LLC 800 Bellevue Way NE, Suite 600 Bellevue, WA 98004 USA joe.mccarthy@atigeo.com

More information

Mining Text Data: An Introduction

Mining Text Data: An Introduction Bölüm 10. Metin ve WEB Madenciliği http://ceng.gazi.edu.tr/~ozdemir Mining Text Data: An Introduction Data Mining / Knowledge Discovery Structured Data Multimedia Free Text Hypertext HomeLoan ( Frank Rizzo

More information

Emoticon Smoothed Language Models for Twitter Sentiment Analysis

Emoticon Smoothed Language Models for Twitter Sentiment Analysis Proceedings of the Twenty-Sixth AAAI Conference on Artificial Intelligence Emoticon Smoothed Language Models for Twitter Sentiment Analysis Kun-Lin Liu, Wu-Jun Li, Minyi Guo Shanghai Key Laboratory of

More information

Topic models for Sentiment analysis: A Literature Survey

Topic models for Sentiment analysis: A Literature Survey Topic models for Sentiment analysis: A Literature Survey Nikhilkumar Jadhav 123050033 June 26, 2014 In this report, we present the work done so far in the field of sentiment analysis using topic models.

More information

Using Transactional Data From ERP Systems for Expert Finding

Using Transactional Data From ERP Systems for Expert Finding Using Transactional Data from ERP Systems for Expert Finding Lars K. Schunk 1 and Gao Cong 2 1 Dynaway A/S, Alfred Nobels Vej 21E, 9220 Aalborg Øst, Denmark 2 School of Computer Engineering, Nanyang Technological

More information

A Generalized Framework of Exploring Category Information for Question Retrieval in Community Question Answer Archives

A Generalized Framework of Exploring Category Information for Question Retrieval in Community Question Answer Archives A Generalized Framework of Exploring Category Information for Question Retrieval in Community Question Answer Archives Xin Cao 1, Gao Cong 1, 2, Bin Cui 3, Christian S. Jensen 1 1 Department of Computer

More information

PREDICTING MARKET VOLATILITY FEDERAL RESERVE BOARD MEETING MINUTES FROM

PREDICTING MARKET VOLATILITY FEDERAL RESERVE BOARD MEETING MINUTES FROM PREDICTING MARKET VOLATILITY FROM FEDERAL RESERVE BOARD MEETING MINUTES Reza Bosagh Zadeh and Andreas Zollmann Lab Advisers: Noah Smith and Bryan Routledge GOALS Make Money! Not really. Find interesting

More information

Search and Data Mining: Techniques. Text Mining Anya Yarygina Boris Novikov

Search and Data Mining: Techniques. Text Mining Anya Yarygina Boris Novikov Search and Data Mining: Techniques Text Mining Anya Yarygina Boris Novikov Introduction Generally used to denote any system that analyzes large quantities of natural language text and detects lexical or

More information

Information Retrieval Models

Information Retrieval Models Information Retrieval Models Djoerd Hiemstra University of Twente 1 Introduction author version Many applications that handle information on the internet would be completely inadequate without the support

More information

Learning outcomes. Knowledge and understanding. Competence and skills

Learning outcomes. Knowledge and understanding. Competence and skills Syllabus Master s Programme in Statistics and Data Mining 120 ECTS Credits Aim The rapid growth of databases provides scientists and business people with vast new resources. This programme meets the challenges

More information

Latent Semantic Indexing with Selective Query Expansion Abstract Introduction

Latent Semantic Indexing with Selective Query Expansion Abstract Introduction Latent Semantic Indexing with Selective Query Expansion Andy Garron April Kontostathis Department of Mathematics and Computer Science Ursinus College Collegeville PA 19426 Abstract This article describes

More information

Aggregating Evidence from Hospital Departments to Improve Medical Records Search

Aggregating Evidence from Hospital Departments to Improve Medical Records Search Aggregating Evidence from Hospital Departments to Improve Medical Records Search NutLimsopatham 1,CraigMacdonald 2,andIadhOunis 2 School of Computing Science University of Glasgow G12 8QQ, Glasgow, UK

More information

Information Retrieval Elasticsearch

Information Retrieval Elasticsearch Information Retrieval Elasticsearch IR Information retrieval (IR) is the activity of obtaining information resources relevant to an information need from a collection of information resources. Searches

More information

VCU-TSA at Semeval-2016 Task 4: Sentiment Analysis in Twitter

VCU-TSA at Semeval-2016 Task 4: Sentiment Analysis in Twitter VCU-TSA at Semeval-2016 Task 4: Sentiment Analysis in Twitter Gerard Briones and Kasun Amarasinghe and Bridget T. McInnes, PhD. Department of Computer Science Virginia Commonwealth University Richmond,

More information

Content visualization of scientific corpora using an extensible relational database implementation

Content visualization of scientific corpora using an extensible relational database implementation . Content visualization of scientific corpora using an extensible relational database implementation Eleftherios Stamatogiannakis, Ioannis Foufoulas, Theodoros Giannakopoulos, Harry Dimitropoulos, Natalia

More information

Improving Student Question Classification

Improving Student Question Classification Improving Student Question Classification Cecily Heiner and Joseph L. Zachary {cecily,zachary}@cs.utah.edu School of Computing, University of Utah Abstract. Students in introductory programming classes

More information

Business Statistics. Successful completion of Introductory and/or Intermediate Algebra courses is recommended before taking Business Statistics.

Business Statistics. Successful completion of Introductory and/or Intermediate Algebra courses is recommended before taking Business Statistics. Business Course Text Bowerman, Bruce L., Richard T. O'Connell, J. B. Orris, and Dawn C. Porter. Essentials of Business, 2nd edition, McGraw-Hill/Irwin, 2008, ISBN: 978-0-07-331988-9. Required Computing

More information

Efficient Recruitment and User-System Performance

Efficient Recruitment and User-System Performance Monitoring User-System Performance in Interactive Retrieval Tasks Liudmila Boldareva Arjen P. de Vries Djoerd Hiemstra University of Twente CWI Dept. of Computer Science INS1 PO Box 217 PO Box 94079 7500

More information

Improving Contextual Suggestions using Open Web Domain Knowledge

Improving Contextual Suggestions using Open Web Domain Knowledge Improving Contextual Suggestions using Open Web Domain Knowledge Thaer Samar, 1 Alejandro Bellogín, 2 and Arjen de Vries 1 1 Centrum Wiskunde & Informatica, Amsterdam, The Netherlands 2 Universidad Autónoma

More information

Optimization of Internet Search based on Noun Phrases and Clustering Techniques

Optimization of Internet Search based on Noun Phrases and Clustering Techniques Optimization of Internet Search based on Noun Phrases and Clustering Techniques R. Subhashini Research Scholar, Sathyabama University, Chennai-119, India V. Jawahar Senthil Kumar Assistant Professor, Anna

More information

Query term suggestion in academic search

Query term suggestion in academic search Query term suggestion in academic search Suzan Verberne 1, Maya Sappelli 1,2, and Wessel Kraaij 2,1 1. Institute for Computing and Information Sciences, Radboud University Nijmegen 2. TNO, Delft Abstract.

More information

Information Retrieval Systems in XML Based Database A review

Information Retrieval Systems in XML Based Database A review Information Retrieval Systems in XML Based Database A review Preeti Pandey 1, L.S.Maurya 2 Research Scholar, IT Department, SRMSCET, Bareilly, India 1 Associate Professor, IT Department, SRMSCET, Bareilly,

More information

Retrieval and Feedback Models for Blog Feed Search

Retrieval and Feedback Models for Blog Feed Search Retrieval and Feedback Models for Blog Feed Search Jonathan L. Elsas, Jaime Arguello, Jamie Callan and Jaime G. Carbonell Language Technologies Institute School of Computer Science Carnegie Mellon University

More information

Combining Global and Personal Anti-Spam Filtering

Combining Global and Personal Anti-Spam Filtering Combining Global and Personal Anti-Spam Filtering Richard Segal IBM Research Hawthorne, NY 10532 Abstract Many of the first successful applications of statistical learning to anti-spam filtering were personalized

More information

W. Heath Rushing Adsurgo LLC. Harness the Power of Text Analytics: Unstructured Data Analysis for Healthcare. Session H-1 JTCC: October 23, 2015

W. Heath Rushing Adsurgo LLC. Harness the Power of Text Analytics: Unstructured Data Analysis for Healthcare. Session H-1 JTCC: October 23, 2015 W. Heath Rushing Adsurgo LLC Harness the Power of Text Analytics: Unstructured Data Analysis for Healthcare Session H-1 JTCC: October 23, 2015 Outline Demonstration: Recent article on cnn.com Introduction

More information

Prediction of Heart Disease Using Naïve Bayes Algorithm

Prediction of Heart Disease Using Naïve Bayes Algorithm Prediction of Heart Disease Using Naïve Bayes Algorithm R.Karthiyayini 1, S.Chithaara 2 Assistant Professor, Department of computer Applications, Anna University, BIT campus, Tiruchirapalli, Tamilnadu,

More information

1 o Semestre 2007/2008

1 o Semestre 2007/2008 Departamento de Engenharia Informática Instituto Superior Técnico 1 o Semestre 2007/2008 Outline 1 2 3 4 5 Outline 1 2 3 4 5 Exploiting Text How is text exploited? Two main directions Extraction Extraction

More information

Huy Nguyen. Department of Electrical Engineering and Computer Science May 22, 2009 Certified by / VI-

Huy Nguyen. Department of Electrical Engineering and Computer Science May 22, 2009 Certified by / VI- Improving Search Quality of the Google Search Appliance by Huy Nguyen Submitted to the Department of Electrical Engineering and Computer Science in partial fulfillment of the requirements for the degree

More information

Building a Question Classifier for a TREC-Style Question Answering System

Building a Question Classifier for a TREC-Style Question Answering System Building a Question Classifier for a TREC-Style Question Answering System Richard May & Ari Steinberg Topic: Question Classification We define Question Classification (QC) here to be the task that, given

More information

CINDOR Conceptual Interlingua Document Retrieval: TREC-8 Evaluation.

CINDOR Conceptual Interlingua Document Retrieval: TREC-8 Evaluation. CINDOR Conceptual Interlingua Document Retrieval: TREC-8 Evaluation. Miguel Ruiz, Anne Diekema, Páraic Sheridan MNIS-TextWise Labs Dey Centennial Plaza 401 South Salina Street Syracuse, NY 13202 Abstract:

More information

Ranked Keyword Search in Cloud Computing: An Innovative Approach

Ranked Keyword Search in Cloud Computing: An Innovative Approach International Journal of Computational Engineering Research Vol, 03 Issue, 6 Ranked Keyword Search in Cloud Computing: An Innovative Approach 1, Vimmi Makkar 2, Sandeep Dalal 1, (M.Tech) 2,(Assistant professor)

More information

An Exploration of Ranking Heuristics in Mobile Local Search

An Exploration of Ranking Heuristics in Mobile Local Search An Exploration of Ranking Heuristics in Mobile Local Search ABSTRACT Yuanhua Lv Department of Computer Science University of Illinois at Urbana-Champaign Urbana, IL 61801, USA ylv2@uiuc.edu Users increasingly

More information

Christfried Webers. Canberra February June 2015

Christfried Webers. Canberra February June 2015 c Statistical Group and College of Engineering and Computer Science Canberra February June (Many figures from C. M. Bishop, "Pattern Recognition and ") 1of 829 c Part VIII Linear Classification 2 Logistic

More information

Ranking on Data Manifolds

Ranking on Data Manifolds Ranking on Data Manifolds Dengyong Zhou, Jason Weston, Arthur Gretton, Olivier Bousquet, and Bernhard Schölkopf Max Planck Institute for Biological Cybernetics, 72076 Tuebingen, Germany {firstname.secondname

More information

THUTR: A Translation Retrieval System

THUTR: A Translation Retrieval System THUTR: A Translation Retrieval System Chunyang Liu, Qi Liu, Yang Liu, and Maosong Sun Department of Computer Science and Technology State Key Lab on Intelligent Technology and Systems National Lab for

More information

A Statistical Text Mining Method for Patent Analysis

A Statistical Text Mining Method for Patent Analysis A Statistical Text Mining Method for Patent Analysis Department of Statistics Cheongju University, shjun@cju.ac.kr Abstract Most text data from diverse document databases are unsuitable for analytical

More information

Intelligent Search for Answering Clinical Questions Coronado Group, Ltd. Innovation Initiatives

Intelligent Search for Answering Clinical Questions Coronado Group, Ltd. Innovation Initiatives Intelligent Search for Answering Clinical Questions Coronado Group, Ltd. Innovation Initiatives Search The Way You Think Copyright 2009 Coronado, Ltd. All rights reserved. All other product names and logos

More information

Dublin City University at CLEF 2004: Experiments with the ImageCLEF St Andrew s Collection

Dublin City University at CLEF 2004: Experiments with the ImageCLEF St Andrew s Collection Dublin City University at CLEF 2004: Experiments with the ImageCLEF St Andrew s Collection Gareth J. F. Jones, Declan Groves, Anna Khasin, Adenike Lam-Adesina, Bart Mellebeek. Andy Way School of Computing,

More information

Information Retrieval System Assigning Context to Documents by Relevance Feedback

Information Retrieval System Assigning Context to Documents by Relevance Feedback Information Retrieval System Assigning Context to Documents by Relevance Feedback Narina Thakur Department of CSE Bharati Vidyapeeth College Of Engineering New Delhi, India Deepti Mehrotra ASCS Amity University,

More information

FINANCIAL SERVICES BOARD INSURANCE DEPARTMENT

FINANCIAL SERVICES BOARD INSURANCE DEPARTMENT FINANCIAL SERVICES BOARD INSURANCE DEPARTMENT SPECIAL REPORT ON THE RESULTS OF THE LONG-TERM INSURANCE INDUSTRY FOR THE PERIOD ENDED MARCH 6 June 1 SPECIAL REPORT ON THE RESULTS OF THE LONG-TERM INSURANCE

More information

A Study of Using an Out-Of-Box Commercial MT System for Query Translation in CLIR

A Study of Using an Out-Of-Box Commercial MT System for Query Translation in CLIR A Study of Using an Out-Of-Box Commercial MT System for Query Translation in CLIR Dan Wu School of Information Management Wuhan University, Hubei, China woodan@whu.edu.cn Daqing He School of Information

More information

Direct Marketing Profit Model. Bruce Lund, Marketing Associates, Detroit, Michigan and Wilmington, Delaware

Direct Marketing Profit Model. Bruce Lund, Marketing Associates, Detroit, Michigan and Wilmington, Delaware Paper CI-04 Direct Marketing Profit Model Bruce Lund, Marketing Associates, Detroit, Michigan and Wilmington, Delaware ABSTRACT A net lift model gives the expected incrementality (the incremental rate)

More information

HYPOTHESIS TESTING (ONE SAMPLE) - CHAPTER 7 1. used confidence intervals to answer questions such as...

HYPOTHESIS TESTING (ONE SAMPLE) - CHAPTER 7 1. used confidence intervals to answer questions such as... HYPOTHESIS TESTING (ONE SAMPLE) - CHAPTER 7 1 PREVIOUSLY used confidence intervals to answer questions such as... You know that 0.25% of women have red/green color blindness. You conduct a study of men

More information

Mining Semi-Structured Online Knowledge Bases to Answer Natural Language Questions on Community QA Websites

Mining Semi-Structured Online Knowledge Bases to Answer Natural Language Questions on Community QA Websites Mining Semi-Structured Online Knowledge Bases to Answer Natural Language Questions on Community QA Websites ABSTRACT Parikshit Sondhi Department of Computer Science University of Illinois at Urbana-Champaign

More information

Cengage Learning at TREC 2011 Medical Track

Cengage Learning at TREC 2011 Medical Track Cengage Learning at TREC 2011 Medical Track Benjamin King University of Michigan benjaminking@umich.edu Lijun Wang Wayne State University ljwang@wayne.edu Ivan Provalov Cengage Learning ivan.provalov@cengage.com

More information

Query Modification through External Sources to Support Clinical Decisions

Query Modification through External Sources to Support Clinical Decisions Query Modification through External Sources to Support Clinical Decisions Raymond Wan 1, Jannifer Hiu-Kwan Man 2, and Ting-Fung Chan 1 1 School of Life Sciences and the State Key Laboratory of Agrobiotechnology,

More information

University of Chicago at NTCIR4 CLIR: Multi-Scale Query Expansion

University of Chicago at NTCIR4 CLIR: Multi-Scale Query Expansion University of Chicago at NTCIR4 CLIR: Multi-Scale Query Expansion Gina-Anne Levow University of Chicago 1100 E. 58th St, Chicago, IL 60637, USA levow@cs.uchicago.edu Abstract Pseudo-relevance feedback,

More information

Procedia Computer Science 00 (2012) 1 21. Trieu Minh Nhut Le, Jinli Cao, and Zhen He. trieule@sgu.edu.vn, j.cao@latrobe.edu.au, z.he@latrobe.edu.

Procedia Computer Science 00 (2012) 1 21. Trieu Minh Nhut Le, Jinli Cao, and Zhen He. trieule@sgu.edu.vn, j.cao@latrobe.edu.au, z.he@latrobe.edu. Procedia Computer Science 00 (2012) 1 21 Procedia Computer Science Top-k best probability queries and semantics ranking properties on probabilistic databases Trieu Minh Nhut Le, Jinli Cao, and Zhen He

More information

Statistics Graduate Courses

Statistics Graduate Courses Statistics Graduate Courses STAT 7002--Topics in Statistics-Biological/Physical/Mathematics (cr.arr.).organized study of selected topics. Subjects and earnable credit may vary from semester to semester.

More information

Ranking using Multiple Document Types in Desktop Search

Ranking using Multiple Document Types in Desktop Search Ranking using Multiple Document Types in Desktop Search Jinyoung Kim and W. Bruce Croft Center for Intelligent Information Retrieval Department of Computer Science University of Massachusetts Amherst {jykim,croft}@cs.umass.edu

More information

MIRACLE at VideoCLEF 2008: Classification of Multilingual Speech Transcripts

MIRACLE at VideoCLEF 2008: Classification of Multilingual Speech Transcripts MIRACLE at VideoCLEF 2008: Classification of Multilingual Speech Transcripts Julio Villena-Román 1,3, Sara Lana-Serrano 2,3 1 Universidad Carlos III de Madrid 2 Universidad Politécnica de Madrid 3 DAEDALUS

More information

Efficient Cluster Detection and Network Marketing in a Nautural Environment

Efficient Cluster Detection and Network Marketing in a Nautural Environment A Probabilistic Model for Online Document Clustering with Application to Novelty Detection Jian Zhang School of Computer Science Cargenie Mellon University Pittsburgh, PA 15213 jian.zhang@cs.cmu.edu Zoubin

More information

Learning to Expand Queries Using Entities

Learning to Expand Queries Using Entities This is a preprint of the article Brandão, W. C., Santos, R. L. T., Ziviani, N., de Moura, E. S. and da Silva, A. S. (2014), Learning to expand queries using entities. Journal of the Association for Information

More information

Comparing Tag Clouds, Term Histograms, and Term Lists for Enhancing Personalized Web Search

Comparing Tag Clouds, Term Histograms, and Term Lists for Enhancing Personalized Web Search Comparing Tag Clouds, Term Histograms, and Term Lists for Enhancing Personalized Web Search Orland Hoeber and Hanze Liu Department of Computer Science, Memorial University St. John s, NL, Canada A1B 3X5

More information

International Journal of Scientific & Engineering Research, Volume 4, Issue 11, November-2013 5 ISSN 2229-5518

International Journal of Scientific & Engineering Research, Volume 4, Issue 11, November-2013 5 ISSN 2229-5518 International Journal of Scientific & Engineering Research, Volume 4, Issue 11, November-2013 5 INTELLIGENT MULTIDIMENSIONAL DATABASE INTERFACE Mona Gharib Mohamed Reda Zahraa E. Mohamed Faculty of Science,

More information

Chapter 2 The Information Retrieval Process

Chapter 2 The Information Retrieval Process Chapter 2 The Information Retrieval Process Abstract What does an information retrieval system look like from a bird s eye perspective? How can a set of documents be processed by a system to make sense

More information

Health Service Circular

Health Service Circular Health Service Circular Series Number: HSC 2002/009 Issue Date: 04 July 2002 Review Date: 04 July 2005 Category: Public Health Status: Action sets out a specific action on the part of the recipient with

More information

Blocking Blog Spam with Language Model Disagreement

Blocking Blog Spam with Language Model Disagreement Blocking Blog Spam with Language Model Disagreement Gilad Mishne Informatics Institute, University of Amsterdam Kruislaan 403, 1098SJ Amsterdam The Netherlands gilad@science.uva.nl David Carmel, Ronny

More information