Text-based (image) retrieval

Henning Müller
HES-SO//Valais, Sierre, Switzerland

Overview
- Difference of words and features
- Weightings instead of distance measures
- Stemming and pre-treatment
- Approaches for multilingual retrieval
- Tools available on the web
  - Lucene, ...

Text retrieval (of images)
- Started in the early 1960s (for images: 1970s)
- Not the main focus of this talk
- Text retrieval is old!
  - Many techniques in image retrieval are taken from this domain (sometimes reinvented)
- It becomes clear that the combination of visual and textual retrieval has the biggest potential
- Good text retrieval engines exist as open source

Problems with annotation (of images)
- Many things are hard to express
  - Feelings, situations, ... (what is scary?)
  - What is in the image, what is it about, what does it evoke?
- Annotation is never complete
  - Plus it depends on the goal of the annotation
- Many ways to say the same thing
  - Synonyms, hyponyms, hypernyms, ...
- Mistakes
  - Spelling errors, spelling differences (US vs. UK), weird abbreviations (particularly medical)

Basics in text retrieval
- Started with Boolean search of words in text
  - In combination with AND, OR, NOT
  - No ranking, rather a finite list of matching documents
- Vector space model to have a distance between search terms and documents
  - Each occurring word is a dimension; its difference in frequency can be measured
  - Overall frequency of words as importance for the axis
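The two models on this slide can be contrasted in a minimal sketch. Everything here is an assumption made for illustration: the toy captions in `DOCS`, the whitespace tokenizer, and the function names. Raw term counts are used as vector components, without any weighting.

```python
import math
from collections import Counter

# Toy collection; the captions are invented for illustration only.
DOCS = {
    "d1": "chest x-ray of the lung",
    "d2": "x-ray of a broken hand",
    "d3": "photo of a lung model",
}

def tokens(text):
    """Lowercase and split on whitespace (a deliberately naive tokenizer)."""
    return text.lower().split()

def boolean_search(must, must_not=()):
    """Boolean AND / NOT: return the unranked set of matching document ids."""
    hits = set()
    for doc_id, text in DOCS.items():
        words = set(tokens(text))
        if all(w in words for w in must) and not any(w in words for w in must_not):
            hits.add(doc_id)
    return hits

def cosine(a, b):
    """Cosine similarity between two texts, each word being one dimension."""
    va, vb = Counter(tokens(a)), Counter(tokens(b))
    dot = sum(va[w] * vb[w] for w in va)
    norm = math.sqrt(sum(c * c for c in va.values())) * math.sqrt(
        sum(c * c for c in vb.values())
    )
    return dot / norm if norm else 0.0

def rank(query):
    """Vector space model: order all documents by similarity to the query."""
    return sorted(DOCS, key=lambda d: cosine(query, DOCS[d]), reverse=True)
```

The Boolean search returns a set without any order, whereas `rank` orders every document, which is the practical difference between the two approaches.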

[Figure: Zipf distribution (Wikipedia example). X axis: rank of the word; Y axis: number of occurrences of the word.]

Principal ideas used in text IR
- Words basically follow a Zipf distribution
- Tf/idf weightings
  - A word frequent in a document describes it well
  - A word rare in a collection has a high discriminative power
  - Many variations of tf/idf (see also Salton/Buckley paper)
- Use of inverted files for quick query responses
- Relevance feedback, query expansion, ...
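A small sketch of how an inverted file and a tf/idf weighting can work together. The toy documents and all function names are assumptions, and the weighting shown (raw term frequency times log(N/df)) is only one of the many variants mentioned above.

```python
import math
from collections import Counter, defaultdict

# Toy collection, invented for illustration only.
DOCS = {
    "d1": "lung x-ray lung nodule",
    "d2": "hand x-ray fracture",
    "d3": "brain mri tumor",
}

def build_inverted_file(docs):
    """Map each word to {doc_id: term frequency in that document}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for word, tf in Counter(text.lower().split()).items():
            index[word][doc_id] = tf
    return index

def idf(word, index, n_docs):
    """Inverse document frequency: rare words get a high weight."""
    df = len(index.get(word, {}))
    return math.log(n_docs / df) if df else 0.0

def search(query, index, n_docs):
    """Score only the documents listed in the postings of the query words."""
    scores = defaultdict(float)
    for word in query.lower().split():
        weight = idf(word, index, n_docs)
        for doc_id, tf in index.get(word, {}).items():
            scores[doc_id] += tf * weight
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

INDEX = build_inverted_file(DOCS)
RESULTS = search("lung x-ray", INDEX, len(DOCS))
```

Only documents that share at least one word with the query are touched, which is why inverted files give quick responses even on large collections.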

Techniques used in text retrieval
- Bag-of-words approach
  - Or N-grams can be used
- Stop words can be removed
- Stemming can improve results
- Named entity recognition
- Spelling correction (also umlauts, accents, ...)
  - Google had a big success with this
- Mapping of text to a controlled vocabulary/ontology
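Three of these techniques are easy to illustrate with the standard library: a bag of words, N-grams, and the folding of umlauts and accents. The function names are assumptions, and accent folding is only a small part of what real spelling correction does.

```python
import unicodedata
from collections import Counter

def bag_of_words(text):
    """Word counts; word order is discarded."""
    return Counter(text.lower().split())

def word_ngrams(text, n):
    """Sequences of n consecutive words, which keep some word order."""
    words = text.lower().split()
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]

def char_ngrams(word, n):
    """Sequences of n consecutive characters, tolerant to small misspellings."""
    return [word[i:i + n] for i in range(len(word) - n + 1)]

def fold_accents(text):
    """Decompose characters and drop the combining marks (ü -> u, é -> e)."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(c for c in decomposed if not unicodedata.combining(c))
```

Applying `fold_accents` to both documents and queries lets "Muller" match "Müller" at the cost of merging a few distinct words.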

Stop word removal
- Very frequent words contain little information and can be removed
  - Automatically in Google et al.
- These words depend on the language
  - Stop word lists exist in many languages
- Often 40-50% of texts
- Stop word lists also contain less frequent words that carry no information
- Or simply remove words above a certain frequency
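Both options on this slide, a fixed list and a frequency cut-off, can be sketched in a few lines. The tiny `STOP_WORDS` list, the threshold parameter, and the function names are assumptions for illustration; real lists are much longer and language specific.

```python
from collections import Counter

# Tiny illustrative English list; real stop word lists are far longer.
STOP_WORDS = {"the", "of", "a", "in", "and"}

def remove_stop_words(words, stop_words=STOP_WORDS):
    """Drop every word that appears in the stop word list."""
    return [w for w in words if w not in stop_words]

def frequent_words(docs, max_doc_ratio):
    """Alternative without a list: words found in more than
    `max_doc_ratio` of all documents are treated as stop words."""
    df = Counter()
    for text in docs:
        df.update(set(text.lower().split()))
    return {w for w, n in df.items() if n / len(docs) > max_doc_ratio}
```

The frequency-based variant needs no language resources, but it depends on the collection: a word such as "x-ray" would be removed from a purely radiological collection.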

Stemming - conflation
- Strongly dependent on the language
- Basically suffix stripping based on a set of rules
  - Cats, catty, catlike = cat as root or stem
- Can also create errors or slightly change meaning (errors often reported around 5%)
- The Porter stemmer for English is one of the best-known algorithms, with a free implementation
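Rule-based suffix stripping can be sketched as follows. This is not the Porter algorithm: the three rules and the minimum stem length are assumptions chosen only to reproduce the cat example from the slide.

```python
# Illustrative rules, checked in order; NOT the Porter rule set.
SUFFIX_RULES = ["like", "ty", "s"]
MIN_STEM_LENGTH = 3  # assumed guard against stripping very short words

def stem(word):
    """Strip the first matching suffix if enough of the word remains."""
    word = word.lower()
    for suffix in SUFFIX_RULES:
        if word.endswith(suffix) and len(word) - len(suffix) >= MIN_STEM_LENGTH:
            return word[:-len(suffix)]
    return word
```

The same rules also show how stemming creates errors: `stem("news")` gives "new", which conflates two unrelated words.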

Synonymy, polysemy
- Synonymy
  - Several words can say the same thing: car, automobile
- Polysemy
  - The same word can have several meanings
- Latent semantic indexing (LSI)
  - Word co-occurrences in the entire collection
  - Can reduce effects of synonyms

Query expansion vs. relevance feedback
- Most queries contain only very few keywords
- Add keywords to expand the original query
  - Can be automatic or manual
  - Semantically similar words, synonyms, discriminative words
- Often used in a similar way as relevance feedback, but not with entire documents
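Automatic expansion with synonyms can be sketched with a small lookup table. The `SYNONYMS` table and the function name are assumptions; in practice the added words would come from a thesaurus, an ontology, or the collection itself.

```python
# Hypothetical synonym table, for illustration only.
SYNONYMS = {
    "car": ["automobile", "auto"],
    "x-ray": ["radiograph"],
}

def expand_query(query, synonyms=SYNONYMS):
    """Return the original query words followed by their synonyms,
    without duplicates and in a stable order."""
    expanded = []
    for word in query.lower().split():
        for term in [word] + synonyms.get(word, []):
            if term not in expanded:
                expanded.append(term)
    return expanded
```

A short query such as "red car" grows to four terms, so documents that only mention "automobile" can now be found.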

Medical terminologies
- MeSH, UMLS are frequently used
- Mapping of free text to terminologies
  - Quality for the first few is very high
- Links between items can be used
  - Hyponyms, hypernyms, ...
- Several axes exist (anatomy, pathology, ...)
  - This can be used for making a query more discriminative
- This can also be used for multilingual retrieval
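Mapping free text to a terminology can be sketched with a toy vocabulary. The concept identifiers and entries below are invented and are not real MeSH or UMLS codes; the exact word matching is also a simplification.

```python
# Hypothetical mini-terminology: the IDs and terms are invented,
# NOT real MeSH or UMLS identifiers.
TERMINOLOGY = {
    "C001": {"lung", "lunge", "poumon"},       # an anatomy concept
    "C002": {"fracture", "fraktur", "bruch"},  # a pathology concept
}

def map_to_concepts(text, terminology=TERMINOLOGY):
    """Return the sorted ids of all concepts with a term found in the text."""
    words = set(text.lower().split())
    return sorted(cid for cid, terms in terminology.items() if words & terms)
```

Because English and German terms map to the same identifiers, a query and a document in different languages can be compared on the concept level.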

WordNet
- Hierarchy, links, definitions in the English language
- Maintained at Princeton
- Car, auto, automobile, machine, motorcar
  » motor vehicle, automotive vehicle
  » vehicle
  » conveyance, transport
  » instrumentality, instrumentation
  » artifact, artefact
  » object, physical object
  » entity, something

Apache Lucene
- Open source text retrieval system
  - Written in Java
  - Several tools available
  - Easy to use
- Used in many research projects and in industry
- Image retrieval plugin exists
  - LIRE (Lucene Image REtrieval)
  - Using simple MPEG-7 visual features

Multilingual retrieval
- Many collections are inherently multilingual
  - Web, Flickr, medical teaching files, ...
- Translation resources exist on the web
  - TrebleCLEF has a survey of such resources in progress
- Translate query into document language
- Translate documents into query language
- Map documents and queries onto a common terminology of concepts
- We understand documents in other languages
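The first strategy, translating the query into the document language, can be sketched with a bilingual dictionary. The small English-German table and the function name are assumptions; a real system would use a full dictionary or a machine translation service.

```python
# Tiny illustrative English-to-German dictionary.
EN_TO_DE = {
    "lung": ["lunge"],
    "fracture": ["fraktur", "bruch"],
}

def translate_query(query, dictionary):
    """Replace each query word by all its translations;
    words missing from the dictionary are kept unchanged."""
    translated = []
    for word in query.lower().split():
        translated.extend(dictionary.get(word, [word]))
    return translated
```

Keeping all translation candidates acts like a query expansion in the target language, and untranslated words such as abbreviations or names often still match.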

Cross Language Evaluation Forum (CLEF)
- Forum to compare multilingual retrieval in a variety of domains
  - GeoCLEF
  - QA CLEF
  - Domain-specific CLEF
- Proceedings are a very good start for multilingual techniques

Challenges in multilinguality
- Language pairs have a strongly varying difficulty
  - Languages within the same family are easier for multilingual retrieval
- Resources available depend strongly on the languages used
  - English has many resources; German, Spanish and French quite a few; rare languages rather few

Multilingual tools
- Many translation tools are accessible on the web
  - Yahoo! Babel Fish
  - www.reverso.net
  - Google Translate
- Named entity recognition
- Word-sense disambiguation

Current challenges in text retrieval
- Many taken from the WWW or linked to it
  - Analysis of link structures to obtain information on potential relevance
  - Also in companies, social platforms, ...
- Question of diversity in results
  - You do not want to have the same results show up ten times at the top
- Retrieval in context (domain specific)
- Question answering
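One simple way to avoid the same result showing up several times is to skip results that are too similar to one already kept. The Jaccard word overlap, the threshold value, and the function names are assumptions made for this sketch; it is not a method described in the talk.

```python
def jaccard(a, b):
    """Word overlap between two texts: |intersection| / |union|."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    union = sa | sb
    return len(sa & sb) / len(union) if union else 0.0

def diversify(ranked_texts, max_similarity=0.8):
    """Walk down the ranked list and keep a result only if it is not
    a near-duplicate of any result kept so far."""
    kept = []
    for text in ranked_texts:
        if all(jaccard(text, k) <= max_similarity for k in kept):
            kept.append(text)
    return kept
```

The original ranking order is preserved, so the best-ranked copy of each near-duplicate group is the one that stays.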

Diversity Business Information

Conclusions
- Text retrieval is the basis of image retrieval
  - Many techniques come from this domain
- Text has more semantics than visual features
  - But other problems as well
- Text and image features combined have the best chances for success
  - Use text wherever available
- Multilinguality is an important issue as most of the web is very multilingual
  - And also a part of research

References
- G. Salton and C. Buckley, Term weighting approaches in automatic text retrieval, Information Processing and Management, 24(5):513-523, 1988.
- K. Sparck Jones and C. J. van Rijsbergen, Progress in documentation, Journal of Documentation, 32:59-75, 1976.
- J. J. Rocchio, Relevance feedback in information retrieval, The SMART Retrieval System, Experiments in Automatic Document Processing, pages 313-323.
- M. Braschler and C. Peters, Cross-Language Evaluation Forum: Objectives, Results, Achievements, Information Retrieval, 2004.
- J. Gobeill, H. Müller and P. Ruch, Translation by Text Categorization: Medical Image Retrieval in ImageCLEFmed 2006, Springer Lecture Notes in Computer Science (LNCS 4730), pages 706-710, 2007.