Text Analytics. Introduction to Information Retrieval. Ulf Leser




Summary of last lecture
Ulf Leser: Text Analytics, Winter Semester 2010/2011

Content of this Lecture
- Information Retrieval
  - Introduction
  - Documents
  - Queries
- Related topics
- A first idea: Boolean queries and vector space model

Information Retrieval (IR)
- IR is about helping a user
- IR is about finding information, not about finding data [BYRN99]
- IR builds systems for end users, not for programmers
  - No SQL
- IR (web) is used by almost everybody, databases are not
- IR usually searches unstructured data (e.g. text)
  - But: keyword search in relational databases
- 90% of all information is believed to be present in unstructured form

History of IR
- ~300 BC: Library of Alexandria, ~700,000 documents
- 1450: Printing press
- 19th century: Indices and concordances (expensive and laborious)
- KWIC: H.P. Luhn, IBM (1958)
- Probabilistic models: Maron & Kuhns (1960)
- Boolean queries: Lockheed (~1960)
- Vector Space Model: Salton, Cornell (1965)
  - Faster, simpler to implement, better search results
- 1980s-90s: Digital libraries, SGML, metadata standards
- Mid 1990s: The web, web search engines
- Late 1990s: Personalized search engines, enterprise search, recommendation engines etc.
- 2010: Mobile and context-based search, social networks

The Problem
- Help the user quickly find the requested information in a given set of documents
  - Documents: Corpus, library, collection, ...
  - Quickly: Comfortable, with few queries, without the need to learn a lot about the system, ...
  - Requested: The best-fitting documents; the right passages
[Figure: query -> IR system (over the document base / corpus) -> (ranked) results]

Why is it hard?
- Homonyms (context)
  - Fenster ("window": glass, computer, letter, ...), Berlin (Germany, USA, ...), Boden ("ground/floor": attic, floor, end of something, ...), ...
- Synonyms
  - Computer, PC, Rechner (German for "computer"), Server, ...
- Specific queries
  - What was the score of Bayern München versus Stuttgart in the DFB-Pokal in 1998? Who scored the first goal for Stuttgart?
  - How many hours of sunshine does a day in Crete have on average in May?
- Typical web queries have 1.6 terms
- Information broker is a profession

Quickly
- Time to execute a query
  - Indexing, parallelization, compression, ...
- Time to answer the request
  - Understand request, find best matches
- Usually implemented in one method
  - Success of search engines: Better results (and fast!)
- Process orientation: User feedback, query history, ...
- Information overload
  - "We are drowning in data, but starving for knowledge"
  - If the corpus is large, ranking is top priority
  - Result summarization (grouping on what?)
- Different search modes: What's new? What's certain?

IR: An Iterative, Multi-Stage, Complex Process
[Figure: a sequence of queries Q0 to Q5]
- IR process: Moving through many actions towards a general goal of satisfactory completion of research related to an information need
- "Berry-picking" [Bates 89]

Document or Passage
- Searching only metadata
- Searching within documents
- Interpreting documents

KWIC: KeyWord In Context
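The slide shows KWIC only as a figure. As a rough illustration of the idea (each occurrence of a keyword is listed together with a few words of surrounding context), here is a minimal sketch. The function name `kwic`, the `width` parameter and the whitespace tokenization are assumptions made for this example, not part of the lecture.

```python
def kwic(text, keyword, width=3):
    """Return (left context, keyword, right context) for each hit.

    Tokenization is a plain whitespace split; punctuation is only
    stripped for the comparison, not from the output.
    """
    tokens = text.split()
    hits = []
    for i, token in enumerate(tokens):
        if token.lower().strip(".,;:!?") == keyword.lower():
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            hits.append((left, token, right))
    return hits


if __name__ == "__main__":
    sentence = "IR is about finding information, not about finding data"
    for left, word, right in kwic(sentence, "finding", width=2):
        # Right-align the left context so the keyword forms a column.
        print(f"{left:>12} | {word} | {right}")
```

A real concordance would sort the hits and handle tokenization more carefully; the sketch only shows the keyword-in-the-middle layout.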

Documents
- This lecture: Natural language text
- Might be grammatically correct (books, newspapers) or not (blogs, newsgroups, spoken language)
- May have structure (abstract, summary, chapters, ...) or not (plain text)
- May have (explicit or in-text) metadata (title, author, year, ...) or not
- May be in many different languages or even mixed
  - Foreign characters
- May refer to other documents (hyperlinks)
- May have various formats (ASCII, PDF, DOC, XML, ...)

IR and XML

XML and IR
- XML has two flavors
  - Data oriented: Highly structured, excerpt from a database, short documents, high volume, no inline tags, structure-oriented search
    - This is the case in most applications
  - Document oriented: Less structured, large documents, inline tags, text-oriented search
    - This was the original purpose of XML (and SGML)
- XQuery/XPath targets the data aspect
  - Focus on structural parts of a query
  - No full-text search, few search options within a value
  - Semantics: Exact matching
- XML IR targets the document aspect
- Several proposals for languages addressing both

IR Queries
- Users formulate queries
  - Keywords
  - Phrases
  - Logical operations (AND, OR, NOT, ...)
  - Web search: -ulf +leser
  - Real questions (e.g. MS-Word help)
  - Structured queries (author= and title~ )
- Querying with a document
  - Find documents similar to this one
- Query refinement
  - Refine my query based on first results
  - Find documents within the result set of the previous search
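To make the web-search notation "-ulf +leser" concrete, the following sketch splits such a query into required, excluded and optional terms. It assumes the common convention that "+" marks a term that must occur and "-" a term that must not occur; the slide does not spell this out, and the function names are hypothetical.

```python
def parse_web_query(query):
    """Split a query such as '-ulf +leser' into three term sets."""
    required, excluded, optional = set(), set(), set()
    for token in query.lower().split():
        if token.startswith("+") and len(token) > 1:
            required.add(token[1:])
        elif token.startswith("-") and len(token) > 1:
            excluded.add(token[1:])
        else:
            optional.add(token)
    return required, excluded, optional


def satisfies(doc_words, required, excluded):
    """True if the document contains all required and no excluded terms."""
    return required <= doc_words and not (excluded & doc_words)


if __name__ == "__main__":
    required, excluded, optional = parse_web_query("-ulf +leser retrieval")
    print(satisfies({"leser", "text", "analytics"}, required, excluded))
    print(satisfies({"ulf", "leser"}, required, excluded))
```

Optional terms do not filter anything here; in a ranking system they would typically only influence the score.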

Searching with Metadata (PubMed/Medline)

Query Refinement

Metadata Standard: Dublin Core
- Dublin Core Metadata Initiative (W3C), 1995
- 15 attributes
  - identifier: e.g. ISBN/ISSN, URL/PURL, DOI, ...
  - format: MIME type, media type, ...
  - type: collection, image, text, ...
  - language
  - title
  - subject: keywords
  - coverage: scope of the doc in space and/or time
  - description: free text
  - creator: last person manipulating the doc
  - publisher
  - contributor
  - rights: copyright, licenses, ...
  - source: other doc
  - relation: to other docs
  - date: date or period
- Plus refinements

Example Usage in HTML
<head profile="http://dublincore.org/documents/dcq-html/">
  <title>Dublin Core</title>
  <link rel="schema.dc" href="http://purl.org/dc/..." />
  <link rel="schema.dcterms" href="http://purl.org/..." />
  <meta name="dc.format" scheme="..." content="text/html" />
  <meta name="dc.type" scheme="..." content="text" />
  <meta name="dc.publisher" content="Jimmy Whales" />
  <meta name="dc.subject" content="Dublin Core Metadata" />
  <meta name="dc.creator" content="Björn G. Kulms" />
  <meta name="dcterms.license" scheme="dcterms.uri" content="http://www.gnu.org/copyleft/fdl.html" />
</head>

Content of this Lecture
- Information Retrieval
- Related topics
- A first idea: Boolean queries and vector space model

Historic Texts
- Sachsenspiegel, ~1250
  - "Swerlenrecht kůnnen wil d~ volge dis buches lere.alrest sul wi mer ken, daz..."
- Multiple representations
  - Facsimile
  - Digitization / diplomatic transcription
  - How well can the facsimile be reproduced from the digital form?
- Differences between individual writers (proliferating errors)
- Different translations
- Different editions

Other Buzzwords
- Document management systems (DMS)
  - Large commercial market, links to OCR, workflow systems, print management systems, mail management, etc.
  - Legal issues (compliance, reporting, archival, ...)
  - Every DMS includes an IR system
- Knowledge management
  - More sophisticated DMS with semantic searching
  - Ontologies, thesauri, topic maps, ...
  - Social aspects: incentives, communities, standard procedures, enterprise vocabulary, ...
  - "If Siemens knew what Siemens knows"
- Digital libraries
  - Somewhat broader and less technical
  - Includes social aspects, archiving, multimedia, ...

Enterprise Content Management
- "The technologies used to capture, manage, store, deliver, and preserve information to support business processes" (Source: AIIM International)
- Authorization and authentication
- Business process management and document flow
- Compliance: legal requirements
  - Record management
  - Pharma, finance, ...
- Collaboration and sharing
  - Inter- and intra-organizational
  - Transactions, locks, ...
- Publishing: What, when, where
  - Web, catalogues, mail push, ...

Technique versus Content
- IR is about techniques for searching a given doc collection
- Creating doc collections is a business: content providers
  - Selection/filtering: classified business news, new patents, ...
  - Augmentation: annotation with metadata, summarization, linking of additional data, ...
- Examples
  - Medline: >5000 journals, 19 million citations, >500K added per year
    - "Journals not in Medline don't exist"
  - Institute for Scientific Information (ISI)
    - Impact factors: which journals count?
  - Web catalogues à la Yahoo
  - Press reviews, web monitoring

Content of this Lecture
- Information Retrieval
- Related topics
- A first idea: Boolean queries and vector space model

Logical View
- Definition: The logical view on a document denotes its representation inside the corpus
- Determines what we can query
  - Only metadata, only title, only abstract, full text, ...
- Creating the logical view of a doc involves transformations
  - Stemming, stop word removal
  - Transformation of special characters
    - Umlauts, Greek letters, XML/HTML encodings, ...
  - Removal of formatting information (HTML), tags (XML), ...
- Bag of words (BoW)
  - Arbitrary yet fixed order (e.g. sorted alphabetically)
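The transformations listed above can be sketched in a few lines. This is a minimal illustration under stated assumptions: the stop word list and the umlaut table are toy examples, tags are removed with a simple regular expression, and stemming is left out entirely. The function name `logical_view` is hypothetical.

```python
import re

# Toy stop word list and transliteration table (assumptions for this sketch).
STOP_WORDS = {"the", "a", "of", "and", "is", "in"}
UMLAUTS = str.maketrans({"ä": "ae", "ö": "oe", "ü": "ue", "ß": "ss"})


def logical_view(raw):
    """Turn a raw (possibly HTML) document into a sorted bag of words."""
    text = re.sub(r"<[^>]+>", " ", raw)      # remove HTML/XML tags
    text = text.lower().translate(UMLAUTS)   # normalize case and umlauts
    tokens = re.findall(r"[a-z0-9]+", text)  # very simple tokenization
    # Arbitrary yet fixed order: sort alphabetically.
    return sorted(set(tokens) - STOP_WORDS)


if __name__ == "__main__":
    print(logical_view("<p>The Müller index of the <b>corpus</b></p>"))
```

Because the result is a set of distinct words, word order and word frequency are lost; that is the defining property of the bag-of-words view used on the following slides.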

Preprocessing
[Figure; source: [BYRN99]]
- Tokenization can be a real issue
- Special characters can be a real issue
- Format conversion (OCR) can be a real issue
- Syntactic analysis can be a real issue

A First Approach: Boolean Queries (rough idea)
- Search all docs which match a logical expression of words
  - bono AND u2, bono AND (Florida OR Texas), bono AND NOT u2
- A simple evaluation scheme
  - Compute the BoW of all documents: bow(d)
  - Compute all atoms of the query (w)
  - An atom w evaluates to true on d iff w ∈ bow(d)
  - Compute the values of all atoms for each d
  - Compute the value of the logical expression for each d
- Advantages: Fast, simple
  - Efficient implementations do not scan all documents
- Disadvantages
  - No ranking: a document is either relevant or not
  - Logic is hard to understand for ordinary users
  - Typical queries are too simple: too many results (and no ranking)
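The simple evaluation scheme can be sketched as follows. Queries are written as nested tuples instead of being parsed from text, and the three documents are invented for illustration; both are assumptions of this sketch. It scans every document, so it is the naive scheme and not one of the efficient implementations the slide mentions.

```python
def bow(text):
    """Bag of words of a document: the set of its lower-cased tokens."""
    return set(text.lower().split())


def matches(query, words):
    """Evaluate a Boolean query on one bag of words.

    A query is either an atom (a string) or a tuple
    ("AND", q1, q2, ...), ("OR", q1, q2, ...) or ("NOT", q).
    """
    if isinstance(query, str):
        return query in words            # atom w is true iff w in bow(d)
    op, *args = query
    if op == "AND":
        return all(matches(a, words) for a in args)
    if op == "OR":
        return any(matches(a, words) for a in args)
    if op == "NOT":
        return not matches(args[0], words)
    raise ValueError(f"unknown operator: {op}")


# Hypothetical mini corpus.
DOCS = {
    "d1": "bono sings with u2 in texas",
    "d2": "bono visits florida",
    "d3": "u2 releases a new album",
}
BOWS = {name: bow(text) for name, text in DOCS.items()}


def search(query):
    """Return the names of all matching documents (unranked)."""
    return sorted(name for name, words in BOWS.items() if matches(query, words))


if __name__ == "__main__":
    print(search(("AND", "bono", "u2")))
    print(search(("AND", "bono", ("OR", "florida", "texas"))))
    print(search(("AND", "bono", ("NOT", "u2"))))
```

The result is a plain set of matching documents, which illustrates the main disadvantage: there is no notion of one match being better than another.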

Vector Space Model (rough idea)
- Let d be a doc and D a collection of docs with d ∈ D
- Let bow(D) be the bag of words of the whole collection, i.e. all words occurring in any doc of D
- Let v_d be a vector with
  - v_d[i] = 0 if the i-th word in bow(D) is not contained in bow(d)
  - v_d[i] = 1 if the i-th word in bow(D) is contained in bow(d)
- Let v_q be the corresponding vector for query q
- Compute the relevance score for q wrt. each d as

\[
rel(d,q) = \frac{v_d \cdot v_q}{|v_d| \cdot |v_q|}
         = \frac{\sum_{i=1}^{|bow(D)|} v_d[i] \cdot v_q[i]}
                {\sqrt{\sum_{i=1}^{|bow(D)|} v_d[i]^2} \cdot \sqrt{\sum_{i=1}^{|bow(D)|} v_q[i]^2}}
\]

Angle between Document and Query
- Relevance: angle between doc and query in |bow(D)|-dimensional space (one dimension per distinct word in the collection)
- Return docs in order of decreasing relevance
[Figure: labels Star, Astronomy, Movie stars, Mammals, Diet]
- Advantages
  - Still quite simple
  - Meaningful ranking: better results than the Boolean model
  - Allows approximate answers (not all terms contained)
- Disadvantages
  - Words are assumed to be statistically independent
  - Not easy to integrate NOT or OR
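A minimal sketch of the vector space model as described here: binary word vectors over the vocabulary of the collection, the cosine of the angle as relevance score, and results returned in order of decreasing relevance. The three documents loosely follow the labels of the figure but are otherwise invented, and all function names are assumptions of this sketch.

```python
import math

# Hypothetical documents, already reduced to bags of words.
DOCS = {
    "d1": {"star", "astronomy", "telescope"},
    "d2": {"movie", "star", "actor"},
    "d3": {"mammals", "diet"},
}
# Fixed order of all words in the collection.
VOCABULARY = sorted(set().union(*DOCS.values()))


def to_vector(words):
    """Binary vector: 1 if the i-th vocabulary word occurs, else 0."""
    return [1 if w in words else 0 for w in VOCABULARY]


def rel(v_d, v_q):
    """Cosine of the angle between document and query vector."""
    dot = sum(a * b for a, b in zip(v_d, v_q))
    norm = math.sqrt(sum(a * a for a in v_d)) * math.sqrt(sum(b * b for b in v_q))
    return dot / norm if norm else 0.0


def rank(query_words):
    """Return (score, doc name) pairs in order of decreasing relevance."""
    v_q = to_vector(query_words)
    scored = [(rel(to_vector(words), v_q), name) for name, words in DOCS.items()]
    return sorted(scored, key=lambda pair: (-pair[0], pair[1]))


if __name__ == "__main__":
    for score, name in rank({"star", "astronomy"}):
        print(f"{name}: {score:.3f}")
```

Document d2 contains only one of the two query terms and still receives a non-zero score, which is the "approximate answers" advantage. Query words that do not occur in the collection are silently ignored in this sketch.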

Outlook: Other techniques
- We often also want to find documents which don't even contain any of the words we are searching for
  - Docs that contain the concepts we are trying to express by our query
- Latent Semantic Indexing
  - Map query and docs into a concept space
  - Compare and rank in concept space
  - Technical: Dimensionality reduction (like PCA etc.)
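The following toy sketch illustrates the idea of comparing in a concept space. It is not a production LSI implementation: instead of a library SVD it computes the top two eigenvectors of the term co-occurrence matrix by power iteration with deflation, which yields the same concept directions for this small example. The corpus, the choice of two concepts and all names are assumptions made for illustration.

```python
import math

# Hypothetical corpus: "car" and "auto" never co-occur, but both occur with "engine".
DOCS = {
    "d1": {"car", "engine"},
    "d2": {"auto", "engine"},
    "d3": {"flower", "petal"},
    "d4": {"flower", "petal"},
}
VOCAB = sorted(set().union(*DOCS.values()))


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def cosine(a, b):
    norm = math.sqrt(dot(a, a)) * math.sqrt(dot(b, b))
    return dot(a, b) / norm if norm else 0.0


def to_vector(words):
    """Binary word-space vector over VOCAB."""
    return [1.0 if w in words else 0.0 for w in VOCAB]


def top_eigenvectors(matrix, k, iterations=300):
    """Top-k eigenvectors of a symmetric matrix (power iteration + deflation)."""
    m = [row[:] for row in matrix]
    n = len(m)
    basis = []
    for _ in range(k):
        v = [float(i + 1) for i in range(n)]   # fixed start vector
        for _ in range(iterations):
            w = [dot(row, v) for row in m]
            length = math.sqrt(dot(w, w))
            if length == 0:
                break
            v = [x / length for x in w]
        eigenvalue = dot(v, [dot(row, v) for row in m])
        basis.append(v)
        for i in range(n):                     # remove the found direction
            for j in range(n):
                m[i][j] -= eigenvalue * v[i] * v[j]
    return basis


# Term-term co-occurrence counts (term-document matrix times its transpose).
COOC = [[sum(1.0 for ws in DOCS.values() if a in ws and b in ws) for b in VOCAB]
        for a in VOCAB]
CONCEPTS = top_eigenvectors(COOC, k=2)


def to_concepts(vector):
    """Map a word-space vector into the 2-dimensional concept space."""
    return [dot(c, vector) for c in CONCEPTS]


if __name__ == "__main__":
    q = to_vector({"car"})
    for name, words in sorted(DOCS.items()):
        d = to_vector(words)
        print(name, round(cosine(q, d), 3),
              round(cosine(to_concepts(q), to_concepts(d)), 3))
```

In word space the query "car" has no overlap with d2 ("auto engine"), so its cosine score there is zero. In the concept space both fall onto the same "vehicle" direction and d2 becomes a match, while the flower documents stay unrelated.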