Text Analytics. Introduction to Information Retrieval. Ulf Leser

Similar documents
Text Analytics. Introduction to Information Retrieval. Ulf Leser

Search and Information Retrieval

Intelligent Search for Answering Clinical Questions Coronado Group, Ltd. Innovation Initiatives

Introduction to Information Retrieval

CONCEPTCLASSIFIER FOR SHAREPOINT

Information Standards on the Net

Information and documentation The Dublin Core metadata element set

Using LSI for Implementing Document Management Systems Turning unstructured data from a liability to an asset.

Search Taxonomy. Web Search. Search Engine Optimization. Information Retrieval

Making Content Easy to Find. DC2010 Pittsburgh, PA Betsy Fanning AIIM

Electronic Content and Document Management with XML

Open Netscape Internet Browser or Microsoft Internet Explorer and follow these steps:

Search Engine Optimization Glossary

What do Big Data & HAVEn mean? Robert Lejnert HP Autonomy

Search Engine Technology and Digital Libraries: Moving from Theory t...

Introduction to IR Systems: Supporting Boolean Text Search. Information Retrieval. IR vs. DBMS. Chapter 27, Part A

Are you ready for more efficient and effective ways to manage discovery?

Computer software. Computer software: summary/overview/abstract (Part 1)

Web Mining. Margherita Berardi LACAM. Dipartimento di Informatica Università degli Studi di Bari

Network Working Group

DDI Lifecycle: Moving Forward Status of the Development of DDI 4. Joachim Wackerow Technical Committee, DDI Alliance

II. PREVIOUS RELATED WORK

A Visual Tagging Technique for Annotating Large-Volume Multimedia Databases

SQL Server 2012 Performance White Paper

Web 3.0 image search: a World First

Plagiarism. Dr. M.G. Sreekumar UNESCO Coordinator, Greenstone Support for South Asia Head, LRC & CDDL, IIM Kozhikode

Medical Information-Retrieval Systems. Dong Peng Medical Informatics Group

ONTOLOGY-BASED APPROACH TO DEVELOPMENT OF ADJUSTABLE KNOWLEDGE INTERNET PORTAL FOR SUPPORT OF RESEARCH ACTIVITIY

ISSUES ON FORMING METADATA OF EDITORIAL SYSTEM S DOCUMENT MANAGEMENT

Whitepaper: Enterprise Vault Discovery Accelerator and Clearwell A Comparison August 2012

EC Wise Report: Unlocking the Value of Deeply Unstructured Data. The Challenge: Gaining Knowledge from Deeply Unstructured Data.

Research Network and Database System (FuD)

Big Data Text Mining and Visualization. Anton Heijs

Lightweight Data Integration using the WebComposition Data Grid Service

Folksonomies versus Automatic Keyword Extraction: An Empirical Study

7. Classification. Business value. Structuring (repetition) Automation. Classification (after Leymann/Roller) Automation.

Experience with Knowledge Management at GRS Part 1: Document and Information Management

Indexed vs. Unindexed Searching:

OPACs' Users' Interface Do They Need Any Improvements? Discussion on Tools, Technology, and Methodology

Tibiscus University, Timişoara

Certified Information Professional 2016 Update Outline

A Quick Start Guide to Searching the LLC Catalogue and Databases

Legal Informatics Final Paper Submission Creating a Legal-Focused Search Engine I. BACKGROUND II. PROBLEM AND SOLUTION

Chapter 1 Databases and Database Users

Quality Assurance Checklists for Evaluating Learning Objects and Online Courses

Discovery of Electronically Stored Information ECBA conference Tallinn October 2012

Copyright 2011 Pearson Education, Inc. Publishing as Pearson Addison-Wesley. Chapter 1 Outline

Encoding Library of Congress Subject Headings in SKOS: Authority Control for the Semantic Web

Fast, Easy to use and On-demand A Content Platform of the 21st Century

OvidSP Quick Reference Guide

Student Data Reporting I Cognos

SEO Optimization A Developer s Role

F. Aiolli - Sistemi Informativi 2007/2008

Record Tagging for Microsoft Dynamics CRM

Flattening Enterprise Knowledge

English 2 - Journalism Mitch Martin: mmartin@naperville203.org

I. INTRODUCTION NOESIS ONTOLOGIES SEMANTICS AND ANNOTATION

Modern Databases. Database Systems Lecture 18 Natasha Alechina

An Ontology Based Text Analytics on Social Media

General principles and architecture of Adlib and Adlib API. Petra Otten Manager Customer Support

Getting started with Mendeley. Guide by ITC library

Migrate from Exchange Public Folders to Business Productivity Online Standard Suite

SQL Server Master Data Services A Point of View

Authoring Guide for Perception Version 3

Text Mining and Analysis

BUILDING A BIODIVERSITY CONTENT MANAGEMENT SYSTEM FOR SCIENCE, EDUCATION, AND OUTREACH

Digital Libraries and Content Management

Semantically Enhanced Web Personalization Approaches and Techniques

PSG College of Technology, Coimbatore Department of Computer & Information Sciences BSc (CT) G1 & G2 Sixth Semester PROJECT DETAILS.

From Databases to Natural Language: The Unusual Direction

CIMM2 New Functionality Series 3

ebusiness Web Hosting Alternatives Considerations Self hosting Internet Service Provider (ISP) hosting

Configuring SharePoint 2013 Document Management and Search. Scott Jamison Chief Architect & CEO Jornata scott.jamison@jornata.com

Mining Text Data: An Introduction

So today we shall continue our discussion on the search engines and web crawlers. (Refer Slide Time: 01:02)

Manage Your Shop with Policy Based Management & Central Management Server

Enterprise Content Management with Microsoft SharePoint

Web Design (One Credit), Beginning with School Year

Integrated Library Systems (ILS) Glossary

Content Management Using Rational Unified Process Part 1: Content Management Defined

Invenio: A Modern Digital Library for Grey Literature

Pragmatic Web 4.0. Towards an active and interactive Semantic Media Web. Fachtagung Semantische Technologien September 2013 HU Berlin

Internet Technologies_1. Doc. Ing. František Huňka, CSc.

ASPECTS OF XML TECHNOLOGY IN ebusiness TRANSACTIONS

LabArchives Electronic Lab Notebook:

Semantic SharePoint. Technical Briefing. Helmut Nagy, Semantic Web Company Andreas Blumauer, Semantic Web Company

Introduction to WIPOScan Software

Digital Asset Management

Leverage SharePoint with PSI:Capture

Database Management. Technology Briefing. Modern organizations are said to be drowning in data but starving for information p.

ARCHIVING FOR EXCHANGE 2013

Integrated archiving: streamlining compliance and discovery through content and business process management

Data and Analysis. Informatics 1 School of Informatics, University of Edinburgh. Part III Unstructured Data. Ian Stark. Staff-Student Liaison Meeting

Data Management Services. We Bring Paperless World for you!!

Course 1: Digital Libraries Introduction Unit 4: Digital libraries functional components software

Slide 7. Jashapara, Knowledge Management: An Integrated Approach, 2 nd Edition, Pearson Education Limited Nisan 14 Pazartesi

How To Write A Book On The Digital Age Of Science

Document Management Glossary

EUR-Lex 2012 Data Extraction using Web Services

Transcription:

Text Analytics Introduction to Information Retrieval Ulf Leser

Content of this Lecture Information Retrieval Introduction Documents Queries Related topics A first idea: Boolean queries and vector space model Ulf Leser: Text Analytics, Winter Semester 2012/2013 2

Properties of Information Retrieval (IR) IR is about helping a user IR is about finding information, not about finding data [BYRN99] IR builds systems for end users, not for programmers No SQL IR (web) is used by almost everybody, databases are not IR searches unstructured data (e.g. text) But: Keyword search in relational databases 90% of all information is presented in unstructured form Claim some analysists Ulf Leser: Text Analytics, Winter Semester 2012/2013 3

History ~300 ad. Library of Alexandria, ~700.000 documents 1450: Bookprint 19th century: Indices and concordance Probabilistic models: Maron & Kuhns (1960) Boolean queries: Lockheed (~1960) Vector Space Model: Salton, Cornell (1965) Faster, simpler to implement, better search results 80s-90s: Digital libraries, SGML, metadata standards Mid 90s: The web, web search engines End 90s: Personalized search engines, recommendations 2010: Mobile and context-based search, social networks Ulf Leser: Text Analytics, Winter Semester 2012/2013 4

The Problem Help user in quickly finding the requested information from a given set of documents Documents: Corpus, library, collection, Quickly: Few queries, no need to learn about the system Requested: The best-fitting documents; the right passages; the most relevant content Query (Ranked) Results IR System Document base Corpus Ulf Leser: Text Analytics, Winter Semester 2012/2013 5

Why is it hard? Homonyms (context) Fenster (Glas, Computer, Brief, ), Berlin (BRD, USA, ), Boden (Dach, Fussboden, Ende von etwas, ), Synonyms Computer, PC, Rechner, Server, Specific queries What was the score of Bayern München versus Stuttgart in the DFB Pokal finals in 1998? Who scored the first goal for Stuttgart? How many hours of sunshine on average has a day in Crete in May? Typical web queries have 1,6 terms Information broker is a profession Ulf Leser: Text Analytics, Winter Semester 2012/2013 6

Quickly Time to execute a query Indexing, parallelization, compression, Time to answer the request (may involve multiple queries) Understand request, find best matches Success of search engines: Better results (and fast!) Process-orientation: User feedback, query history, Information overload We are drowning in data, but starving for knowledge In the corpus is large, ranking is a must Result summarization (grouping on what?) Different search modes: What s new? What s certain? Ulf Leser: Text Analytics, Winter Semester 2012/2013 7

IR: An Iterative, Multi-Stage Process Q2 Q4 Q1 Q3 Q5 Q0 IR process: Moving through many actions towards a general goal of satisfactory completion of research related to an information need. Berry-picking [Bates 89] Ulf Leser: Text Analytics, Winter Semester 2012/2013 8

Document or Passage Searching only metadata Searching within documents Interpreting documents Ulf Leser: Text Analytics, Winter Semester 2012/2013 9

Documents This lecture: Natural language text Might be grammatically correct (books, newspapers) or not (Blogs, Twitter, spoken language) May have structure (abstract, chapters, ) or not May have (explicit or in-text) metadata or not May be in many different languages or even mixed Foreign characters May refer to other documents (hyperlinks) May have various formats (ASCII, PDF, DOC, XML, ) Ulf Leser: Text Analytics, Winter Semester 2012/2013 10

XML as Data Container Data oriented Highly structured, exported from database, short documents, high volume, no inline tags, structureoriented search Today predominant applications XQuery/XPath Focus on structural parts of a query No full-text search, few search options within a value Semantics: Exact matching Ulf Leser: Text Analytics, Winter Semester 2012/2013 11

XML as Document Markup Document oriented Less structured, large documents, inline tags, text-oriented search This was the original purpose of XML (and SGML) XML IR is targeting the document aspect Several proposals yet no standard for languages addressing both aspects Ulf Leser: Text Analytics, Winter Semester 2012/2013 12

Quickly Time to execute a query Indexing, parallelization, compression, Time to answer the request (may involve multiple queries) Understand request, find best matches Success of search engines: Better results (and fast!) Process-orientation: User feedback, query history, Information overload We are drowning in data, but starving for knowledge In the corpus is large, ranking is a must Result summarization (grouping on what?) Different search modes: What s new? What s certain? Ulf Leser: Text Analytics, Winter Semester 2012/2013 13

Documents This lecture: Natural language text Might be grammatically correct (books, newspapers) or not (Blogs, Twitter, spoken language) May have structure (abstract, chapters, ) or not May have (explicit or in-text) metadata or not May be in many different languages or even mixed Foreign characters May refer to other documents (hyperlinks) May have various formats (ASCII, PDF, DOC, XML, ) Ulf Leser: Text Analytics, Winter Semester 2012/2013 14

IR Queries Users formulate queries Keywords or phrases Logical operations (AND, OR, NOT,...) Web search: -ulf +leser Natural language questions (e.g. MS-Word help) Structured queries (author= AND title~ ) With documents: Find documents similar to this one Query refinement based on previous results Find documents matching the new query within the result set of the previous search Use relevant answers from previous queries to modify next query Ulf Leser: Text Analytics, Winter Semester 2012/2013 15

Searching with Metadata (PubMed/Medline) Ulf Leser: Text Analytics, Winter Semester 2012/2013 16

Query Refinement Ulf Leser: Text Analytics, Winter Semester 2012/2013 17

Dublin Core Metadata Initiative (W3C), 1995 identifier: ISBN/ISSN, URL/PURL, DOI, format: MIME-Typ, media type, type: Collection, image, text, language title subject: Keywords coverage: Scope of doc in space and/or time description: Free text creator: Last person manipulating the doc publisher: contributor: rights: Copyright, licenses, source: Other doc relation: To other docs date: Date or period Ulf Leser: Text Analytics, Winter Semester 2012/2013 18

Usage in HTML <head profile="http://dublincore.org/documents/dcq-html/"> <title>dublin Core</title> <link rel="schema.dc" href="http://purl.org/dc/..." /> <link rel="schema.dcterms" href="http://purl.org/..." /> <meta name="dc.format" scheme=" " content="text/htm" /> <meta name="dc.type" scheme=" " content="text" /> <meta name="dc.publisher" content="jimmy Whales" /> <meta name="dc.subject" content="dublin Core Metadata" /> <meta name="dc.creator" content="björn G. Kulms" /> <meta name="dcterms.license" scheme="dcterms.uri" content="http://www.gnu.org/copyleft/fdl.html" /> </head> Ulf Leser: Text Analytics, Winter Semester 2012/2013 19

Content of this Lecture Information Retrieval Related topics A first idea: Boolean queries and vector space model Ulf Leser: Text Analytics, Winter Semester 2012/2013 20

Historic Texts Sachsenspiegel, ~1250 Swerlenrecht kůnnen wil d~ volge dis buches lere.alrest sul wi mer ken, daz... Multiple representations Facsimile Digitalization / diplomacy How well can the facsimile be reproduced from the dig. form? Differences in individual writers (proliferating errors) Different translations Different editions Ulf Leser: Text Analytics, Winter Semester 2012/2013 21

Other Buzzwords Document management systems (DMS) Large commercial market, links to OCR, workflow systems, print management systems, mail management, etc. Legal issues (compliance, reporting, archival, ) Every DMS includes an IR system Knowledge management More sophisticated DMS with semantic searching Ontologies, thesauri, topic maps, Social aspects: incentives, communities, standard procedures, enterprise vocabulary, Digital libraries Somewhat broader and less technical Includes social aspects, archiving, multimedia, Ulf Leser: Text Analytics, Winter Semester 2012/2013 22

Enterprise Content Management The technologies used to capture, manage, store, deliver, and preserve information to support business processes Quelle: AIIM International Authorization and authentication Business process management and document flow Compliance: legal requirements Record management Pharma, Finance, Collaboration and sharing Inter and intra organizations Transactions, locks, Publishing: What, when, where Web, catalogues, mail push, Ulf Leser: Text Analytics, Winter Semester 2012/2013 23

Technique versus Content IR is about techniques for searching a given doc collection Creating doc collections is a business: Content provider Selection/filtering: classified business news, new patents, Augmentation: Annotation with metadata, summarization, linking of additional data, Examples Medline: >5000 Journals, <20M citations, >500K added per year Institute for Scientific Information (ISI) Impact factors: which journals count how much? Web catalogues ala Yahoo Pressespiegel, web monitoring Ulf Leser: Text Analytics, Winter Semester 2012/2013 24

Content of this Lecture Information Retrieval Related topics A first idea: Boolean queries and vector space model Ulf Leser: Text Analytics, Winter Semester 2012/2013 25

A First Approach: Boolean Queries (rough idea) Search all docs which match a logical expression of words bono AND u2, bono AND (Florida OR Texas), bono AND NOT u2 A simple evaluation scheme Compute BoW of all documents: bow(d) Compute all atoms of the query (w) An atom w evaluates to true on d iff w bow(d) Compute values of all atoms for each d Compute value of logical expression for each d Ulf Leser: Text Analytics, Winter Semester 2012/2013 26

A First Approach: Boolean Queries (rough idea) Advantages: Fast, simple Efficient implementations do not scan all documents Disadvantages No ranking: Either relevant or not Logic is hard to understand for ordinary users Typical queries are too simple too many results (and no ranking) Ulf Leser: Text Analytics, Winter Semester 2012/2013 27

Ulf Leser: Text Analytics, Winter Semester 2012/2013 28 Vector Space Model (rough idea) Let d by a doc, D be a collection of docs with d D Let v d by a vector with v d [i]=0 if the i th word in bow(d) is not contained in bow(d) v d [i]=1 if the i th word in bow(d) is contained in bow(d) Let v q by the corresponding vector for query q Relevance score: Angle between d and v q = = = 2 2 ) ( 1 ] [ * ] [ ] [ ]* [ * ), ( i v i v i v i v v v v v q d rel q d D w i q d q d q d

Angel between Document and Query Relevance: angle between doc and query in bow(d) - dimensional space Return docs in order of decreasing relevance Advantages Still quite simple Meaningful ranking better results than Boolean model Allows approximate answers (not all terms contained) Disadvantages Star Words are assumed to be statistically independent Not easy to integrate NOT or enforce AND Astronomy Movie stars Mammals Diet Ulf Leser: Text Analytics, Winter Semester 2012/2013 29