Informa(on Retrieval



Similar documents
A Comparison of Dictionary Implementations

Structure for String Keys

Review of Hashing: Integer Keys

Semantic Analysis: Types and Type Checking

Binary Heap Algorithms

Base Conversion written by Cathy Saxton

benefit of virtualiza/on? Virtualiza/on An interpreter may not work! Requirements for Virtualiza/on 1/06/15 Which of the following is not a poten/al

The programming language C. sws1 1

Retaining globally distributed high availability Art van Scheppingen Head of Database Engineering

Chapter 8: Bags and Sets

Physical Data Organization

Enhanced Algorithm for Efficient Retrieval of Data from a Secure Cloud

Symbol Tables. Introduction

Record Storage and Primary File Organization

Report on the Train Ticketing System

Data Management in the Cloud: Limitations and Opportunities. Annies Ductan

The following themes form the major topics of this chapter: The terms and concepts related to trees (Section 5.2).

Hash Tables. Computer Science E-119 Harvard Extension School Fall 2012 David G. Sullivan, Ph.D. Data Dictionary Revisited

ECS 165B: Database System Implementa6on Lecture 2

Full Text Search in MySQL 5.1 New Features and HowTo

Search and Information Retrieval

Storing Measurement Data

Introduction to Information Retrieval

DATA STRUCTURES USING C

Storage Management for Files of Dynamic Records

Data storage Tree indexes

1) The postfix expression for the infix expression A+B*(C+D)/F+D*E is ABCD+*F/DE*++

1. Define: (a) Variable, (b) Constant, (c) Type, (d) Enumerated Type, (e) Identifier.

Topic Extrac,on from Online Reviews for Classifica,on and Recommenda,on (2013) R. Dong, M. Schaal, M. P. O Mahony, B. Smyth

Why you shouldn't use set (and what you should use instead) Matt Austern

High-performance XML Storage/Retrieval System

Data Warehousing und Data Mining

From Last Time: Remove (Delete) Operation

Fast Arithmetic Coding (FastAC) Implementations

Useful Java Collections

Chapter 2 The Information Retrieval Process

Data Structures Fibonacci Heaps, Amortized Analysis

Node-Based Structures Linked Lists: Implementation

Introduction to Data Structures

Heaps & Priority Queues in the C++ STL 2-3 Trees

1. Relational database accesses data in a sequential form. (Figures 7.1, 7.2)

Compsci 101: Test 1. October 2, 2013

DATABASE DESIGN - 1DL400

CS 2112 Spring Instructions. Assignment 3 Data Structures and Web Filtering. 0.1 Grading. 0.2 Partners. 0.3 Restrictions

Lecture 1: Data Storage & Index

Multi-dimensional index structures Part I: motivation

Efficient Retrieval of Partial Documents

SQL Query Evaluation. Winter Lecture 23

Indexing and Compression of Text

Efficient representation of integer sets

Ngram Search Engine with Patterns Combining Token, POS, Chunk and NE Information

Big Data Technology Map-Reduce Motivation: Indexing in Search Engines

Electronic Document Management Using Inverted Files System

Beyond the Mouse A Short Course on Programming

Connec(ng to the NC Educa(on Cloud

DiskTrie: An Efficient Data Structure Using Flash Memory for Mobile Devices

Trace-Based and Sample-Based Profiling in Rational Application Developer

Suppor&ng the Design of Safety Cri&cal Systems Using AADL

! # % & (() % +!! +,./// 0! 1 /!! 2(3)42( 2

Universal hashing. In other words, the probability of a collision for two different keys x and y given a hash function randomly chosen from H is 1/m.

INTRODUCTION The collection of data that makes up a computerized database must be stored physically on some computer storage medium.

Automata Theory. Şubat 2006 Tuğrul Yılmaz Ankara Üniversitesi

Quantum Computing Lecture 7. Quantum Factoring. Anuj Dawar

Lecture Notes on Binary Search Trees

Let s put together a Manual Processor


Introduction to IR Systems: Supporting Boolean Text Search. Information Retrieval. IR vs. DBMS. Chapter 27, Part A

Big Data and Scripting. Part 4: Memory Hierarchies

This section describes how LabVIEW stores data in memory for controls, indicators, wires, and other objects.

Output: struct treenode{ int data; struct treenode *left, *right; } struct treenode *tree_ptr;

A single register, called the accumulator, stores the. operand before the operation, and stores the result. Add y # add y from memory to the acc

CS101 Lecture 11: Number Systems and Binary Numbers. Aaron Stevens 14 February 2011

Simple Image File Formats

Requirements Specification and Testing Part 1

X1 Professional Client

Boolean Expressions, Conditions, Loops, and Enumerations. Precedence Rules (from highest to lowest priority)

10CS35: Data Structures Using C

8- Management of large data volumes

Services. Relational. Databases & JDBC. Today. Relational. Databases SQL JDBC. Next Time. Services. Relational. Databases & JDBC. Today.

Can interpreting be as fast as byte compiling? + Other developments in pqr

Fast Sequential Summation Algorithms Using Augmented Data Structures

Lecture 12 Doubly Linked Lists (with Recursion)

CSE 326, Data Structures. Sample Final Exam. Problem Max Points Score 1 14 (2x7) 2 18 (3x6) Total 92.

Introduc)on to Hadoop

The Relational Data Model

DATABASDESIGN FÖR INGENJÖRER - 1DL124

Overview of Databases On MacOS. Karl Kuehn Automation Engineer RethinkDB

Adaptive String Dictionary Compression in In-Memory Column-Store Database Systems

CmpSci 187: Programming with Data Structures Spring 2015

Building custom memory profilers. In# Gonzalez- Herrera Diverse Team Advisers: Johann Bourcier and Olivier Barais

Memory management. Announcements. Safe user input. Function pointers. Uses of function pointers. Function pointer example

We will learn the Python programming language. Why? Because it is easy to learn and many people write programs in Python so we can share.

Domain Name System. Heng Sovannarith

Information Retrieval Systems in XML Based Database A review

High Performance Compu2ng Facility

Infrastructures for big data

Java Types and Enums. Nathaniel Osgood MIT April 25, 2012

PES Institute of Technology-BSC QUESTION BANK

CS 4604: Introduc0on to Database Management Systems. B. Aditya Prakash Lecture #5: En-ty/Rela-onal Models- - - Part 1

Topics in basic DBMS course

Transcription:

Introduc*on to Informa(on Retrieval Lecture 4: Dic*onaries and tolerant retrieval 1

Ch. 3 This lecture Dic*onary data structures Tolerant retrieval Wild-card queries Spelling correc*on Soundex 2

Sec. 3.1 Dic*onary data structures for inverted indexes The dic*onary data structure stores the term vocabulary, document frequency, pointers to each pos*ngs list in what data structure? 3

Sec. 3.1 A naïve dic*onary An array of struct: char[20] int Pos*ngs * 20 bytes 4/8 bytes 4/8 bytes How do we store a dic*onary in memory efficiently? How do we quickly look up elements at query *me? 4

Sec. 3.1 Dic*onary data structures Two main choices: Hash table Tree Some IR systems use hashes, some trees 5

Sec. 3.1 Hashes Each vocabulary term is hashed to an integer (We assume you ve seen hashtables before) Pros: Lookup is faster than for a tree: O(1) Cons: No easy way to find minor variants: judgment/judgement No prefix search [tolerant retrieval] If vocabulary keeps growing, need to occasionally do the expensive opera*on of rehashing everything 6

Sec. 3.1 Tree: binary tree a-m Root n-z a-hu hy-m n-sh si-z 7

Sec. 3.1 Tree: B-tree a-hu hy-m n-z Defini*on: Every internal nodel has a number of children in the interval [a,b] where a, b are appropriate natural numbers, e.g., [2,4]. 8

Sec. 3.1 Trees Simplest: binary tree More usual: B-trees Trees require a standard ordering of characters and hence strings but we standardly have one Pros: Solves the prefix problem (terms star*ng with hyp) Cons: Slower: O(log M) [and this requires balanced tree] Rebalancing binary trees is expensive But B-trees mi*gate the rebalancing problem 9

Sec. 3.1 Tries Pros: Fast exact search: O( Q ) *me Support other func*onali*es, e.g., longest-prefix match Cons: Naïve implementa*on takes much space. A to tea ted ten i in inn 10

WILD-CARD QUERIES 11

Sec. 3.2 Wild-card queries: * mon*: find all docs containing any word beginning mon. Easy with binary tree (or B-tree) lexicon: retrieve all words in range: mon w < moo *mon: find words ending in mon : harder Maintain an addi*onal B-tree for terms backwards. Can retrieve all words in range: nom w < non. Exercise: from this, how can we enumerate all terms meeting the wild-card query pro*cent? 12

Sec. 3.2 Query processing At this point, we have an enumera*on of all terms in the dic*onary that match the wild-card query. We s*ll have to look up the pos*ngs for each enumerated term. E.g., consider the query: se*ate AND fil*er This may result in the execu*on of many Boolean AND queries. 13

B-trees handle * s at the end of a Sec. 3.2 query term How can we handle * s in the middle of query term? co*2on We could look up co* AND *2on in a B-tree and intersect the two term sets Expensive S*ll need verifica2on to remove false-posi2ves The solu*on: transform wild-card queries so that the * s occur at the end This gives rise to the Permuterm Index. 14

Sec. 3.2.1 Permuterm index For term hello, index under: hello$, ello$h, llo$he, lo$hel, o$hell where $ is a special symbol. Queries: P P* *P *P* Exact match P$ Range match $P* Range match P$* Range match P* P*Q P*Q*R??? Exercise! Range match Q$P* P $?? Q: Why not P*$*? Query = hel*o P=hel, Q=o Lookup o$hel* 15

Sec. 3.2.1 Permuterm query processing Rotate query wild-card to the right Now use B-tree lookup as before. Permuterm problem: quadruples lexicon size Empirical observation for English. How to perform a precise analysis? 16

Sec. 3.2.2 Bigram (k-gram) indexes Enumerate all k-grams (sequence of k chars) occurring in any term e.g., from text April is the cruelest month we get the 2-grams (bigrams) $a,ap,pr,ri,il,l$,$i,is,s$,$t,th,he,e$,$c,cr,ru, ue,el,le,es,st,t$, $m,mo,on,nt,h$ $ is a special word boundary symbol Maintain a second inverted index from bigrams to dic2onary terms that match each bigram. 17

Sec. 3.2.2 Bigram index example The k-gram index finds terms based on a query consis*ng of k-grams (here k=2). $m mace madden mo on among among amortize mound 18

Sec. 3.2.2 Processing wild-cards Query mon* can now be run as $m AND mo AND on Gets terms that match AND version of our wildcard query. But we d enumerate moon. Must verify these terms against query. Surviving enumerated terms are then looked up in the term-document inverted index. Fast, space efficient (compared to permuterm). 19

Sec. 3.2.2 Processing wild-card queries As before, we must execute a Boolean query for each enumerated, filtered term. Wild-cards can result in expensive query execu*on (very large disjunc*ons ) pyth* AND prog* If you encourage laziness people will respond! Type your search terms, use * if you need to. E.g., Alex* will match Alexander. Search Which web search engines allow wildcard queries? 20

Sec. 3.5 Resources IIR 3, MG 4.2 Efficient spell retrieval: K. Kukich. Techniques for automa*cally correc*ng words in text. ACM Compu*ng Surveys 24(4), Dec 1992. J. Zobel and P. Dart. Finding approximate matches in large lexicons. Sosware - prac*ce and experience 25(3), March 1995. htp://citeseer.ist.psu.edu/zobel95finding.html Mikael Tillenius: Efficient Genera*on and Ranking of Spelling Error Correc*ons. Master s thesis at Sweden s Royal Ins*tute of Technology. htp://citeseer.ist.psu.edu/179155.html Nice, easy reading on spell correc(on: Peter Norvig: How to write a spelling corrector htp://norvig.com/spell-correct.html 21