Modern Information Retrieval
Chapter 7: Text Processing
Fall 2011
Saif Rababah

Overview
1. Document pre-processing
   1. Lexical analysis
   2. Stopword elimination
   3. Stemming
   4. Index-term selection
   5. Thesauri
2. Text compression
   1. Statistical methods
   2. Huffman coding
   3. Dictionary methods
   4. Ziv-Lempel compression

Document Preprocessing
Document pre-processing is the process of incorporating a new document into an information retrieval system. The goal is to:
- Represent the document efficiently in terms of both space (for storing the document) and time (for processing retrieval requests).
- Maintain good retrieval performance (precision and recall).
Document pre-processing is a complex process that leads to the representation of each document by a select set of index terms. However, some Web search engines are giving up on much of this process and index all (or virtually all) the words in a document.

Document Preprocessing (cont.)
Document pre-processing includes 5 stages:
1. Lexical analysis
2. Stopword elimination
3. Stemming
4. Index-term selection
5. Construction of thesauri

Lexical analysis
Objective: Determine the words of the document.
Lexical analysis separates the input alphabet into:
- Word characters (e.g., the letters a-z)
- Word separators (e.g., space, newline, tab)
The following decisions may have an impact on retrieval:
- Digits: used to be ignored, but the trend now is to identify numbers (e.g., telephone numbers) and mixed strings as words.
- Punctuation marks: usually treated as word separators.
- Hyphens: should we interpret pre-processing as "pre processing" or as "preprocessing"?
- Letter case: often ignored, but then a search for "First Bank" (a specific bank) would retrieve a document with the phrase "Bank of America was the first bank to offer its customers...".
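To make these choices concrete, here is a minimal sketch of a lexical analyzer in Python. The character classes, the lowercasing, and the decision to keep digit strings as words are assumptions chosen for the example, not prescribed by the slides:

```python
import re

# A minimal lexical analyzer: word characters are letters and digits;
# everything else (spaces, newlines, punctuation, hyphens) separates words.
# Hyphens are treated as separators, so "pre-processing" becomes
# ["pre", "processing"]; letter case is ignored by lowercasing.
TOKEN = re.compile(r"[a-z0-9]+")

def tokenize(text: str) -> list[str]:
    return TOKEN.findall(text.lower())

print(tokenize("Pre-processing the First Bank data, tel. 555-1234"))
# ['pre', 'processing', 'the', 'first', 'bank', 'data', 'tel', '555', '1234']
```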

Stopword elimination
Objective: Filter out words that occur in most of the documents.
Such words have no value for retrieval purposes; they are referred to as stopwords. They include:
- Articles (a, an, the, ...)
- Prepositions (in, on, of, ...)
- Conjunctions (and, or, but, if, ...)
- Pronouns (I, you, them, it, ...)
- Possibly some verbs, nouns, adverbs, adjectives (make, thing, similar, ...)
A typical stopword list may include several hundred words. As seen earlier, the 100 most frequent words add up to about 50% of the words in a document. Hence, stopword elimination substantially reduces the size of the indexing structures.
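A minimal sketch of this filter in Python; the tiny stopword set below is illustrative only (a real list would hold several hundred words, as the slide notes):

```python
# Illustrative stopword set; a production list would be much longer.
STOPWORDS = {"a", "an", "the", "in", "on", "of", "and", "or", "but", "if",
             "i", "you", "them", "it"}

def remove_stopwords(tokens: list[str]) -> list[str]:
    return [t for t in tokens if t not in STOPWORDS]

print(remove_stopwords(["the", "first", "bank", "in", "america"]))
# ['first', 'bank', 'america']
```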

Stemming
Objective: Replace all the variants of a word with the single stem of the word.
Variants include plurals, gerund forms (ing-forms), third-person suffixes, past-tense suffixes, etc.
Example: connect: connects, connected, connecting, connection, ...
All have similar semantics and relate to a single concept.
In parallel, stemming must be performed on the user query.

Stemming (cont.)
Stemming improves:
- Storage and search efficiency: fewer terms are stored.
- Recall: without stemming, a query about "connection" matches only documents that contain "connection". With stemming, the query is about "connect" and additionally matches documents that originally had "connects", "connected", "connecting", etc.
However, stemming may hurt precision, because users can no longer target just a particular form.
Stemming may be performed using:
- Algorithms that strip off suffixes according to substitution rules (see the sketch below).
- Large dictionaries that provide the stem of each word.
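A minimal sketch of rule-based suffix stripping in Python. The handful of substitution rules below is an assumption made for illustration; real stemmers such as Porter's use many more rules plus conditions on the remaining stem:

```python
# Suffix-substitution rules, tried in order: (suffix, replacement).
# The length check keeps very short words (e.g., "is") from being stripped.
RULES = [("ies", "y"), ("ing", ""), ("ed", ""), ("ion", ""), ("s", "")]

def stem(word: str) -> str:
    for suffix, replacement in RULES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: len(word) - len(suffix)] + replacement
    return word

for w in ["connects", "connected", "connecting", "connection"]:
    print(w, "->", stem(w))
# connects -> connect, connected -> connect,
# connecting -> connect, connection -> connect
```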

Index term selection (indexing)
Objective: Increase efficiency by extracting from the resulting document a selected set of terms to be used for indexing the document.
If a full-text representation is adopted, then all words are used for indexing.
Indexing is a critical process: a user's ability to find documents on a particular subject is limited by the indexing process having created index terms for this subject.
Indexing can be done manually or automatically. Historically, manual indexing was performed by professional indexers associated with library organizations. However, automatic indexing is more common now (or, with full-text representations, indexing is altogether avoided).

Indexing (cont.)
Relative advantages of manual indexing:
- Ability to perform abstractions (conclude what the subject is) and determine additional related terms.
- Ability to judge the value of concepts.
Relative advantages of automatic indexing:
- Reduced cost: once the initial hardware cost is amortized, operational cost is cheaper than wages for human indexers.
- Reduced processing time.
- Improved consistency.
Controlled vocabulary: index terms must be selected from a predefined set of terms (the domain of the index).
- Use of a controlled vocabulary helps standardize the choice of terms.
- Searching is improved, because users know the vocabulary being used.
- Thesauri can compensate for the lack of controlled vocabularies.

Indexing (cont.)
Index exhaustivity: the extent to which concepts are indexed. Should we index only the most important concepts, or also more minor concepts?
Index specificity: the preciseness of the index terms used. Should we use general indexing terms or more specific terms? Should we use the term "computer", "personal computer", or "Gateway E-3400"?
Main effect:
- High exhaustivity improves recall (decreases precision).
- High specificity improves precision (decreases recall).
Related issues:
- Index title and abstract only, or the entire document?
- Should index terms be weighted?

Indexing (cont.)
Reducing the size of the index:
- Recall that articles, prepositions, conjunctions, and pronouns have already been removed through a stopword list.
- Recall that the 100 most frequent words account for 50% of all word occurrences.
- Words that are very infrequent (occur only a few times in a collection) are often removed, under the assumption that they would probably not be in the user's vocabulary.
Reduction not based on probabilistic arguments:
- Nouns are often preferred over verbs, adjectives, or adverbs.

Indexing (cont.)
Indexing may also assign weights to terms.
Non-weighted indexing:
- No attempt is made to determine the value of the different terms assigned to a document.
- It is not possible to distinguish between major topics and casual references; all retrieved documents are equal in value.
- Typical of commercial systems through the 1980s.
Weighted indexing:
- An attempt is made to place a value on each term as a description of the document.
- This value is related to the frequency of occurrence of the term in the document (higher is better), but also to the number of collection documents that use this term (lower is better).
- Query weights and document weights are combined into a value describing the likelihood that a document matches a query. A sketch of such a weighting follows.
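The slide's two criteria (in-document frequency counts up, collection-wide document frequency counts down) have exactly the shape of tf-idf weighting. Here is a minimal sketch in Python; the particular tf and idf formulas, and the toy collection, are assumptions for illustration (many variants exist):

```python
import math

# Toy collection: each document is a list of index terms.
docs = [["huffman", "coding", "compression"],
        ["dictionary", "compression"],
        ["huffman", "tree"]]

def tf_idf(term: str, doc: list[str]) -> float:
    tf = doc.count(term)                    # higher in-document frequency is better
    df = sum(1 for d in docs if term in d)  # fewer documents containing the term is better
    idf = math.log2(len(docs) / df) if df else 0.0
    return tf * idf

print(tf_idf("huffman", docs[0]))      # in 2 of 3 docs -> modest weight (~0.58)
print(tf_idf("dictionary", docs[1]))   # in 1 of 3 docs -> higher weight (~1.58)
```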

Thesauri
Objective: Standardize the index terms that were selected.
In its simplest form, a thesaurus is:
- A list of important words (concepts).
- For each word, an associated list of synonyms.
A thesaurus may be generic (cover all of English) or concentrate on a particular domain of knowledge.
The role of a thesaurus in information retrieval:
- Provide a standard vocabulary for indexing.
- Help users locate proper query terms.
- Provide hierarchies for automatic broadening or narrowing of queries.
Here, our interest is in providing a standard vocabulary (a controlled vocabulary). Essentially, in this final stage, each indexing term is replaced by the concept that defines its thesaurus class.
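As a sketch of this final replacement stage, here is a minimal Python version; the synonym table is an invented example, not from the slides:

```python
# Map every term of a thesaurus class to the concept that defines the class.
THESAURUS = {"car": "automobile", "auto": "automobile",
             "automobile": "automobile", "vehicle": "automobile"}

def normalize(terms: list[str]) -> list[str]:
    # Terms outside the thesaurus are kept unchanged.
    return [THESAURUS.get(t, t) for t in terms]

print(normalize(["car", "auto", "engine"]))
# ['automobile', 'automobile', 'engine']
```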

Text Compression
Data encoding: transform encoding units (characters, words, etc.) into code values. The objective is either to:
- Reduce size (compression), or
- Hide contents (encryption).
Lossless encoding: the transformation is reversible; the original file can be recovered from the encoded (compressed, encrypted) file.
Compression ratio:
- S: size of the uncompressed file.
- C: size of the compressed file.
- Compression ratio = C/S.
Example: S = 300,000 bytes, C = 100,000 bytes. Compression ratio: 100,000/300,000 = 0.33.

Text Compression (cont.)
Advantages of compression:
- Reduction in storage size.
- Reduction in transmission time.
- Reduction in processing times (e.g., searching).
Disadvantages:
- Requires time for compression/decompression.
- Processing of compressed text is more complex.
Specific to information retrieval:
- Decompression time is often more critical than compression time. Unlike transmission-motivated compression (modems), documents in an information retrieval system are encoded once and decoded many times.
- Prefer compression techniques that allow searching in the compressed file (without decompressing it).

Text compression (cont.)
Basic methods:
Statistical methods:
- Estimate the probability of occurrence of each encoding unit (character or word) in the alphabet.
- Assign codes to units: more frequent units are assigned shorter codes.
- In information retrieval, word encoding is preferred over character encoding.
Dictionary methods:
- Substitute a phrase (a string of units) by a pointer to a dictionary or to a previous occurrence of the phrase.
- Compression is achieved because the pointer is shorter than the phrase.

Statistical methods
Recall from the discussion of information theory:
- Assume a message from an alphabet of n symbols.
- Assume that the probability of the i-th symbol is p_i.
- The average information content (entropy) is:
  E = -\sum_{i=1}^{n} p_i \log_2(p_i)
Optimal encoding is achieved when a symbol with probability p_i is assigned a code whose length is \log_2(1/p_i) = -\log_2(p_i). Hence, E also represents the optimal average code length (measured in bits per character). Therefore, E is the lower bound on compression.
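As a worked illustration of this formula, here is a minimal Python sketch; the four-symbol distribution is invented for the example:

```python
import math

def entropy(probs: list[float]) -> float:
    # E = -sum(p_i * log2(p_i)): the optimal average code length
    # in bits per symbol, and hence the lower bound on compression.
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Invented example distribution over 4 symbols.
print(entropy([0.5, 0.25, 0.125, 0.125]))   # 1.75 bits per symbol
```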

Statistical methods (cont.)
Statistical methods must first estimate the frequencies of the encoding units, and then assign codes based on these frequencies. Approaches:
Static:
- Use a single distribution for all texts.
- Fast, but not optimal, because different texts exhibit different distributions.
- The encoding table is stored in the application (not in the text).
- Decompression can start at any point in the file.
Dynamic:
- Determine the frequencies in a preliminary pass.
- Excellent compression, but a total of two passes is required.
- The encoding table is stored at the beginning of the text.
- Decompression can start at any point in the file.
Adaptive:
- Progressively learn the distribution of the text while compressing; each character is encoded on the basis of the preceding characters in the text.
- Fast, and close to optimal compression.
- Decompression must start from the beginning.

Huffman coding
General:
- Huffman coding is one of the best-known compression techniques (1952). It is used in the Unix programs pack/unpack.
- It is a statistical method based on variable-length codes. Compression is achieved by assigning shorter codes to more frequent units.
- Decompression is unique because no code is the prefix of another.
- Encoding units may be either bytes or words.
- Does not exploit the dependencies between the encoding units; yields the optimal average code length when these units are independent.
- Can be used with the static, dynamic, and adaptive approaches.

Huffman coding (cont.)
Method:
1. Build a table of the encoding units and their frequencies (probabilities).
2. Combine the two least frequent units into a new unit whose probability is the sum of the two probabilities:
   new unit (p1+p2) <- unit1 (p1), unit2 (p2)
3. Repeat this process until the entire dictionary is represented by a root whose probability is 1.0.
4. When there is a tie for the two least frequent units, any tie-breaking procedure is acceptable.
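Here is a minimal sketch of this method in Python, using a heap to repeatedly extract the two least frequent units. The probabilities in the usage line are invented for illustration, and since tie-breaking is arbitrary (step 4), the exact codes may vary:

```python
import heapq
from itertools import count

def huffman_codes(freqs: dict[str, float]) -> dict[str, str]:
    # Each heap entry: (probability, tiebreaker, tree).
    # A tree is either a symbol (leaf) or a (left, right) pair.
    tie = count()  # any tie-breaking procedure is acceptable (step 4)
    heap = [(p, next(tie), sym) for sym, p in freqs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        p1, _, t1 = heapq.heappop(heap)   # the two least frequent units...
        p2, _, t2 = heapq.heappop(heap)
        heapq.heappush(heap, (p1 + p2, next(tie), (t1, t2)))  # ...combined
    codes: dict[str, str] = {}
    def walk(tree, code):
        if isinstance(tree, str):
            codes[tree] = code or "0"     # degenerate single-symbol case
        else:
            walk(tree[0], code + "0")
            walk(tree[1], code + "1")
    walk(heap[0][2], "")
    return codes

print(huffman_codes({"a": 0.5, "b": 0.25, "c": 0.15, "d": 0.10}))
# e.g. {'a': '0', 'b': '10', 'd': '110', 'c': '111'} (codes may vary with ties)
```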

Huffman coding (cont.)
Example: (the slide shows the step-by-step construction of a Huffman tree for a 10-unit alphabet.)

Huffman coding (cont.)
Example (cont.): The resulting code has average code length
  \sum_{i=1}^{10} p_i l_i = 3.05 bits,
where l_i is the length of the code assigned to the i-th unit. The entropy (the lower bound on compression) is at most this value.
A fixed code length would have required \log_2 10 = 3.32 bits (which, in practice, would require 4 bits).
Compression ratio: C/S = 3.05/3.32 = 0.92.

Huffman coding (cont.)
Example: When the letters A-Z are thus encoded:
- Code lengths are between 3 and 10 bits.
- The average code length is 4.12 bits.
- A fixed code would have required \log_2 26 = 4.70 bits (i.e., 5 bits in practice).
More compression is obtained by encoding words: when the 800 most frequent English words (a small table!) are encoded with this method, and all other words are left in plain ASCII, 40-50% compression has been reported.
Huffman codes have the prefix property: no code is the beginning of another code. Hence, a left-to-right decoding operation is unique, and it is possible to search the compressed text.

Dictionary methods
Dictionary methods construct a dictionary of phrases, and replace their occurrences with dictionary pointers. The choice of phrases may be static, dynamic, or adaptive.
A simple method (digrams):
- Construct a dictionary of pairs of letters that occur together frequently (e.g., ou, ea, ch, ...).
- If n such pairs are used, a pointer (a location in the dictionary) requires \log_2 n bits.
- At each step in the encoding, the next pair is examined. If it corresponds to a dictionary pair, it is replaced by its encoding, and the encoding position moves by 2 characters. Otherwise, the single-character encoding is kept, and the position moves by one character.
- To ensure that decoding is unambiguous, an extra bit is needed to indicate whether the next unit is a single-character code or a digram code.
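A minimal sketch of digram encoding in Python. The digram table is an invented example, and the output is kept as symbolic (flag, code) pairs rather than packed bits; the flag is the extra bit the slide mentions:

```python
# Flag 1 means "digram dictionary pointer"; flag 0 means "single-character code".
DIGRAMS = ["ou", "ea", "ch", "th"]

def digram_encode(text: str) -> list[tuple[int, object]]:
    out, i = [], 0
    while i < len(text):
        pair = text[i:i + 2]
        if pair in DIGRAMS:
            out.append((1, DIGRAMS.index(pair)))  # pointer: log2(n) bits
            i += 2                                # position moves by 2
        else:
            out.append((0, text[i]))              # plain character code
            i += 1                                # position moves by 1
    return out

print(digram_encode("theach"))
# [(1, 3), (1, 1), (1, 2)]  ->  th, ea, ch
```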

Ziv-Lempel compression
General:
- The Ziv-Lempel method (1977) uses a single-pass adaptive scheme: while compressing, it constructs a dictionary from the phrases encountered so far.
- Many popular programs (Unix compress/uncompress, GNU gzip/gunzip, and Windows WinZip) are based on the Ziv-Lempel algorithm.
- Compression is slightly better than Huffman codes (C/S of 45% vs. 55%).
- Disadvantage for information retrieval: the compressed file cannot be searched, and decoding cannot start at a random place in the file.

Ziv-Lempel compression (cont.)
Compression:
1. Initialize the dictionary to contain all phrases of length one.
2. Examine the input stream and search for the longest phrase that has appeared in the dictionary.
3. Encode this phrase by its index in the dictionary.
4. Add the phrase followed by the next symbol in the input stream to the dictionary.
5. Go to step 2.
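Here is a minimal sketch of these five steps in Python. It assumes the input text is drawn entirely from the given single-character alphabet, and it emits plain integer indices rather than packing them into fixed-width bit pointers:

```python
def zl_compress(text: str, alphabet: str) -> list[int]:
    # Step 1: the dictionary initially holds all phrases of length one.
    dictionary = {c: i for i, c in enumerate(alphabet)}
    out, i = [], 0
    while i < len(text):
        # Step 2: find the longest phrase starting at i that is in the dictionary.
        j = i + 1
        while j <= len(text) and text[i:j] in dictionary:
            j += 1
        phrase = text[i:j - 1]
        out.append(dictionary[phrase])       # step 3: emit its dictionary index
        if j <= len(text):                   # step 4: add phrase + next symbol
            dictionary[text[i:j]] = len(dictionary)
        i += len(phrase)                     # step 5: continue from the next unit
    return out

print(zl_compress("abababab", "ab"))
# [0, 1, 2, 4, 1]  -- indices into the growing dictionary
```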

Ziv-Lempel compression (cont.)
Example: Assume a dictionary of 16 phrases (4-bit indices). This case does not result in compression:
- Source: 25 characters in a 2-character alphabet require a total of 25 bits.
- Output: 13 pointers of 4 bits require a total of 52 bits.
This is because the length of the input data in this example is too short. In practice, the Lempel-Ziv algorithm works well only when the input data is sufficiently large and there is sufficient redundancy in the data.