Islamic University of Gaza
Faculty of Engineering
Computer Engineering Department
Information Storage and Retrieval (ECOM 5124)

IR HW 5+6: Scoring, term weighting and the vector space model
Eng. Mohammed Abdualal

Exercise 6.2
In Example 6.1 above with weights g1 = 0.2, g2 = 0.31 and g3 = 0.49, what are all the distinct score values a document may get?

The zone weights are: author = 0.2, title = 0.31, body = 0.49. A query term may appear in any subset of the three zones, so the distinct scores are:

    0       the term appears in no zone
    0.2     author zone only
    0.31    title zone only
    0.49    body zone only
    0.51    author and title zones, but not body
    0.69    author and body zones, but not title
    0.8     title and body zones, but not author
    1       all three zones
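The enumeration above can be double-checked by summing the zone weights over every subset of zones; a short Python sketch (the zone names and weights come straight from the exercise):

```python
from itertools import product

# Zone weights from Exercise 6.2: author, title, body.
weights = {"author": 0.2, "title": 0.31, "body": 0.49}

# A document's score is the sum of the weights of the zones the term
# appears in, so enumerate every subset via 0/1 indicator vectors.
scores = sorted({
    round(sum(w for w, present in zip(weights.values(), presence) if present), 2)
    for presence in product([0, 1], repeat=len(weights))
})
print(scores)  # [0, 0.2, 0.31, 0.49, 0.51, 0.69, 0.8, 1.0]
```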

Exercise 6.5
Apply Equation (6.6) to the sample training set in Figure 6.5 to estimate the best value of g for this sample.

The optimal g is

    g* = (n_10r + n_01n) / (n_10r + n_10n + n_01r + n_01n)

where, for example, n_01r is the number of training samples for which s_T(d_j, q_j) = 0 and s_B(d_j, q_j) = 1, i.e., the sample is labeled relevant and the term occurs in the body but not the title, and n_10n is the number of samples labeled nonrelevant with the term occurring in the title but not the body. From Figure 6.5:

    n_10r = 0, n_10n = 1, n_01r = 2, n_01n = 1

    g* = (0 + 1) / (0 + 1 + 2 + 1) = 1/4 = 0.25

Exercise 6.6
For the value of g estimated in Exercise 6.5, compute the weighted zone score for each (query, document) example. How do these scores relate to the relevance judgments in Figure 6.5 (quantized to 0/1)?

    Example   Score   Relevance judgment
    -----------------------------------
    1         1       R
    2         3/4     NR
    3         3/4     R
    4         0       NR
    5         1       R
    6         3/4     R
    7         1/4     NR

Nonrelevant examples tend to receive lower scores.
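The count-and-divide estimate and the resulting zone scores can be sketched in Python. The seven (s_T, s_B, judgment) triples below are a reconstruction consistent with the counts above, since Figure 6.5 itself is not reproduced here:

```python
# (s_T, s_B, judgment) triples, reconstructed to match the counts
# n_10r = 0, n_10n = 1, n_01r = 2, n_01n = 1 derived above.
examples = [
    (1, 1, "R"), (0, 1, "NR"), (0, 1, "R"), (0, 0, "NR"),
    (1, 1, "R"), (0, 1, "R"), (1, 0, "NR"),
]

n_10r = sum(1 for st, sb, j in examples if st == 1 and sb == 0 and j == "R")
n_10n = sum(1 for st, sb, j in examples if st == 1 and sb == 0 and j == "NR")
n_01r = sum(1 for st, sb, j in examples if st == 0 and sb == 1 and j == "R")
n_01n = sum(1 for st, sb, j in examples if st == 0 and sb == 1 and j == "NR")

g = (n_10r + n_01n) / (n_10r + n_10n + n_01r + n_01n)
print(g)  # 0.25

# Weighted zone score for each example at this g (Exercise 6.6).
scores = [g * st + (1 - g) * sb for st, sb, _ in examples]
print(scores)  # [1.0, 0.75, 0.75, 0.0, 1.0, 0.75, 0.25]
```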

Exercise 6.7
Why does the expression for g in (6.6) not involve training examples in which s_T(d_t, q_t) and s_B(d_t, q_t) have the same value?

    score(d, q) = g * s_T(d, q) + (1 - g) * s_B(d, q)

When s_T = s_B = s, the score equals g*s + (1 - g)*s = s regardless of g. Such examples therefore contribute a constant to the total error of Equation (6.4), so they drop out when minimizing the error with respect to g and cannot influence the optimum.

Exercise 6.9
What is the idf of a term that occurs in every document? Compare this with the use of stop word lists.

Its idf is zero: idf = log(N/N) = log 1 = 0. Such a term behaves like a stop word: it occurs in every document and carries no discriminating power, so giving it zero weight has the same effect as putting it on a stop word list.

Exercise 6.10
Consider the table of term frequencies for 3 documents denoted Doc1, Doc2, Doc3 in Figure 6.9. Compute the tf-idf weights for the terms car, auto, insurance, best, for each document, using the idf values from Figure 6.8.

    tf-idf = tf * idf

    term        Doc1    Doc2    Doc3
    --------------------------------
    car         44.55    6.60   39.60
    auto         6.24   68.64    0
    insurance    0      53.46   46.98
    best        21.00    0      25.50
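The tf-idf table can be reproduced mechanically. The tf counts and idf values below are the ones the weights above imply (tf from Figure 6.9, idf from Figure 6.8; transcribed here rather than quoted from the figures, so treat them as assumptions):

```python
# tf counts (Figure 6.9) and idf values (Figure 6.8), as implied by
# the tf-idf weights computed above.
tf = {
    "car":       {"Doc1": 27, "Doc2": 4,  "Doc3": 24},
    "auto":      {"Doc1": 3,  "Doc2": 33, "Doc3": 0},
    "insurance": {"Doc1": 0,  "Doc2": 33, "Doc3": 29},
    "best":      {"Doc1": 14, "Doc2": 0,  "Doc3": 17},
}
idf = {"car": 1.65, "auto": 2.08, "insurance": 1.62, "best": 1.5}

# tf-idf weight = raw term frequency times idf.
tfidf = {t: {d: round(c * idf[t], 2) for d, c in docs.items()}
         for t, docs in tf.items()}
print(tfidf["car"])  # {'Doc1': 44.55, 'Doc2': 6.6, 'Doc3': 39.6}
```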

Exercise 6.11
Can the tf-idf weight of a term in a document exceed 1?

Yes. Raw tf-idf weights are not bounded by 1: for example, a term occurring once in a document with idf > 1 (a term rare in the collection) already has tf-idf > 1, as the table in Exercise 6.10 shows.

Exercise 6.15
Recall the tf-idf weights computed in Exercise 6.10. Compute the Euclidean normalized document vectors for each of the documents, where each vector has four components, one for each of the four terms.

    term    car     auto    insurance   best
    ----------------------------------------
    Doc1    0.897   0.125   0.000       0.423
    Doc2    0.076   0.786   0.613       0.000
    Doc3    0.595   0.000   0.706       0.383

Exercise 6.16
Verify that the sum of the squares of the components of each of the document vectors in Exercise 6.15 is 1 (to within rounding error). Why is this the case?

    Doc1: 0.897^2 + 0.125^2 + 0.423^2 = 0.999
    Doc2: 0.076^2 + 0.786^2 + 0.613^2 = 0.999
    Doc3: 0.595^2 + 0.706^2 + 0.383^2 = 0.999

Because these are Euclidean-normalized (unit) vectors, and a unit vector has length 1 by construction; the sums differ from 1 only because the components were rounded.
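The normalization in Exercises 6.15 and 6.16 is plain vector arithmetic; a quick Python check using the tf-idf weights from Exercise 6.10 (small differences from the hand-rounded table, e.g. 0.126 vs 0.125, are rounding):

```python
import math

# tf-idf vectors over (car, auto, insurance, best), from Exercise 6.10.
docs = {
    "Doc1": [44.55, 6.24, 0.0, 21.0],
    "Doc2": [6.6, 68.64, 53.46, 0.0],
    "Doc3": [39.6, 0.0, 46.98, 25.5],
}

def normalize(v):
    """Divide each component by the vector's Euclidean length."""
    length = math.sqrt(sum(x * x for x in v))
    return [x / length for x in v]

unit = {d: normalize(v) for d, v in docs.items()}
print([round(x, 3) for x in unit["Doc1"]])  # [0.897, 0.126, 0.0, 0.423]

# Exercise 6.16: each normalized vector has squared length 1 up to
# floating-point error; the 0.999 by hand comes from rounded components.
for v in unit.values():
    assert abs(sum(x * x for x in v) - 1.0) < 1e-12
```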

Exercise 6.17
With term weights as computed in Exercise 6.15, rank the three documents by computed score for the query car insurance, for each of the following cases of term weighting in the query:
1. The weight of a term is 1 if present in the query, 0 otherwise.
2. Euclidean normalized idf.

1. The query vector is q = (1, 0, 1, 0) over (car, auto, insurance, best), so score(q, d) is the sum of the car and insurance components of each normalized document vector:

    Doc1: 0.897 + 0.000 = 0.897
    Doc2: 0.076 + 0.613 = 0.689
    Doc3: 0.595 + 0.706 = 1.301

   Ranking: Doc3 > Doc1 > Doc2.

2. With idf(car) = 1.65 and idf(insurance) = 1.62 from Figure 6.8, the Euclidean-normalized query vector is q = (0.714, 0, 0.701, 0):

    Doc1: 0.897 * 0.714 = 0.640
    Doc2: 0.076 * 0.714 + 0.613 * 0.701 = 0.484
    Doc3: 0.595 * 0.714 + 0.706 * 0.701 = 0.920

   The ranking is again Doc3 > Doc1 > Doc2, since the two query terms receive nearly equal weight.
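Both rankings reduce to a dot product with the normalized document vectors from Exercise 6.15; a Python check (idf values as in Exercise 6.10):

```python
import math

# Normalized document vectors over (car, auto, insurance, best), Exercise 6.15.
docs = {
    "Doc1": [0.897, 0.125, 0.000, 0.423],
    "Doc2": [0.076, 0.786, 0.613, 0.000],
    "Doc3": [0.595, 0.000, 0.706, 0.383],
}

def rank(query):
    """Rank documents by dot-product score against the query vector."""
    scores = {d: sum(q * x for q, x in zip(query, v)) for d, v in docs.items()}
    return sorted(scores, key=scores.get, reverse=True)

# Case 1: binary query weights for "car insurance".
binary_q = [1, 0, 1, 0]
print(rank(binary_q))  # ['Doc3', 'Doc1', 'Doc2']

# Case 2: Euclidean-normalized idf, idf(car) = 1.65, idf(insurance) = 1.62.
idf_q = [1.65, 0.0, 1.62, 0.0]
length = math.sqrt(sum(x * x for x in idf_q))
norm_q = [x / length for x in idf_q]
print(rank(norm_q))  # ['Doc3', 'Doc1', 'Doc2']
```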

Exercise 6.19
Compute the vector space similarity between the query "digital cameras" and the document "digital cameras and video cameras" by filling out the empty columns in Table 6.1. Assume N = 10,000,000, logarithmic term weighting (wf columns) for query and document, idf weighting for the query only, and cosine normalization for the document only. Treat "and" as a stop word. Enter term counts in the tf columns. What is the final similarity score?
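A sketch of the computation in Python. The df values (digital 10,000; cameras 50,000; video 100,000) are what Table 6.1 is understood to give, transcribed from memory, so treat them as assumptions; the weighting scheme itself follows the exercise statement (log tf on both sides, idf on the query only, cosine normalization on the document only):

```python
import math

N = 10_000_000
df = {"digital": 10_000, "cameras": 50_000, "video": 100_000}  # Table 6.1 (assumed)

query = "digital cameras".split()
# "and" is treated as a stop word, per the exercise.
doc = [w for w in "digital cameras and video cameras".split() if w != "and"]

def wf(tf):
    """Logarithmic term weighting: 1 + log10(tf) for tf > 0, else 0."""
    return 1 + math.log10(tf) if tf > 0 else 0.0

terms = sorted(df)
q_tf = {t: query.count(t) for t in terms}
d_tf = {t: doc.count(t) for t in terms}

# Query side: log tf times idf, no normalization.
q_w = {t: wf(q_tf[t]) * math.log10(N / df[t]) for t in terms}
# Document side: log tf, cosine-normalized, no idf.
d_raw = {t: wf(d_tf[t]) for t in terms}
length = math.sqrt(sum(v * v for v in d_raw.values()))
d_w = {t: v / length for t, v in d_raw.items()}

score = sum(q_w[t] * d_w[t] for t in terms)
print(round(score, 2))  # 3.12 under the assumed df values
```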

Exercise 6.20
Show that for the query affection, the relative ordering of the scores of the three documents in Figure 6.13 is the reverse of the ordering of the scores for the query jealous gossip.

For the query affection: score(q, SaS) = 0.996, score(q, PaP) = 0.993, score(q, WH) = 0.847, so the ordering is SaS, PaP, WH. For the query jealous gossip (summing the two term scores): score(q, SaS) = 0.104, score(q, PaP) = 0.12, score(q, WH) = 0.72, so the ordering is WH, PaP, SaS. The latter ordering is the reverse of the former.

Exercise 6.23
Refer to the tf and idf values for four terms and three documents in Exercise 6.10. Compute the two top-scoring documents on the query best car insurance for each of the following weighting schemes: (i) nnn.atc; (ii) ntc.atc.

(i)
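The reversal in Exercise 6.20 can be checked with the normalized log-frequency weights from Figure 6.13; the per-term values below are transcribed (they sum to the scores quoted above, so treat the transcription as an assumption):

```python
# Normalized log-frequency term weights from Figure 6.13.
weights = {
    "affection": {"SaS": 0.996, "PaP": 0.993, "WH": 0.847},
    "jealous":   {"SaS": 0.087, "PaP": 0.120, "WH": 0.466},
    "gossip":    {"SaS": 0.017, "PaP": 0.000, "WH": 0.254},
}

def rank(query_terms):
    """Rank documents by the sum of their weights for the query terms."""
    scores = {d: sum(weights[t][d] for t in query_terms)
              for d in ("SaS", "PaP", "WH")}
    return sorted(scores, key=scores.get, reverse=True)

print(rank(["affection"]))          # ['SaS', 'PaP', 'WH']
print(rank(["jealous", "gossip"]))  # ['WH', 'PaP', 'SaS']
# The second ordering is exactly the reverse of the first.
```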