F. Aiolli - Sistemi Informativi 2007/2008




RSV of the Vector Space Model

The matching function RSV is the cosine of the angle between the two vectors:

$$\mathrm{RSV}(d_i, q_j) = \cos(\alpha) = \frac{\sum_{k=1}^{n} w_{ki}\, w_{kj}}{\|d_i\|_2 \, \|q_j\|_2} = \frac{\sum_{k=1}^{n} w_{ki}\, w_{kj}}{\sqrt{\sum_{k=1}^{n} w_{ki}^2}\, \sqrt{\sum_{k=1}^{n} w_{kj}^2}}$$

Note that, given two vectors (either docs or queries), their similarity is determined by their direction and orientation, but not by their magnitude! You can think of them as lying on the sphere of unit radius.

The following three options return the same value:
- the cosine of two generic vectors x_1 and x_2
- the cosine of the two vectors v(x_1) and v(x_2), where v(x) is the length-normalized version of x
- the inner product of the vectors v(x_1) and v(x_2)
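As a minimal sketch (the weight vectors below are made-up, not from the slides), the RSV can be computed directly from two term-weight vectors:

```python
import math

def rsv(d, q):
    """Cosine of the angle between two term-weight vectors."""
    num = sum(wd * wq for wd, wq in zip(d, q))
    den = math.sqrt(sum(w * w for w in d)) * math.sqrt(sum(w * w for w in q))
    return num / den

# Hypothetical 3-term weight vectors for a document and a query.
d_i = [0.5, 0.8, 0.3]
q_j = [0.9, 0.1, 0.0]

print(rsv(d_i, q_j))
# Only direction matters: scaling a vector leaves the RSV unchanged.
print(rsv([2 * w for w in d_i], q_j) == rsv(d_i, q_j))  # True
```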

Normalized vectors

For normalized vectors, the cosine is simply the dot product:

$$\cos(\vec{d}_j, \vec{d}_k) = \vec{d}_j \cdot \vec{d}_k$$

Exercise: the Euclidean distance between vectors is

$$|\vec{d}_j - \vec{d}_k| = \sqrt{\sum_{i=1}^{n} (d_{i,j} - d_{i,k})^2}$$

Show that, for normalized vectors, the Euclidean distance gives the same proximity ordering as the cosine measure.
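A quick numerical check of the exercise, using a few random unit vectors (purely illustrative, not a proof):

```python
import math
import random

def normalize(v):
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def cosine(a, b):
    # For unit vectors the cosine is just the dot product.
    return sum(x * y for x, y in zip(a, b))

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

q = normalize([random.random() for _ in range(5)])
docs = [normalize([random.random() for _ in range(5)]) for _ in range(4)]

by_cosine = sorted(range(4), key=lambda i: cosine(q, docs[i]), reverse=True)
by_distance = sorted(range(4), key=lambda i: euclidean(q, docs[i]))

# Same ordering, because ||q - d||^2 = 2 (1 - cos(q, d)) for unit vectors.
assert by_cosine == by_distance
```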

The effect of normalization

From the point of view of ranking, the three previous options are also equivalent to computing the (inverse of the) distance between v(x_1) and v(x_2).

Normalization implies that the RSV computes a qualitative similarity (what kind of information the vectors contain) rather than a quantitative one (how much information they contain). This has been experimentally shown to yield improved results.

We can verify that in the VSM, RSV(d, d) = 1 (the ideal document property). The Boolean, Fuzzy Logic, and other models also have this property.
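A tiny check of both points (the weight vector is made up): cosine self-similarity is 1, and after normalization a document containing twice as much of the same information is indistinguishable from the original:

```python
import math

def cosine(a, b):
    num = sum(x * y for x, y in zip(a, b))
    den = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return num / den

d = [3.0, 1.0, 4.0]             # made-up term-weight vector
d_doubled = [2 * w for w in d]  # same kind of information, twice as much

print(cosine(d, d))          # ~1.0 -> the ideal document property RSV(d, d) = 1
print(cosine(d, d_doubled))  # also ~1.0 -> similarity is qualitative, not quantitative
```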

Example

Docs: Austen's Sense and Sensibility (SaS) and Pride and Prejudice (PaP); Bronte's Wuthering Heights (WH).

Term counts:

              SaS   PaP   WH
  affection   115    58   20
  jealous      10     7   11
  gossip        2     0    6

Length-normalized weights:

              SaS     PaP     WH
  affection   0.996   0.993   0.847
  jealous     0.087   0.120   0.466
  gossip      0.017   0.000   0.254

cos(SaS, PaP) = 0.996 × 0.993 + 0.087 × 0.120 + 0.017 × 0.000 ≈ 0.999
cos(SaS, WH) = 0.996 × 0.847 + 0.087 × 0.466 + 0.017 × 0.254 ≈ 0.889
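The slide's numbers can be reproduced directly from the raw counts; a minimal sketch:

```python
import math

# Raw term counts from the first table (terms: affection, jealous, gossip).
counts = {
    "SaS": [115, 10, 2],
    "PaP": [58, 7, 0],
    "WH":  [20, 11, 6],
}

def normalize(v):
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

# Length normalization reproduces the second table (SaS -> 0.996, 0.087, 0.017).
vecs = {name: normalize(v) for name, v in counts.items()}

def cos(a, b):
    return sum(x * y for x, y in zip(vecs[a], vecs[b]))

print(round(cos("SaS", "PaP"), 3))  # 0.999
print(round(cos("SaS", "WH"), 3))   # 0.889
```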

Advantages of the VSM

1. Flexibility. This is the most decisive factor in establishing the VSM: the same intuitive geometric interpretation has been re-applied, beyond relevance feedback, in different contexts:
   - automatic document categorization
   - automatic document filtering
   - document clustering
   - term-term similarity computation (terms are indexed by documents: the dual problem)
2. The possibility of attributing weights to the IREPs of both documents and queries, thus allowing automatically indexed document bases.
3. Ranked output and output-magnitude control.
4. Automatic query acquisition.
5. No "output flattening": each query term contributes to the ranking individually, according to its weight.
6. Much better effectiveness than the Boolean Model (BM) and the Fuzzy Logic Model (FLM).

Disadvantages of the VSM

Structured queries cannot be formulated: there are no operators such as
- Or (synonyms or quasi-synonyms)
- And (compulsory occurrence of index terms)
- Not (compulsory absence of index terms)
- Prox (specification of noun phrases)

The VSM is based on the hypothesis that the terms are pairwise stochastically independent (the binary independence hypothesis). A more recent extension of the VSM relaxes this hypothesis, allowing the Cartesian axes to be non-orthogonal.

On Similarity Measures

Any two arbitrary objects are equally similar unless we use domain knowledge! Common choices:
- inner product
- Jaccard and Dice similarity
- overlap coefficient
- conversion from a distance
- kernel functions (beyond vectors)

For a document d_i and a query q_j (sets of terms in the binary case, weight vectors in the non-binary case):

  Measure         Binary case                        Non-binary case
  Inner product   |d_i ∩ q_j|                        d_i · q_j
  Dice            2|d_i ∩ q_j| / (|d_i| + |q_j|)     2(d_i · q_j) / (‖d_i‖² + ‖q_j‖²)
  Jaccard         |d_i ∩ q_j| / |d_i ∪ q_j|          (d_i · q_j) / (‖d_i‖² + ‖q_j‖² − d_i · q_j)
  Overlap coef.   |d_i ∩ q_j| / min(|d_i|, |q_j|)    (d_i · q_j) / min(‖d_i‖², ‖q_j‖²)
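A sketch of the non-binary coefficients from the table, applied to made-up weight vectors:

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def sq_norm(a):
    return dot(a, a)  # squared Euclidean norm ||a||^2

def dice(d, q):
    return 2 * dot(d, q) / (sq_norm(d) + sq_norm(q))

def jaccard(d, q):
    return dot(d, q) / (sq_norm(d) + sq_norm(q) - dot(d, q))

def overlap(d, q):
    return dot(d, q) / min(sq_norm(d), sq_norm(q))

d, q = [1.0, 2.0, 0.0], [2.0, 1.0, 1.0]  # hypothetical weight vectors
print(dice(d, q), jaccard(d, q), overlap(d, q))  # 0.727..., 0.571..., 0.8
```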

Conversion from a distance

The Minkowski distances are defined as

$$L_p(x,z) = \left( \sum_{i=1}^{n} |x_i - z_i|^p \right)^{1/p}$$

When p = ∞, L_∞(x,z) = max_i |x_i − z_i|.

A similarity measure taking values in (0, 1] can always be defined as

$$s_{p,\lambda}(x,z) = e^{-\lambda L_p(x,z)}$$

where λ ∈ (0, +∞) is a constant parameter.

Kernel functions

A kernel function K(x,z) is a (generally non-linear) function which corresponds to an inner product in some expanded feature space, i.e.

$$K(x,z) = \varphi(x) \cdot \varphi(z)$$

Example: for 2-dimensional vectors x = (x_1, x_2),

$$K(x,z) = (1 + x \cdot z)^2$$

is a kernel, with feature map

$$\varphi(x) = \left(1,\; x_1^2,\; \sqrt{2}\,x_1 x_2,\; x_2^2,\; \sqrt{2}\,x_1,\; \sqrt{2}\,x_2\right)$$
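Both ideas are easy to check numerically; a sketch (the points x and z are arbitrary):

```python
import math

# Distance-to-similarity conversion: s(x, z) = exp(-lam * L_p(x, z)).
def minkowski(x, z, p):
    if p == math.inf:
        return max(abs(a - b) for a, b in zip(x, z))
    return sum(abs(a - b) ** p for a, b in zip(x, z)) ** (1.0 / p)

def similarity(x, z, p=2, lam=1.0):
    return math.exp(-lam * minkowski(x, z, p))  # in (0, 1], equal to 1 iff x == z

# The explicit feature map behind K(x, z) = (1 + x.z)^2 in two dimensions.
def phi(x):
    x1, x2 = x
    s = math.sqrt(2)
    return (1.0, x1 * x1, s * x1 * x2, x2 * x2, s * x1, s * x2)

def K(x, z):
    return (1.0 + x[0] * z[0] + x[1] * z[1]) ** 2

x, z = (1.0, 2.0), (3.0, -1.0)
print(similarity(x, z))  # the distance turned into a similarity in (0, 1]
lhs = K(x, z)
rhs = sum(a * b for a, b in zip(phi(x), phi(z)))
assert abs(lhs - rhs) < 1e-9  # the kernel really is an inner product in feature space
```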

Other kernels

Polynomial kernels:

$$K(x,z) = (x \cdot z + u)^p$$

RBF kernels:

$$K(x,z) = e^{-\lambda \|x - z\|^2}$$

There are also kernels for structured data: strings, sequences, trees, graphs, and so on.

Polynomial kernels allow modelling features that are conjunctions of input features (up to the degree of the polynomial). The Radial Basis Function kernel is equivalent to mapping into an infinite-dimensional Hilbert space. A string kernel allows features that are subsequences of words.
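A short sketch of the two kernel families (parameter values are arbitrary):

```python
import math

def poly_kernel(x, z, u=1.0, p=3):
    """Polynomial kernel K(x, z) = (x.z + u)^p."""
    return (sum(a * b for a, b in zip(x, z)) + u) ** p

def rbf_kernel(x, z, lam=0.5):
    """RBF kernel K(x, z) = exp(-lam * ||x - z||^2)."""
    sq = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-lam * sq)

x, z = (1.0, 0.5), (0.2, -0.3)
print(poly_kernel(x, z))  # features: conjunctions of inputs up to degree p
print(rbf_kernel(x, z))   # rbf_kernel(x, x) == 1.0 for any x
```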