Text Analytics
Modeling Information Retrieval 2
Ulf Leser
Content of this Lecture
IR Models
- Boolean Model
- Vector Space Model
- Relevance Feedback in the VSM
- Probabilistic Model
- Latent Semantic Indexing
- Other IR Models
A Probabilistic Interpretation of Relevance
- We want to compute the probability that a doc d is relevant to query q
- The probabilistic model determines this probability iteratively, using user (or automatic) feedback
  - Similar to the VSM with relevance feedback
  - But a different and more principled way of incorporating feedback
- Assume there is a subset R ⊆ D which contains all and only the relevant documents for q, and R̄ = D \ R
- For each document d, we want to compute the probability that d belongs to R for q
- This is based on the words in d, i.e., we represent d as the set of words contained in it: d = {k_i}
A Probabilistic Interpretation of Relevance (cont.)
- Since words k_i appear both in relevant and in irrelevant docs, we need to look at the influence of both
- We use odds scores:

  sim(d,q) = P(rel|d) \sim \frac{P(R|d)}{P(\bar{R}|d)} = \frac{P(R|k_1,\dots,k_n)}{P(\bar{R}|k_1,\dots,k_n)}

- Assuming statistical independence of words (clearly wrong), we get:

  sim(d,q) \sim \frac{P(R|k_1) \cdot \ldots \cdot P(R|k_n)}{P(\bar{R}|k_1) \cdot \ldots \cdot P(\bar{R}|k_n)}
Binary Independence Model I
- Using Bayes' Theorem:

  sim(d,q) = \frac{P(R|d)}{P(\bar{R}|d)} = \frac{P(d|R) \cdot P(R)}{P(d|\bar{R}) \cdot P(\bar{R})} \sim \frac{P(d|R)}{P(d|\bar{R})}

- P(R) (P(R̄)): relative frequency of relevant (irrelevant) docs in D, i.e., the a-priori probability of a doc being (ir)relevant
- P(R) and P(R̄) are independent of d and thus constant for q and irrelevant for ranking documents
- P(d|R) is the probability of drawing the combination of words forming d when drawing words at random from R
Binary Independence Model II
- P(d|R): drawing the words of d from R means two things: drawing the words that are in d, and not drawing the words that are not in d
- The BIM considers both, plus independence of terms:

  sim(d,q) \sim \frac{\prod_{k_i \in d} P(k_i|R) \cdot \prod_{k_i \notin d} (1 - P(k_i|R))}{\prod_{k_i \in d} P(k_i|\bar{R}) \cdot \prod_{k_i \notin d} (1 - P(k_i|\bar{R}))}

- Properties
  - Having words that are frequent in R raises the similarity to R
  - Not having words that are frequent in R̄ raises the similarity to R
- Why is q not in the formula?
Continuation
- Rephrasing, focusing on the query terms: in a real setting we are not sure about R and give less weight to terms not in the query; restricting the products to query terms also drastically increases performance

  sim(d,q) \sim \prod_{k_i \in q \cap d} \frac{P(k_i|R)}{P(k_i|\bar{R})} \cdot \prod_{k_i \in q \setminus d} \frac{1 - P(k_i|R)}{1 - P(k_i|\bar{R})}
Last Step
- Some reformulating, duplicating the factors for the terms in q ∩ d:

  sim(d,q) \sim \underbrace{\prod_{k_i \in q \cap d} \frac{P(k_i|R)}{P(k_i|\bar{R})}}_{\text{matching terms}} \cdot \underbrace{\prod_{k_i \in q \setminus d} \frac{1 - P(k_i|R)}{1 - P(k_i|\bar{R})}}_{\text{non-matching terms}}
         = \underbrace{\prod_{k_i \in q \cap d} \frac{P(k_i|R) \cdot (1 - P(k_i|\bar{R}))}{P(k_i|\bar{R}) \cdot (1 - P(k_i|R))}}_{\text{matching terms}} \cdot \underbrace{\prod_{k_i \in q} \frac{1 - P(k_i|R)}{1 - P(k_i|\bar{R})}}_{\text{all query terms}}
Continuation 2
- Obviously, the last product is identical for all docs. Thus:

  sim(d,q) \sim \prod_{k_i \in q \cap d} \frac{P(k_i|R) \cdot (1 - P(k_i|\bar{R}))}{P(k_i|\bar{R}) \cdot (1 - P(k_i|R))}

- sim(d,q) = probability that a document comprising the terms of d is relevant to query q
- But: computing sim(d,q) requires knowledge of R and R̄
  - If R and R̄ were known, we could estimate P(k_i|R) and P(k_i|R̄) using maximum likelihood, i.e., by computing relative frequencies of terms in R and R̄
  - In reality, we actually want to find R (and R̄)
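This final retrieval status value is straightforward to implement. A minimal sketch (the function name, the set-based document representation, and the dict-based estimates are my own conventions, not from the slides):

```python
def bim_score(doc_terms, query_terms, p, q):
    """Binary Independence Model score: the product, over all query terms
    occurring in the doc, of p_i*(1-q_i) / (q_i*(1-p_i)),
    where p[k] = P(k|R) and q[k] = P(k|R_bar)."""
    score = 1.0
    for k in query_terms & doc_terms:      # only matching terms contribute
        score *= (p[k] * (1.0 - q[k])) / (q[k] * (1.0 - p[k]))
    return score
```

In practice one would sum logarithms instead of multiplying, to avoid numerical overflow; the products on these slides grow quickly, as the example below shows.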
Back to Reality
- Idea: approximation using an iterative process
- Start with some educated guess for R and set R̄ = D \ R
  - E.g., retrieve all docs containing at least one word from q
- Compute a probabilistic ranking of all docs wrt q based on this first guess
  - Here it is important to focus on the terms in q
- Choose relevant docs by user feedback, or hopefully relevant docs by selecting the top-r docs
- This gives new sets R and R̄
  - If top-r docs are chosen, we may choose to only change the probabilities of terms in q and disregard the questionable negative information
- Compute new term scores and a new ranking
- Iterate until satisfied
- [A variant of the Expectation Maximization (EM) algorithm]
Initialization
- Typical simplifying assumptions for the start
  - Terms in non-relevant docs are equally distributed: P(k_i|R̄) ≈ f_i/|D|, with f_i the document frequency of k_i
  - P(k_i|R) is constant, e.g., = 0.5
  - Much less computation; less weight on presumably unstable first values
- Iterations: assume we have a new R (and R̄ = D \ R):

  P(k_i|R) = \frac{|\{d \mid d \in R, k_i \in d\}|}{|R|}, \qquad P(k_i|\bar{R}) = \frac{f_i - |\{d \mid d \in R, k_i \in d\}|}{|D| - |R|}
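Both the start-up guess and the iteration step translate directly into code. A hedged sketch continuing the conventions of bim_score above (docs as sets of terms; doc_freq maps each term to f_i; clamping the estimates to [0.01, 0.99] mirrors the smoothing used in the example that follows):

```python
def initial_estimates(query_terms, doc_freq, n_docs, eps=0.01):
    """First guess from the slide: P(k|R) = 0.5, P(k|R_bar) ~ f_k/|D|,
    smoothed away from 0."""
    p = {k: 0.5 for k in query_terms}
    q = {k: max(doc_freq.get(k, 0) / n_docs, eps) for k in query_terms}
    return p, q

def update_estimates(query_terms, docs, relevant, doc_freq, eps=0.01):
    """One feedback iteration: maximum-likelihood re-estimation of
    P(k|R) and P(k|R_bar) from the current guess R ('relevant')."""
    n_docs, n_rel = len(docs), len(relevant)
    p, q = {}, {}
    for k in query_terms:
        r_k = sum(1 for j in relevant if k in docs[j])  # relevant docs with k
        p[k] = min(max(r_k / n_rel, eps), 1.0 - eps)
        q[k] = min(max((doc_freq.get(k, 0) - r_k) / (n_docs - n_rel), eps),
                   1.0 - eps)
    return p, q
```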
Example Data
- d1: "Wir verkaufen Häuser in Italien" (We sell houses in Italy)
- d2: "Häuser mit Gärten zu vermieten" (Houses with gardens for rent)
- d3: "Häuser: In Italien, um Italien, um Italien herum" (Houses: in Italy, around Italy, all around Italy)
- d4: "Die italienischen Gärtner sind im Garten" (The Italian gardeners are in the garden)
- d5: "Um unser italienisches Haus blüht's" (Around our Italian house everything is blooming)
- d6: "Wir verkaufen Blühendes" (We sell blooming things)
- Q: "Wir wollen ein Haus mit Garten in Italien mieten" (We want to rent a house with a garden in Italy)

Term-document incidence over the stemmed terms:

       verkauf  haus  italien  gart  miet  blüh  woll
  d1      1      1       1      0     0     0     0
  d2      0      1       0      1     1     0     0
  d3      0      1       1      0     0     0     0
  d4      0      0       1      1     0     0     0
  d5      0      1       1      0     0     1     0
  d6      1      0       0      0     0     1     0
  Q       0      1       1      1     1     0     1
Example: Initialization
- All docs with at least one word from q: R = {d1, d2, d3, d4, d5}, R̄ = {d6}
- Start with the initial estimates P(k_i|R) = 0.5 and P(k_i|R̄) = f_i/|D|, e.g., q(verkauf) = q(blüh) = 2/6
  - Note that none of the query terms occurs in R̄ = {d6}, so their P(k_i|R̄) estimates are 0
- Smoothing: if an estimate X = 0, set X = 0.01
- Compute the initial ranking:

  sim(d1,q) = [p(haus)·(1−q(haus)) · p(italien)·(1−q(italien))] / [q(haus)·(1−p(haus)) · q(italien)·(1−p(italien))]
            = (0.5·0.99 · 0.5·0.99) / (0.01·0.5 · 0.01·0.5) = 99² = 9801

  sim(d2,q) = 99³ = 970299
  sim(d3,q) = sim(d4,q) = sim(d5,q) = 9801
  sim(d6,q) = 0 (no query term matches)
Example: Adjustment
- Update formulas:

  P(k_i|R) = \frac{|\{d \mid d \in R, k_i \in d\}|}{|R|}, \qquad P(k_i|\bar{R}) = \frac{f_i - |\{d \mid d \in R, k_i \in d\}|}{|D| - |R|}

- Let's use the top-2 docs as the new R; the second is chosen arbitrarily among d1, d3, d4, d5 (all tied at 9801)
- R = {d1, d2}, R̄ = {d3, d4, d5, d6}
- Adjusted scores:
  - p(verkauf) = 1/2 = 0.5; q(verkauf) = (2−1)/(6−2) = 1/4
  - p(haus) = 2/2 = 1 ≈ 0.99; q(haus) = (4−2)/(6−2) = 2/4
  - p(italien) = 1/2 = 0.5; q(italien) = (4−1)/(6−2) = 3/4
  - p(gart) = 1/2 = 0.5; q(gart) = (2−1)/(6−2) = 1/4
  - p(miet) = 1/2 = 0.5; q(miet) = (1−1)/(6−2) = 0 ≈ 0.01
Example: Re-Ranking

  sim(d,q) \sim \prod_{k_i \in q \cap d} \frac{p_i \cdot (1 - q_i)}{q_i \cdot (1 - p_i)}

- New ranking:

  sim(d1,q) = [p(haus)·(1−q(haus)) / (q(haus)·(1−p(haus)))] · [p(italien)·(1−q(italien)) / (q(italien)·(1−p(italien)))]
            = 99 · 1/3 = 33
  sim(d2,q) = 99 · 3 · 99 = 29403 (haus, gart, miet)
  sim(d3,q) = sim(d5,q) = 33, sim(d4,q) = 1

- d2, the only doc matching three query terms, now clearly ranks first
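As a check, the whole adjustment and re-ranking step can be reproduced with the two sketches above (reusing bim_score and update_estimates; the woll estimates, smoothed to 0.01, are my addition, but the term never matches a doc, so it does not affect the scores):

```python
docs = {1: {"verkauf", "haus", "italien"}, 2: {"haus", "gart", "miet"},
        3: {"haus", "italien"}, 4: {"italien", "gart"},
        5: {"haus", "italien", "blüh"}, 6: {"verkauf", "blüh"}}
query = {"haus", "italien", "gart", "miet", "woll"}
doc_freq = {"verkauf": 2, "haus": 4, "italien": 4, "gart": 2,
            "miet": 1, "blüh": 2, "woll": 0}

p, q = update_estimates(query, docs, relevant={1, 2}, doc_freq=doc_freq)
for j in sorted(docs):
    print(j, round(bim_score(docs[j], query, p, q), 2))
# -> d2: 29403.0; d1 = d3 = d5 = 33.0; d4: 1.0; d6: 1.0 (empty product)
```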
Pros and Cons
- Advantages
  - Sound probabilistic framework (note that the VSM is strictly heuristic: what is the justification for those distance measures?)
  - Results converge to the most probable docs, under the assumption that relevant docs are similar in sharing term distributions that differ from the distributions in irrelevant docs
- Disadvantages
  - First guesses often are pretty bad: slow convergence
  - Terms cannot be weighted (w_ij ∈ {0,1})
  - Assumes statistical independence of terms (as most methods do)
  - Has never worked convincingly better in practice [MS07]
Probabilistic Model versus VSM with Rel. Feedback
- Comparison published 1990 by Salton & Buckley, based on various corpora
- Measured: improvement after 1 feedback iteration
- The probabilistic model (BIM) is in general worse than VSM + relevance feedback (IDE)
  - The probabilistic model does not weight terms in documents
  - The probabilistic model does not allow weighting terms in queries
Content of this Lecture
IR Models
- Boolean Model
- Vector Space Model
- Relevance Feedback in the VSM
- Probabilistic Model
- Latent Semantic Indexing
- Other IR Models
Latent Semantic Indexing
- So far we ignored semantic relationships between terms
  - Homonyms: bank (money, river)
  - Synonyms: house, building, hut, villa, ...
  - Hyperonyms: officer / lieutenant
- Idea of Latent Semantic Indexing (LSI)
  - Deerwester, S., Dumais, S. T., Furnas, G. W., Landauer, T. K. and Harshman, R. (1990). "Indexing by Latent Semantic Analysis." Journal of the American Society for Information Science 41(6): 391-407. (>5000 citations!)
  - Map many terms into fewer semantic concepts, which are hidden ("latent") in the docs
  - Compare docs and query in concept space instead of term space
- One big advantage: can find docs that don't even contain the query terms
Terms and Concepts (source: K. Aberer, IR)
- Concepts are more abstract than terms
- Concepts are more or less related to terms and to docs
- LSI finds concepts automatically by matrix manipulations
- A concept will be a set of frequently co-occurring terms
- Concepts from LSI cannot be spelled out; they are just matrix columns
Term-Document Matrix
- Definition: The term-document matrix M for docs D and terms K has n = |D| columns and m = |K| rows; M[i,j] = 1 iff document j contains term i
- Works equally well for TF or TF*IDF values
Term-Document Matrix and VSM
- The matrix we used in the VSM was a transposed (document-term) matrix: M^t
- Having M, we can compute the vector v containing the VSM scores of all docs given q as v = M^t · q (ignoring score normalization)
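In code, both the matrix and the score computation are one-liners with numpy. A sketch using the German example data from the probabilistic part (binary weights; the zero row for woll reflects that the term occurs only in the query):

```python
import numpy as np

terms = ["verkauf", "haus", "italien", "gart", "miet", "blüh", "woll"]
#              d1 d2 d3 d4 d5 d6
M = np.array([[1, 0, 0, 0, 0, 1],    # verkauf
              [1, 1, 1, 0, 1, 0],    # haus
              [1, 0, 1, 1, 1, 0],    # italien
              [0, 1, 0, 1, 0, 0],    # gart
              [0, 1, 0, 0, 0, 0],    # miet
              [0, 0, 0, 0, 1, 1],    # blüh
              [0, 0, 0, 0, 0, 0]])   # woll
q = np.array([0, 1, 1, 1, 1, 0, 1])  # "haus italien gart miet woll"

v = M.T @ q      # unnormalized VSM scores of all six documents
print(v)         # -> [2 3 2 2 2 0]: d2 matches the most query terms
```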
What to do with a Term-Document Matrix
- M is not just a comfortable way of representing the term vectors of all documents: M is a matrix, and linear algebra offers many ways to manipulate matrices
- In the following, we approximate M by an M'
  - M' should be smaller than M in a certain sense: fewer dimensions, faster computations
  - M' should abstract from terms to concepts: the dimensions that are dropped capture only the least frequent co-occurrences
  - M' should be such that M^t·q ≈ M'^t·q, producing the least error among all M' of the same dimension
- Note: we only sketch LSI here
Term and Document Correlation
- M·M^t is called the term correlation matrix
  - Has |K| columns and |K| rows
  - Similarity of terms: how often do they co-occur in a doc?
- M^t·M is called the document correlation matrix
  - Has |D| columns and |D| rows
  - Similarity of docs: how many terms do they share?
- Example (terms A-D, docs 1-5):

          d1 d2 d3 d4 d5                       A  B  C  D
     A     1  1  1  0  0                  A    3  3  2  0
     B     1  1  1  0  1      M·M^t  =    B    3  4  2  1
     C     0  1  1  0  0                  C    2  2  2  0
     D     0  0  0  1  1                  D    0  1  0  2
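The example can be verified directly (the exact placement of the 1s in M is my reconstruction; it is consistent with the correlation values shown above):

```python
import numpy as np

#              d1 d2 d3 d4 d5
M = np.array([[1, 1, 1, 0, 0],    # A
              [1, 1, 1, 0, 1],    # B
              [0, 1, 1, 0, 0],    # C
              [0, 0, 0, 1, 1]])   # D

print(M @ M.T)   # 4x4 term correlations, e.g. A and B co-occur in 3 docs
print(M.T @ M)   # 5x5 document correlations: shared terms per doc pair
```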
Some Linear Algebra to Remember
- Let M be a matrix
  - The rank r(M) of M is the maximal number of linearly independent rows of M (its dimension)
  - If M·x = λ·x for some x ≠ 0, then λ is called an eigenvalue of M and x its associated eigenvector
- Eigenvectors/eigenvalues are useful for many things
  - In particular, one can show that a matrix M can be transformed into a diagonal matrix L with L = U^{-1}·M·U, where U is formed from the eigenvectors of M; but only if M has enough eigenvectors
  - Such an L is called similar to M; L represents M in another vector space, based on another basis
  - L can be used in many cases instead of M and is easier to handle
- However, our M usually will not have enough eigenvectors
Singular Value Decomposition (SVD)
- SVD is a method to decompose any matrix M in the following way:

  M = X · S · Y^t

  - S is the diagonal matrix of the singular values of M in descending order and has size r×r
  - X is the matrix of eigenvectors of M·M^t
  - Y is the matrix of eigenvectors of M^t·M
- This decomposition is unique and can be computed in O(r^3)
- Sizes: M is m×n (m = |K|, n = |D|), X is m×r, S is r×r, Y^t is r×n
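numpy computes the decomposition directly; a small sketch on the term-correlation example from before:

```python
import numpy as np

M = np.array([[1., 1, 1, 0, 0],
              [1, 1, 1, 0, 1],
              [0, 1, 1, 0, 0],
              [0, 0, 0, 1, 1]])

X, s, Yt = np.linalg.svd(M, full_matrices=False)   # M = X @ diag(s) @ Yt
print(s)                                           # singular values, descending
print(np.allclose(M, X @ np.diag(s) @ Yt))         # -> True
```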
Example
- Assume for now that M is quadratic and has full rank; example for r = k = d = 3:

  M_11 = (x_11·s_11 + x_12·s_21 + x_13·s_31)·y_11
       + (x_11·s_12 + x_12·s_22 + x_13·s_32)·y_21
       + (x_11·s_13 + x_12·s_23 + x_13·s_33)·y_31
       = x_11·s_11·y_11 + x_12·s_22·y_21 + x_13·s_33·y_31   (since S is diagonal)

  M_12 = ...
General Case
- M not quadratic; r ≤ min(k, d); all sums range from 1 to r:

  M[i,j] = \sum_{l=1}^{r} X_{il} \cdot S_{ll} \cdot (Y^t)_{lj}

- LSI idea: what if we stop the sums earlier, at some s < r?
  - The s_ll are sorted by descending value
  - Aggregating only over the first s singular values captures most of M
Approximating M
- The SVD can be used to approximate M
- Fix some s < r; compute M_s = X_s · S_s · Y_s^t
  - X_s: the first s columns of X
  - S_s: the first s columns and first s rows of S
  - Y_s^t: the first s rows of Y^t
- M_s has the same size as M, but different values
- For LSI, we don't need to compute M_s; we only need X_s, S_s and Y_s
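Continuing the SVD snippet above, the rank-s truncation is just slicing (s_dim is my variable name for the s of the slide):

```python
s_dim = 2                                  # keep only the top-2 singular values
Xs, Ss, Yts = X[:, :s_dim], np.diag(s[:s_dim]), Yt[:s_dim, :]

Ms = Xs @ Ss @ Yts                         # same shape as M, but rank 2
print(np.round(Ms, 2))                     # a "blurred" version of M
print(np.linalg.norm(M - Ms, 2))           # minimal error among rank-2 matrices
```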
s-aroximations Since the s ii are sorte in ecreasing orer The aroximation is the better, the larger s The comutation is the faster, the smaller s LSI: Only consier the to-s singular values s must be small enough to filter out noise an to rovie semantic reuction s must be large enough to reresent the iversity in the ocuments Tyical value: 200-500 Otimality: After SVD, M is the matrix where M-M 2 is the smallest Ulf Leser: Text Analytics, Winter Semester 2010/2011 30
Geometric Interpretation of SVD
- M is a linear transformation between vector spaces
- X and Y can be seen as coordinate transformations, S is a linear scaling
  - X transforms into a vector space where the transformation can be represented as a linear scaling, and Y transforms back into the vector space of M
- s-approximation: M is transformed into a vector space of lower dimension such that the new dimensions capture the most important variations in M
  - Distances between vectors are preserved as much as possible
- Universal method: the SVD has many more applications than IR
LSI for Information Retrieval
- We map document vectors from an m-dimensional space into an s-dimensional space
  - Approximated docs are still represented by columns in Y_s^t
- Variations between document vectors are determined by the number of terms they have in common
  - The more terms in common, the smaller the distance
- SVD tries to preserve these distances; to this end, it in a way maps frequently co-occurring terms to the same dimensions
  - Because frequently co-occurring terms have little impact on distance
- Frequently co-occurring terms can be interpreted as concepts, but they cannot easily be named
  - Also, we cannot simply determine the terms that are merged into a new dimension; it is always a bit of everything (a linear combination)
Query Evaluation
- After LSI, docs are represented by columns in Y_s^t
- How can we compute the distance between a query q and a doc in concept space? We first need to represent q in concept space
- Assume q as a new column in M
  - Of course, we can transform M offline, but we need to transform q online
  - This would generate a new column in Y_s^t
- To compute only this column, we apply the same transformations to q as we did to all other columns of M
- With a little algebra, we get:

  q' = q^t · X_s · S_s^{-1}

- This vector is compared to the doc vectors as usual
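Folding a query into the concept space, again continuing the numpy snippets above (the query vector is a hypothetical one over the four terms A-D of the toy matrix):

```python
q = np.array([0., 1, 1, 0])          # hypothetical query: terms B and C
q_s = q @ Xs @ np.linalg.inv(Ss)     # q' = q^t X_s S_s^(-1), an s-dim vector

for j in range(Yts.shape[1]):        # compare q' to the doc columns of Y_s^t
    d = Yts[:, j]
    cos = (q_s @ d) / (np.linalg.norm(q_s) * np.linalg.norm(d))
    print(f"d{j+1}: cos = {cos:.3f}")
```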
Example: Term-Document Matrix
- Taken from Mi Islita: Tutorials on SVD & LSI (http://www.miislita.com/information-retrieval-tutorial/svd-lsi-tutorial-1-understanding.html), who took it from the Grossman and Frieder book
- Query: "gold silver truck"
- [Figure: the term-document matrix M of the example]
Singular Value Decomposition
- M = X · S · Y^t
- [Figure: the numeric matrices X, S, Y and Y^t for the example]
A Two-Approximation (s = 2)
- [Figure: X_2, S_2, Y_2 and Y_2^t; the documents d1, d2, d3 in the 2-dimensional concept space]
Transforming the Query
- q' = q^t · X_2 · S_2^{-1}
- [Figure: the numeric computation of q']
Computing the Cosine of the Angle
- [Figure: cosine similarities between q' and the three document vectors]
Visualization of Results in 2D
- [Figure: docs, terms and the query plotted in the 2-dimensional concept space]
Pros and Cons
- Pro
  - Made it into practice; used by real search engines
  - Speed-up through computation with fewer dimensions
  - Increases recall (and usually decreases precision)
- Contra
  - Computing the SVD is expensive
    - Fast approximations exist, especially for extremely sparse matrices
    - Use stemming, stop-word removal etc. to shrink the original matrix
  - Ranking requires fewer dimensions than before, but more than the number of query terms: every query needs to be mapped first, turning a few keywords into a dense s-dimensional vector
  - We cannot simply index the concepts of M_s using inverted files etc.; thus, LSI needs other techniques than indexing (read: lots of memory)
Content of this Lecture
IR Models
- Boolean Model
- Vector Space Model
- Relevance Feedback in the VSM
- Probabilistic Model
- Latent Semantic Indexing
- Other IR Models
Extended Boolean Model
- One critique of the Boolean model: if one term out of 10 is missing, the result is the same as if all 10 were missing
- Idea: measure the distance of the document to each conjunctive/disjunctive subterm of the query expression
- Example: x-ary AND: use a projection into x-dimensional space
  - The query expression is the point (1, 1, 1, ..., 1)
  - The doc is (a_1, a_2, ..., a_x) with a_i ∈ {0, 1} (or term weights)
  - Similarity is the distance between these two points
- Similar formulas for OR and NOT
- Using the appropriate definition of distance, the extended Boolean model can mimic both the Boolean model and the VSM (see the sketch below)
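A minimal sketch of the most common instantiation, the p-norm model of Salton, Fox and Wu (the function names are mine; weights a_i in [0,1]): AND measures the normalized distance to the ideal point (1,...,1), OR the distance from (0,...,0). With p → ∞ this degenerates to the strict Boolean model; with p = 1 both reduce to the average weight, i.e., a VSM-like inner product.

```python
def sim_and(weights, p=2.0):
    """Extended Boolean AND: 1 - normalized L_p distance to (1,...,1)."""
    x = len(weights)
    return 1.0 - (sum((1.0 - a) ** p for a in weights) / x) ** (1.0 / p)

def sim_or(weights, p=2.0):
    """Extended Boolean OR: normalized L_p distance from (0,...,0)."""
    x = len(weights)
    return (sum(a ** p for a in weights) / x) ** (1.0 / p)

# Missing 1 of 10 AND-ed terms no longer scores like missing all 10:
print(sim_and([1.0] * 9 + [0.0]))   # ~0.68
print(sim_and([0.0] * 10))          # 0.0
```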
Generalized Vector Space Model
- One critique of the VSM: terms are not independent; thus, term vectors cannot be assumed to be orthogonal
- Generalized Vector Space Model
  - Build a much larger vector space with 2^|K| dimensions
  - Each dimension (minterm) stands for all docs containing a particular set of terms
  - Term vectors are no longer orthogonal; they are correlated through term co-occurrences, as captured by the minterms
  - Convert query and docs into minterm space
  - Finally, sim(d,q) is the cosine of the angle in minterm space
- Nice theory that includes term co-occurrence, but much more complex than the ordinary VSM, and with no proven advantage
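A hedged sketch of one classic formulation (Wong, Ziarko and Wong, 1985), under the simplification that only minterms actually occurring in the collection are materialized; the function name and data layout are mine, and it works with any binary term-document matrix, e.g. the 7×6 matrix and query vector from the VSM snippet earlier:

```python
import numpy as np

def gvsm_sim(M, q):
    """Generalized VSM: docs with identical term sets share one minterm;
    each term becomes the normalized bundle of the minterms it occurs in."""
    m, n = M.shape
    patterns = sorted({tuple(M[:, j]) for j in range(n)})  # occurring minterms
    idx = {pat: l for l, pat in enumerate(patterns)}
    # c[i, l]: weight of term i summed over all docs having minterm l
    c = np.zeros((m, len(patterns)))
    for j in range(n):
        c[:, idx[tuple(M[:, j])]] += M[:, j]
    norms = np.linalg.norm(c, axis=1, keepdims=True)
    K = np.divide(c, norms, out=np.zeros_like(c), where=norms > 0)
    D, qv = K.T @ M, K.T @ q             # docs and query in minterm space
    return (qv @ D) / (np.linalg.norm(qv) * np.linalg.norm(D, axis=0) + 1e-12)
```

Materializing all 2^|K| minterms is infeasible; restricting to the term-set patterns that actually occur keeps the sketch linear in |D|. Compared with the plain VSM scores, documents now also collect similarity through terms that are correlated with the query terms.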