
Vocabulary Problem in Internet Resource Discovery

Technical Report USC-CS

Shih-Hao Li and Peter B. Danzig
Computer Science Department
University of Southern California
Los Angeles, California
{shli, danzig}@cs.usc.edu

Abstract

When searching for information in a retrieval system, people use a variety of terms to describe their information needs. When the terms used in a query differ from those indexed by the system, users fail to obtain the information they want. This is called the vocabulary problem. The problem has been studied and discussed in information retrieval for decades. Recently, Deerwester et al. proposed a new technique based on singular value decomposition and obtained promising results. In this paper, we describe how to apply this technique to Internet resource discovery.

Index Terms - information retrieval, polysemy, resource discovery, synonymy, vocabulary problem.

1 Introduction

When searching for information in a retrieval system, people use a variety of terms to describe their information needs. When the terms used in a query differ from those indexed by the system, users fail to obtain the information they want. This is called the vocabulary problem [1]. In general, there are two types of vocabulary problem: the synonymy problem and the polysemy problem [2]. Both occur because people's backgrounds and problem contexts differ. Synonymy refers to the fact that people describe the same information using different terms. Polysemy refers to the fact that people use the same term with different meanings.

To solve the vocabulary problem, Deerwester et al. propose Latent Semantic Indexing (LSI) [2]. They assume there is some underlying semantic structure in the pattern of term usage across documents and use Singular Value Decomposition (SVD) to capture this structure. LSI allows users to search documents based on their concepts or meaning rather than simple query terms. LSI has been tested on several information systems with promising results [2].

LSI performs well in traditional information retrieval systems. However, it cannot be directly applied to the Internet, where people search for information on servers all over the world. In this paper, we propose a new solution based on LSI to address the vocabulary problem in Internet resource discovery [3]. Our method clusters servers according to their contents and allows users to select the servers they are interested in using their favorite taxonomies.

2 Latent Semantic Indexing

LSI is an extension of Salton's Vector Space Model (VSM) [4], in which documents and queries are represented as vectors of the form

    d_i = (a_{i1}, a_{i2}, \ldots, a_{im}),  \quad  q_j = (b_{j1}, b_{j2}, \ldots, b_{jm}),

where m is the number of distinct terms in the database, and the coefficients a_{iw} and b_{jw} represent the weight or frequency of term t_w (1 <= w <= m) in document d_i and query q_j, respectively. A database of n documents is represented as an m x n term-document matrix. The similarity between d_i and q_j is based on the number of their matching terms and is calculated by the cosine coefficient [5],

    Sim(d_i, q_j) = \frac{\sum_{w=1}^{m} a_{iw} b_{jw}}{\sqrt{\sum_{w=1}^{m} a_{iw}^2 \; \sum_{w=1}^{m} b_{jw}^2}}.

To capture the semantic structure among documents, LSI applies SVD to the m x n term-document matrix and generates vectors of k (typically 100 to 300) orthogonal indexing dimensions, where each dimension represents a linearly independent concept. The decomposed vectors represent both documents and terms in the same semantic space, and their values indicate the degrees of association with the k underlying concepts. Figure 1 shows SVD applied to a term-document matrix.

[Figure 1: SVD applied to an m x n term-document matrix, where m and n are the numbers of terms and documents in the database, and k is the number of orthogonal indexing dimensions used by SVD. The decomposition yields an m x k term matrix and an n x k document matrix.]

A query vector in LSI is the weighted sum of its component term vectors. To determine the relevant documents, the query is compared with all documents, and those with the highest cosine coefficient are returned. Because k is chosen much smaller than the number of terms and documents in the database (i.e., the number of rows and columns of the term-document matrix), the k concepts are neither term nor document frequencies but compressed forms of both. Therefore, a query can hit documents that share no terms with it but share common concepts. Example 1 describes query processing in both VSM and LSI.
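To make this machinery concrete, the following minimal numpy sketch (our illustration, not code from the report) implements the two pieces defined above: the cosine coefficient and the rank-k SVD decomposition. The function names and the choice to scale both term and document coordinates by the singular values are our own assumptions; other scalings appear in the LSI literature.

```python
import numpy as np

def cosine(a, b):
    """Cosine coefficient Sim(a, b) between two vectors."""
    return (a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))

def lsi_decompose(A, k):
    """Apply a rank-k SVD to an m x n term-document matrix A.

    Returns an m x k matrix of term vectors and an n x k matrix of
    document vectors, both living in the same k-dimensional
    semantic space."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    term_vecs = U[:, :k] * s[:k]    # one row per term
    doc_vecs = Vt[:k, :].T * s[:k]  # one row per document
    return term_vecs, doc_vecs

def fold_query(term_indices, term_vecs):
    """Represent a query as the sum of its component term vectors."""
    return term_vecs[term_indices].sum(axis=0)
```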

Example 1 Let q, d_i (1 <= i <= 5), and t_j (1 <= j <= 8) be a user query, a set of documents, and their associated terms in an information system, where

    d_1 = {t_1, t_3, t_4}
    d_2 = {t_1, t_2, t_3}
    d_3 = {t_2, t_3, t_4, t_5}
    d_4 = {t_7, t_8}
    d_5 = {t_5, t_6, t_7}
    q   = {t_2, t_4}.

Table 1 shows their vector representations and cosine similarities in VSM and LSI. Figure 2 shows the 2-dimensional plot of the decomposed document and term vectors in LSI.

    doc/query   description    VSM (t_1 ... t_8)    Sim     LSI (dim1, dim2)   Sim
    d_1         t1 t3 t4       (1,0,1,1,0,0,0,0)    0.408   (-, -)             -
    d_2         t1 t2 t3       (1,1,1,0,0,0,0,0)    0.408   (-, -)             -
    d_3         t2 t3 t4 t5    (0,1,1,1,1,0,0,0)    0.707   (-, -)             -
    d_4         t7 t8          (0,0,0,0,0,0,1,1)    0.000   (-, -)             -
    d_5         t5 t6 t7       (0,0,0,0,1,1,1,0)    0.000   (-, -)             -
    q           t2 t4          (0,1,0,1,0,0,0,0)            (-, -)

Table 1: The vector representations and similarities of documents d_i (1 <= i <= 5) and query q in VSM and LSI. The "(t_1 ... t_8)" and "(dim1, dim2)" columns show the vectors in VSM and LSI, respectively; the "Sim" columns give the cosine similarity between each d_i and q.

In traditional information retrieval systems, a document is relevant to a query if it contains all the terms in the query. For example, document d_3 is relevant to query q because it contains both t_2 and t_4. In VSM, a document is relevant to a query if their similarity is larger than a predefined threshold. If the threshold is 0.7, then d_3 is relevant to q because Sim(d_3, q) > 0.7. Like VSM, LSI uses the cosine similarity measure to determine the relevant documents. Using the same threshold, documents d_1, d_2, and d_3 are all relevant to query q because Sim(d_i, q) > 0.7 for 1 <= i <= 3.

The above example shows that, given the same threshold, LSI returns more relevant documents than VSM. This is because the dimensions of the decomposed vectors are not independent of each other; therefore, two vectors can be relevant without having any common terms.
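As a usage sketch (again our addition, reusing cosine, lsi_decompose, and fold_query from above), the term-document incidence matrix of Example 1 can be fed through both models. The exact LSI coordinates depend on the SVD and term-weighting conventions, so they need not reproduce Table 1 digit for digit; the qualitative outcome, however, should match the discussion above.

```python
import numpy as np
# (uses cosine, lsi_decompose, fold_query from the earlier sketch)

# Term-document incidence matrix for Example 1: rows t1..t8, columns d1..d5.
A = np.array([
    [1, 1, 0, 0, 0],  # t1 appears in d1, d2
    [0, 1, 1, 0, 0],  # t2 appears in d2, d3
    [1, 1, 1, 0, 0],  # t3 appears in d1, d2, d3
    [1, 0, 1, 0, 0],  # t4 appears in d1, d3
    [0, 0, 1, 0, 1],  # t5 appears in d3, d5
    [0, 0, 0, 0, 1],  # t6 appears in d5
    [0, 0, 0, 1, 1],  # t7 appears in d4, d5
    [0, 0, 0, 1, 0],  # t8 appears in d4
], dtype=float)

query_terms = [1, 3]          # q = {t2, t4}, as 0-based row indices
q = np.zeros(8)
q[query_terms] = 1.0

# VSM: compare the raw query vector with every document column.
vsm_sims = [cosine(A[:, j], q) for j in range(A.shape[1])]

# LSI with k = 2: fold the query in as the sum of its term vectors.
term_vecs, doc_vecs = lsi_decompose(A, k=2)
q2 = fold_query(query_terms, term_vecs)
lsi_sims = [cosine(doc_vecs[j], q2) for j in range(A.shape[1])]

for j, (v, l) in enumerate(zip(vsm_sims, lsi_sims), start=1):
    print(f"d{j}: VSM Sim = {v:.3f}, LSI Sim = {l:.3f}")
```

With a threshold of 0.7, VSM retrieves only d_3, while the two-dimensional LSI space should also lift d_1 and d_2 above the threshold, since all three documents load on the same underlying concept.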

[Figure 2: The 2-dimensional plot of the decomposed vectors in Example 1, showing documents d_i (1 <= i <= 5), terms t_j (1 <= j <= 8), and query q in the semantic space. A dashed cone marks the region of documents, such as d_1, d_2, and d_3, that lie within a cosine of 0.7 from query q, i.e., that are relevant to q.]

[Figure 3: (a) Client-Server Model. (b) Client-Directory-Server Model.]

3 Internet Resource Discovery

In a distributed environment, people search for information by sending requests to the associated information servers using a client-server model (Figure 3(a)). In this approach, the client needs to know the server's name or address before sending a request. In the Internet, where thousands of servers provide information, it becomes difficult and inefficient to search all the servers manually.

One step toward solving this problem is to keep a "directory of services" that records a description or summary of each information server. A user sends his query to the directory of services, which determines and ranks the information servers relevant to the user's request. The user employs the rankings when selecting the information servers to query directly. We call this the client-directory-server model (Figure 3(b)).

Internet resource discovery services [3], such as Archie [6], WAIS [7], WWW [8], Gopher [9], and Indie [10], allow users to retrieve information throughout the Internet. All of these systems provide services similar to the client-directory-server model. For example, Archie, WAIS, and Indie support a global index like the directory of services. WWW and Gopher do not have a global index by themselves; instead, they rely on add-on indexing schemes built by other tools, such as Harvest [11] and WWWW [12] for WWW and Veronica [13] for Gopher. The indexing schemes implemented in these systems identify relevant servers by matching keywords between their database representatives and user queries. They do not deal with the vocabulary problem. This paper describes how to apply LSI in such systems.

4 LSI in the Client-Directory-Server Model

In a distributed environment, each database has a different set of documents and term usage. Thus, the correlations represented by the decomposed vectors of each database have completely different meanings, which creates a potential problem for using LSI in Internet resource discovery systems: if we simply collect the document and term vectors from all the databases at the directory of services, there is no way to determine the relevant servers for a user query. To solve this problem, we propose the Second-Level LSI and describe how it handles the synonymy and polysemy problems.

4.1 Synonymy

When applying LSI in the client-directory-server environment, we could collect the document vectors from all the databases and perform the SVD at the directory of services. This approach is easy to implement, but it suffers from high communication overhead for transmitting all the document vectors from the servers to the directory of services. In addition, the directory of services needs a huge space to store the document vectors from all the databases, equivalent to the sum of their space requirements.

To alleviate this problem, we can summarize the documents of a database in a uniform way that can be easily understood and verified by machine. Using existing clustering algorithms [14], a database can be divided into clusters, where each cluster contains documents sharing a number of common terms. The average of all the document vectors in a cluster is used as its representative (also called the "cluster vector"). The whole database is represented by the union of all its cluster representatives in the form of a term-cluster matrix. Figure 4 shows the term-document matrix of Example 1 and its associated term-cluster matrix.

The directory of services collects the cluster representatives from all the databases and builds a superset matrix. These cluster representatives are treated as individual documents and decomposed by SVD. The decomposed term and cluster vectors are used for query processing just like the decomposed term and document vectors in the original LSI. Figure 5 shows the architecture of the Second-Level LSI.
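A minimal sketch of this summarization step, under the same assumptions as the earlier snippets (cluster_vectors is a hypothetical helper of ours; A and lsi_decompose come from the Example 1 sketch):

```python
import numpy as np
# (A and lsi_decompose as defined in the earlier sketches)

def cluster_vectors(A, clusters):
    """Summarize a term-document matrix A as a term-cluster matrix.

    clusters is a list of document-index lists; each cluster
    representative is the average of its member document vectors
    (columns of A)."""
    return np.column_stack([A[:, idx].mean(axis=1) for idx in clusters])

# Clusters from Figure 4: c1 = {d1, d2, d3}, c2 = {d4}, c3 = {d5}.
C = cluster_vectors(A, [[0, 1, 2], [3], [4]])   # 8 x 3 term-cluster matrix

# At the directory of services, the term-cluster matrices collected from
# all servers form the superset matrix, which is then decomposed by SVD.
# (With several servers, the matrices would first be aligned on a common
# vocabulary; this toy example has only one server.)
term_vecs, cluster_vecs = lsi_decompose(C, k=2)
```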

[Figure 4: (a) The term-document matrix of Example 1, and (b) its associated term-cluster matrix. Clusters c_1, c_2, and c_3 consist of the document sets {d_1, d_2, d_3}, {d_4}, and {d_5}, respectively. Each cluster representative is the average of its component document vectors.]

4.2 Polysemy

In Deerwester's experiments [2], LSI deals nicely with the synonymy problem but offers only a partial solution to the polysemy problem. The deficiency results from the fact that a word with multiple meanings is represented as a weighted average of all its different meanings, where each meaning is a point in the semantic space.

To reduce the polysemy problem in the client-directory-server environment, we add multiple taxonomies to our system to help classify query and document terms. A taxonomy is a predefined classification scheme, such as the ACM classification system [15] or the IEEE INSPEC thesaurus [16]. For each taxonomy we generate a "term-pseudo-document matrix", where each pseudo document is a set of sub-category names or synonymous terms within the same category. For example, consider the following fragment of the ACM classification system [15]:

    C.2.3 Network Operations
          Network management
          Network monitoring
          Public networks
    C.2.4 Distributed Systems
          Distributed applications
          Distributed databases
          Network operating systems

[Figure 5: Second-Level LSI system architecture, where m_i and n_i are the numbers of terms and clusters in database i (1 <= i <= 3). The term-cluster matrix in the directory of services is the superset of those matrices in the three servers, where m and n are the numbers of its total terms and clusters, and k is the number of orthogonal indexing dimensions used by SVD.]

From this fragment we generate three pseudo documents d_1, d_2, and d_3:

    d_1 = {distributed, network, operations, systems}
    d_2 = {management, monitoring, network, networks, public}
    d_3 = {applications, databases, distributed, network, operating, systems}

where d_1 is the union of terms in the category headings C.2.3 and C.2.4, and d_2 and d_3 are the unions of terms in the sub-category headings under C.2.3 and C.2.4, respectively. Similarly, we can generate a term-pseudo-document matrix to represent the whole taxonomy.

By merging the superset matrix with the term-pseudo-document matrix at the directory of services and applying SVD to the result, we can create a customized directory of services, which contains decomposed term and cluster vectors biased toward the terminology used by the merged taxonomy. Below, we define the merge operation and use Example 2 to demonstrate the customization. Figure 6 shows our proposed system with multiple taxonomies.

Definition 1 Let A and B be two term-document matrices, where A consists of m_1 terms (rows) and n_1 documents (columns), and B consists of m_2 terms (rows) and n_2 documents (columns). Let T_A and T_B be the sets of terms in A and B, respectively. If merging A with B yields C, denoted A ⊕ B = C, then C is a term-document matrix consisting of m terms (rows) and n documents (columns), where n = n_1 + n_2 and m = |T_A ∪ T_B|. The counting measure |·| gives the size of a set.

Example 2 Assume p_1 and p_2 are two pseudo documents created from the ACM taxonomy, where

    p_1 = {t_4, t_8},   p_2 = {t_2, t_6}.

We generate a term-pseudo-document matrix from p_1 and p_2 and merge it with the term-cluster matrix in Figure 4(b). The result is shown in Figure 7. To examine the effect of merging a taxonomy, we calculate the Euclidean distance between clusters c_i and c_j, denoted dist(c_i, c_j), before and after the merging. Table 2 and Figure 8 show the distances and 2-dimensional coordinates of the clusters when merging with the ACM taxonomy.

In Table 2, the distance between clusters c_1 and c_2 decreases by 11.68%. This change occurs because pseudo document p_1 contains both t_4 and t_8, which increases the correlation between any cluster containing either t_4 (such as c_1) or t_8 (such as c_2). Similarly, the distance between c_1 and c_3 decreases (by 48.40%) because p_2 contains t_2 and t_6. Clusters c_2 and c_3 have no common terms with the same pseudo document, so their correlation decreases, i.e., dist(c_2, c_3) increases.

From the example above, we can see that the distance changes reflect the correlations between documents or clusters in the semantic space. When merged with a taxonomy, clusters having common terms with the same pseudo documents move toward each other. By the definition of the cosine similarity measure [4], all the clusters within a predefined angle of the query are relevant to it.
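The merge of Definition 1 reduces to aligning rows on the union of the two vocabularies and concatenating the document columns. Below is a sketch we added (the merge helper and its label bookkeeping are our own; C is the term-cluster matrix from the clustering sketch above):

```python
import numpy as np

def merge(A, terms_A, B, terms_B):
    """Definition 1: merge term-document matrices A and B into C.

    terms_A and terms_B label the rows of A and B. C has
    |T_A ∪ T_B| rows and n1 + n2 columns; entries for terms that a
    matrix lacks are zero."""
    terms_C = sorted(set(terms_A) | set(terms_B))
    row = {t: i for i, t in enumerate(terms_C)}
    n1, n2 = A.shape[1], B.shape[1]
    C = np.zeros((len(terms_C), n1 + n2))
    for i, t in enumerate(terms_A):
        C[row[t], :n1] = A[i]
    for i, t in enumerate(terms_B):
        C[row[t], n1:] = B[i]
    return C, terms_C

# Example 2: merge the term-cluster matrix C (clustering sketch above)
# with the pseudo documents p1 = {t4, t8} and p2 = {t2, t6}.
terms = [f"t{i}" for i in range(1, 9)]
P = np.zeros((8, 2))
P[[3, 7], 0] = 1.0   # p1 contains t4 and t8
P[[1, 5], 1] = 1.0   # p2 contains t2 and t6
merged, merged_terms = merge(C, terms, P, terms)   # 8 x 5, as in Figure 7
```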

[Figure 6: Second-Level LSI with multiple taxonomies, where m_i and n_i are the numbers of terms and clusters in database i (1 <= i <= 3). The term-cluster matrix in the directory of services is the superset of those matrices in the three servers, where m and n are the numbers of its total terms and clusters, and k is the number of orthogonal indexing dimensions used by SVD. In the two customized directories of services, (m_4, n_4) and (m_5, n_5) are the numbers of (terms, clusters) in the ACM and IEEE taxonomies; (m_6, n_6) and (m_7, n_7) are those numbers after merging with the superset matrix.]

[Figure 7: The term-cluster matrix merged with the ACM term-pseudo-document matrix, yielding columns c_1, c_2, c_3, p_1, p_2 over terms t_1 ... t_8.]

    cluster distance    change difference after merging
    dist(c_1, c_2)      -11.68%
    dist(c_1, c_3)      -48.40%
    dist(c_2, c_3)      +

Table 2: Cluster distances before and after merging with the ACM taxonomy. The "change difference" column shows whether, and at what rate, each distance increases ("+") or decreases ("-").

[Figure 8: The clusters in 2-dimensional space after merging with the ACM taxonomy. The ACM pseudo documents p_1 and p_2 are plotted together with the clusters c_i (1 <= i <= 3) before the merging, linked by dashed lines, and the clusters c'_i after the merging, linked by solid lines.]

Based on this, clusters close to each other are likely to be hit by the same query. Thus, by generating pseudo documents appropriately, we can alleviate the polysemy problem.

Our method can also be applied to a small information system, where each user pre-selects documents of interest as a "user profile" and merges it with the term-document or term-cluster matrix at the server. The server can then match documents and queries based on the terminology that its users are familiar with or use frequently.

5 Design of Experiments

To evaluate our solutions, we will conduct experiments on four standard document collections (MED, CISI, INSPEC, and CACM) for which queries and relevance judgments are available. We use the SVDPACKC package [17] to compute the SVD of the term-document matrices and measure the precision and recall ratios for the different methods.

In our experiments, each document collection acts as one server's database. We do not test LSI on a single server. Instead, we generate cluster representatives for each database, collect them at the directory of services, and use LSI to rank the relevant servers for each query. Our goal is to give high ranks to the servers that contain the most relevant documents. To verify the ranking estimated by the directory of services, we query each server with the same query set and rank the servers according to the numbers of documents they return. We calculate the Spearman rank-order correlation coefficient [18] to measure the closeness of the two rankings; if they are identical, the directory of services gives the user a perfect hint for selecting relevant servers.

To examine the effect of merging a taxonomy, we generate pseudo documents from the ACM taxonomy [15] and merge them with the documents in the CACM database. Because the CACM database consists of computer science documents, adding the ACM taxonomy should help cluster documents under the same category. We expect to see higher recall and precision ratios.
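A sketch of this verification step (our addition; the scores and document counts below are hypothetical, and scipy.stats.spearmanr stands in for the rank-correlation computation described in [18]):

```python
import numpy as np
from scipy.stats import spearmanr

# Ranking estimated by the directory of services: the best cosine between
# the query and each server's cluster vectors (hypothetical values).
estimated_scores = np.array([0.91, 0.40, 0.13, 0.72])

# Ground-truth ranking: how many relevant documents each server actually
# returned for the same query (hypothetical values).
returned_docs = np.array([27, 5, 2, 14])

# Spearman rank-order correlation of the two rankings;
# rho = 1.0 means the directory ranked the servers perfectly.
rho, _ = spearmanr(estimated_scores, returned_docs)
print(f"Spearman rank-order correlation: {rho:.2f}")
```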

6 Summary

We have proposed a new method based on Deerwester's Latent Semantic Indexing to solve the vocabulary problem in Internet resource discovery. We have designed two experiments to evaluate our method and expect to see improved results for both synonymy and polysemy.

References

[1] G. W. Furnas, T. K. Landauer, L. M. Gomez, and S. T. Dumais, "The vocabulary problem in human-system communication," Communications of the ACM, vol. 30, pp. 964-971, November 1987.

[2] S. Deerwester, S. T. Dumais, G. W. Furnas, T. K. Landauer, and R. Harshman, "Indexing by latent semantic analysis," Journal of the American Society for Information Science, vol. 41, pp. 391-407, September 1990.

[3] K. Obraczka, P. B. Danzig, and S.-H. Li, "Internet resource discovery services," Computer, vol. 26, pp. 8-22, September 1993.

[4] G. Salton and M. J. McGill, Introduction to Modern Information Retrieval. McGraw-Hill Book Company, 1983.

[5] G. Salton, Automatic Information Organization and Retrieval. McGraw-Hill Book Company, 1968.

[6] A. Emtage and P. Deutsch, "Archie: An electronic directory service for the Internet," in Proceedings of the Winter 1992 USENIX Conference, pp. 93-110, 1992.

[7] B. Kahle and A. Medlar, "An information system for corporate users: Wide Area Information Servers," ConneXions - The Interoperability Report, vol. 5, no. 11, pp. 2-9, 1991.

[8] T. Berners-Lee, R. Cailliau, J.-F. Groff, and B. Pollermann, "World-Wide Web: The information universe," Electronic Networking: Research, Applications and Policy, vol. 1, no. 2, pp. 52-58, 1992.

[9] M. McCahill, "The Internet Gopher protocol: A distributed server information system," ConneXions - The Interoperability Report, vol. 6, pp. 10-14, July 1992.

[10] P. B. Danzig, S.-H. Li, and K. Obraczka, "Distributed indexing of autonomous Internet services," Computing Systems, vol. 5, no. 4, pp. 433-459, 1992.

[11] C. M. Bowman, P. B. Danzig, D. R. Hardy, U. Manber, and M. F. Schwartz, "Harvest: A scalable, customizable discovery and access system," Technical Report CU-CS, University of Colorado, 1994.

[12] O. A. McBryan, "GENVL and WWWW: Tools for taming the Web," in Proceedings of the First International World-Wide Web Conference, May 1994.

[13] S. Foster, "About the Veronica service," electronic bulletin board posting on the comp.infosystems.gopher newsgroup, November 1992.

[14] P. Willett, "Recent trends in hierarchic document clustering: A critical review," Information Processing & Management, vol. 24, no. 5, pp. 577-597, 1988.

[15] J. E. Sammet and A. Ralston, "The new (1982) Computing Reviews classification system - final version," Communications of the ACM, vol. 25, pp. 13-25, January 1982.

[16] IEEE Service Center, INSPEC Thesaurus.

[17] M. W. Berry, T. Do, G. W. O'Brien, V. Krishna, and S. Varadhan, "SVDPACKC (version 1.0) user's guide," Technical Report CS, University of Tennessee, October 1993.

[18] M. Kendall and J. D. Gibbons, Rank Correlation Methods. Edward Arnold, fifth ed., 1990.