# Bitlist: New Full-text Index for Low Space Cost and Efficient Keyword Search

## Transcription

intuition, we define the problems of designing the optimal processing order such that the search processing time is minimized. After proving that the ordering problems are NP-hard, we design approximation algorithms. Finally, based on synthetic and real data sets (including input query logs and three real document sets), we conduct experiments to compare bitlist with the state of the art. The experimental results verify that bitlist outperforms the popular inverted list compression techniques [23, 22] and the recent work [8] in terms of space and search time.

The rest of the paper is organized as follows. First, we introduce the basic structure of bitlist in Section 2. Next, we give the details of optimizing the space of bitlist in Section 3 and design the keyword search algorithms in Section 4. After that, we evaluate bitlist in Section 5 and review related work in Section 6. Finally, Section 7 concludes the paper. Fig. 1 summarizes the main symbols and their meanings. Due to the space limit, we skip the proofs of theorems and refer interested readers to our technical report [18] for the full version of the paper.

| Symbol | Meaning |
| --- | --- |
| $t_i$, $d_j$ | $i$-th term ($1 \le i \le T$), $j$-th document ($1 \le j \le D$) |
| $\theta_{ij}$ | a binary bit indicating whether $d_j$ contains $t_i$ |
| $I_i$, $\lvert I_i \rvert$; $L_i$, $\lvert L_i \rvert$ | posting list of $t_i$, size of $I_i$; pair list of $t_i$, size of $L_i$ |
| $did_{ik}$, $eid_{ik}$ | docid number and encoded number of the $k$-th pair in $L_i$ |
| $B$, $q$, $w_i$ | base number, top-$k$ query condition, weight of $t_i$ in $q$ |
| $GS(d_j)$, $GS_u(d_j)$, $AGS(d_j)$ | goodness score, its upper bound, average goodness |
| $D'$, $\mathcal{D}'$ | number of docs re-assigned with new docids, docs in the tail cell |

Figure 1: Used symbols and their meanings

### 2. OVERVIEW

Section 2.1 defines the design objective and Section 2.2 presents the data structure of bitlist.

#### 2.1 Design Objective

Given a set of documents $d_j$ with $1 \le j \le D$, we assume that each document $d_j$ contains $\lvert d_j \rvert$ terms and that the $D$ documents contain $T$ terms $t_i$ in total, with $1 \le i \le T$. Our task is to design a full-text indexing structure for such documents. The design should meet two requirements: (i) low space to index the documents, and (ii) fast running time to answer boolean-based keyword search.

First, for a large number $D$ of documents, the inverted list structure incurs high space. To reduce the space, various compression techniques have been proposed, e.g., PForDelta coding [12, 25, 24, 23]. Such techniques typically increase the search processing time, because the original docids must be restored by decompression. Essentially, the compression is independent of the keyword search and does not optimize search efficiency. Instead, bitlist integrates compression and keyword search for both lower space and faster search.

Second, in terms of keyword search, state-of-the-art query processing systems involve a number of phases such as query parsing, query rewriting, and ranking by aggregation scores [23]. The aggregation scores frequently involve hundreds of features and need complex machine-learning techniques to compute. However, at the lower layer, such query processing algorithms fundamentally rely on extremely fast access to the index, which finds the documents containing the input terms. Furthermore, keyword search frequently involves boolean intersection and union operations over the input terms. To achieve the required processing time, the key is to quickly find the expected documents based on the boolean operations over the input terms.

In summary, the objective of this paper is to design a compact indexing structure with low space that efficiently answers boolean-based keyword search (i.e., union and intersection queries).

#### 2.2 Structure of bitlist

Before giving bitlist, we first introduce two alternative indexing schemes. For demonstration, these indexing schemes index the 12 documents $d_0 \ldots d_{11}$ shown in Table 1. For a specific document $d_j$, we assume that the docid of $d_j$ is the integer $j$.

| doc | terms | doc | terms | doc | terms |
| --- | --- | --- | --- | --- | --- |
| $d_0$ | $t_1, t_2, t_3$ | $d_4$ | $t_0, t_1$ | $d_8$ | $t_1, t_3$ |
| $d_1$ | $t_0, t_1, t_2, t_3$ | $d_5$ | $t_0$ | $d_9$ | $t_2, t_3$ |
| $d_2$ | $t_3$ | $d_6$ | $t_3$ | $d_{10}$ | $t_2$ |
| $d_3$ | $t_2$ | $d_7$ | $t_3$ | $d_{11}$ | $t_3$ |

Table 1: 12 documents with the associated terms

First, the inverted list maintains a directory of terms. Each term $t_i$ ($1 \le i \le T$) in the directory refers to a posting list $I_i$. $I_i$ contains the docids of the documents containing $t_i$ (each document is associated with a unique docid). We denote the size of $I_i$ by $\lvert I_i \rvert$.

Figure 2: Three indexing schemes: (a) inverted list; (b) a 0/1 matrix; (c) bitlist with $B = 4$; and (d) bitlist with $B = 12$

Example 1. Fig. 2(a) uses an inverted list to index the 12 documents. For example, the posting lists of the terms $t_0$ and $t_3$ contain 3 docids (i.e., 1, 4 and 5) and 8 docids (i.e., 0, 1, 2, 6, 7, 8, 9 and 11), respectively, to represent the documents containing these terms.

Second, we use a $T \times D$ matrix consisting of binary bits $\theta_{ij}$ ($1 \le i \le T$ and $1 \le j \le D$) to indicate whether a document $d_j$ contains a term $t_i$. If $d_j$ contains $t_i$, the bit $\theta_{ij}$ is 1, and 0 otherwise. We assume that the documents (i.e., the columns) in the matrix are associated with consecutive docids from 0 to $D - 1$.

Example 2. Fig. 2(b) uses a $4 \times 12$ matrix to indicate the occurrence of terms in documents. A problem of this scheme is the huge number of bits in the matrix. For example, for each of the terms $t_0$ and $t_3$, no matter how many documents contain the term, the matrix always uses exactly 12 bits (i.e., the number $D$ of indexed documents). This obviously incurs high space.

Third, we treat the proposed bitlist as a hybrid of the inverted list and the binary matrix, illustrated by the following examples.

Example 3. We divide the 12 columns of Fig. 2(b) by a base number $B = 4$, and have 3 cells in each of the 4 rows. For example, for the term $t_0$, the associated bits 010011000000 are divided into three cells 0100, 1100 and 0000.
The first cell 0100 is associated with the docids 0, 1, 2 and 3. By treating the bits 0100 as the binary form of the integer 4, we represent the four docids of the whole cell by a pair $\langle 0, 4 \rangle$, where the first number 0 is the leftmost docid in the cell and 4 is the encoded number. Similarly, the second cell 1100 and the third cell 0000 are represented by $\langle 4, 12 \rangle$ and $\langle 8, 0 \rangle$, respectively. Next, for the term $t_3$, the associated bits 111000111101 are divided into three cells 1110, 0011 and 1101. We use 3 pairs $\langle 0, 14 \rangle$, $\langle 4, 3 \rangle$ and $\langle 8, 13 \rangle$ to represent the docids of the three cells, respectively. As shown in Fig. 2(c), each term refers to a list of such pairs. Note that if the encoded number in a pair is zero (e.g., the pair $\langle 8, 0 \rangle$), we do not maintain the pair, in order to reduce the space cost.

Given the above pairs, we can restore the associated docids. For example, consider the pair $\langle 4, 12 \rangle$ in row $t_0$. The binary form of 12 is 1100. We then infer that the two documents with docids 4 and 5 contain $t_0$ (because the columns of the matrix carry consecutive docids from 0 to 11).

Example 4. To reduce the space of bitlist, we use a larger base $B = 12$ to divide the matrix of Fig. 2(b). In Fig. 2(d), each term refers to only 1 pair. In contrast, the terms $t_0$ and $t_3$ in Fig. 2(c) with $B = 4$ refer to 2 and 3 pairs, respectively.

Based on the above examples, we define the bitlist structure as follows. Specifically, bitlist maintains a directory of all $T$ document terms $t_i$. Each term $t_i$ in the directory refers to a list of $\langle did_{ik}, eid_{ik} \rangle$ pairs. Formally, we denote the list of pairs (or, equally, the pair list) of $t_i$ by $L_i$ as follows.

Definition 1. $L_i$ consists of a list of $\langle did_{ik}, eid_{ik} \rangle$ pairs sorted in ascending order of $did_{ik}$, where $did_{ik} \,\%\, B = 0$ and $eid_{ik} > 0$. If the $l$-th leftmost bit ($0 \le l \le B - 1$) in the binary form of $eid_{ik}$ is 1, the document with the docid $l + did_{ik}$ must contain $t_i$.

In the above definition, $did_{ik}$ is the minimal (or leftmost) docid encoded by the pair $\langle did_{ik}, eid_{ik} \rangle$, and $did_{ik} \,\%\, B = 0$ indicates that we use the pair to encode the bits associated with $B$ consecutive docids. In addition, for lower space, bitlist does not maintain the pairs $\langle did_{ik}, eid_{ik} \rangle$ with $eid_{ik} = 0$.

As shown above, we use a simple yet efficient coding scheme that treats the bits of $B$ consecutive docids as the binary form of $eid_{ik}$. Based on the definition, we can decode the pair $\langle did_{ik}, eid_{ik} \rangle$ back to the original docids. Given the binary form of $eid_{ik}$, we determine whether each bit is 1 or 0. If the $l$-th leftmost bit ($0 \le l \le B - 1$) is 1, the document with the docid $did_{ik} + l$ must contain $t_i$ (since the docids of the matrix columns are consecutive); otherwise it does not contain $t_i$.

We are interested in the size $\lvert L_i \rvert$ of a pair list $L_i$ and, in particular, in how $\lvert L_i \rvert$ compares with the size $\lvert I_i \rvert$ of the posting list $I_i$.

Theorem 1. For any term $t_i$, $\lvert I_i \rvert / B \le \lvert L_i \rvert \le \lvert I_i \rvert$ holds.

We have introduced the basic structure of bitlist. Nevertheless, the basic structure is far from optimal in space cost and search efficiency. In the rest of the paper, we present solutions to further optimize bitlist in terms of space cost (Section 3) and search efficiency (Section 4).

### 3. SPACE OPTIMIZATION

In this section, to optimize the space cost of bitlist, we first highlight the basic idea (Section 3.1), then define the optimization problem and show its complexity (Section 3.2). Next, we present the algorithm details (Section 3.3), give a practical design (Section 3.4), and finally report the maintenance of bitlist (Section 3.5).

#### 3.1 Basic Idea

In this section, we give an overview of two techniques to save the space of bitlist.

(i) DocID re-assignment: The space of bitlist depends on the non-zero eids, and thus the key is to reduce the number of non-zero eids. To this end, we propose to re-assign a new docid to every document $d_j$, as illustrated by Fig. 3. Before the re-assignment, row $t_0$ needs 2 pairs $\langle 0, 4 \rangle$ and $\langle 4, 12 \rangle$ (with $B = 4$).

Figure 3: Re-assignment of docids

To reduce the number of non-zero eids in this row, we exchange the docids between $d_0$ and $d_4$ and the docids between $d_2$ and $d_5$. After the re-assignment, the row $t_0$ holds all 0-bits for the consecutive new docids from 3 to 11 (and all 1-bits for the consecutive new docids 0, 1 and 2). The pair list of $t_0$ maintains only one pair $\langle 0, 14 \rangle$. The re-assignment for a single row is simple.
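The encoding of Definition 1 and the single-row re-assignment of Fig. 3 can be sketched in a few lines of Python. This is only an illustrative sketch: the function names `encode` and `decode` and the dictionary `new_id` are assumptions made for this example and are not taken from the paper. The posting lists are those of $t_0$ and $t_3$ in Table 1.

```python
def encode(posting, B):
    """Build the pair list [(did, eid), ...] of a posting list for base B."""
    cells = {}
    for docid in posting:
        did = docid - docid % B      # leftmost docid of the cell, did % B == 0
        l = docid - did              # position in the cell, 0 = leftmost bit
        cells[did] = cells.get(did, 0) | (1 << (B - 1 - l))
    # Cells without any 1-bit are never created, so every eid is > 0.
    return sorted(cells.items())


def decode(pairs, B):
    """Restore the sorted docids from a pair list."""
    docids = []
    for did, eid in pairs:
        for l in range(B):
            if eid & (1 << (B - 1 - l)):   # l-th leftmost bit is 1
                docids.append(did + l)
    return docids


t0 = [1, 4, 5]
t3 = [0, 1, 2, 6, 7, 8, 9, 11]
print(encode(t0, 4))   # [(0, 4), (4, 12)]
print(encode(t3, 4))   # [(0, 14), (4, 3), (8, 13)]

# Fig. 3: exchange the docids of d0 and d4, and of d2 and d5.
new_id = {0: 4, 4: 0, 2: 5, 5: 2}
t0_new = sorted(new_id.get(d, d) for d in t0)
print(encode(t0_new, 4))   # [(0, 14)]
```

With $B = 12$, `encode(t0, 12)` returns a single pair, as in Example 4, and for both lists the number of pairs stays between $\lvert I_i \rvert / B$ and $\lvert I_i \rvert$, in line with Theorem 1.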
However, given $T$ terms and $D$ documents in total, the global re-assignment is rather challenging. Fig. 3 does optimize the space of row $t_0$. Yet for row $t_1$, the re-assignment does not reduce the associated space cost (it still uses 3 pairs). A similar situation occurs for rows $t_2$ and $t_3$. Consequently, the re-assignment for all $T$ pair lists is challenging. Our purpose is to optimize the overall space of bitlist by minimizing the total number of non-zero eids.

(ii) Setting the base $B$: Example 4 indicates that a larger base $B$ might save more space. However, as shown above, a local re-assignment can hardly optimize the global space of all pair lists. Thus, a larger $B$ does not necessarily mean smaller space. Moreover, the bit length is limited (e.g., an integer eid typically allows 32 bits). Finally, we can either set a fixed $B$ for all pair lists or let each pair list adaptively decide its own $B$. In view of these issues, we will give a practical design that sets $B$ as a tradeoff between space and search processing time.

#### 3.2 Problem Definition and Complexity

In this section, we assume a fixed base $B = 32$, which follows the limit of the machine word length. Under this assumption, we first formally define the docid re-assignment problem that minimizes the overall space, and then prove that it is NP-hard.

Before giving the formal problem definition, we note that the key is to explicitly measure the overall space in terms of the number of non-zero eids. To this end, we define three binary parameters.

Doc Membership: Given the $D$ documents $d_j$ ($1 \le j \le D$) and the associated $T$ terms $t_i$ ($1 \le i \le T$), we define a binary coefficient $\alpha_{ij}$:

$$\alpha_{ij} = \begin{cases} 1 & \text{if } d_j \text{ contains } t_i; \\ 0 & \text{otherwise.} \end{cases}$$

Group Membership: Recall that, based on a fixed base $B$, we divide the $D$ columns (i.e., documents) into $R = \lceil D/B \rceil$ groups (denoted by $R_r$ with $1 \le r \le R$). Each group $R_r$ has $B$ member docids. Note that in case $D \,\%\, B \neq 0$, where $\%$ denotes the modulus operator, one group has cardinality $D \,\%\, B$. Given the membership of $R_r$, we define the second binary parameter $\beta_{jr}$ as follows:

$$\beta_{jr} = \begin{cases} 1 & \text{if the docid of } d_j \text{ is inside } R_r; \\ 0 & \text{otherwise.} \end{cases}$$

Based on the above $\alpha_{ij}$ and $\beta_{jr}$, we define the third binary parameter $\gamma_{ir}$ as follows:

$$\gamma_{ir} = \begin{cases} 1 & \text{if there exists a doc } d_j \text{ satisfying both } \alpha_{ij} = 1 \text{ and } \beta_{jr} = 1; \\ 0 & \text{otherwise.} \end{cases}$$

The parameter $\gamma_{ir} = 1$ indicates the existence of documents $d_j$ satisfying two requirements: (i) the docids of $d_j$ are inside $R_r$, i.e., $\beta_{jr} = 1$, and (ii) the documents $d_j$ contain the term $t_i$, i.e., $\alpha_{ij} = 1$. Thus, $\gamma_{ir} = 1$ means that $R_r$ contains at least one member docid whose document contains $t_i$.

Based on these parameters, we measure the overall space of bitlist by the total number of non-zero eids. First, for the pair list of $t_i$, if the group $R_r$ is associated with $\gamma_{ir} = 1$, the cell of group $R_r$ contains at least one 1-bit, and we have a non-zero eid. For the pair list of $t_i$ with $R$ cells, the number of non-zero eids is $\lvert L_i \rvert = \sum_{r=1}^{R} \gamma_{ir}$. Given $T$ pair lists, we have $\sum_{i=1}^{T} \left( \sum_{r=1}^{R} \gamma_{ir} \right)$ non-zero eids in total. Now we define the following re-assignment problem.

Problem 1. Given the $D$ documents $d_j$, the associated $T$ terms $t_i$, and the binary coefficient $\alpha_{ij}$ with $1 \le j \le D$ and $1 \le i \le T$, we want to configure the parameter $\beta_{jr}$ (with $1 \le r \le R$) such that the overall space $\sum_{i=1}^{T} \left( \sum_{r=1}^{R} \gamma_{ir} \right)$ is minimized.

Problem 1, reducible from the SET BASIS problem [11], is NP-hard (see the report [18] for the proof). Following the previous work [21, 15], the SET BASIS problem is equivalent to the minimal biclique cover (MBC) problem and is hard to approximate. For example, results of Simon [2] and of Lund and Yannakakis [16] have shown that there is no polynomial-time approximation for MBC (and SET BASIS) with factor $n^{\delta}$ for $\delta > 0$ unless P = NP.

#### 3.3 Heuristics

Since Problem 1 is NP-hard, we propose a heuristic algorithm based on the following idea. The algorithm runs $D$ iterations, each of which re-assigns a new docid to one carefully selected document. In each iteration, we select the best document $d_j$ among the remaining unprocessed documents: the selected $d_j$, together with the already selected documents, causes the least space. To the selected $d_j$, we assign an incrementally larger new docid (starting from the integer zero). The re-assignment terminates when all documents have been re-assigned new docids.
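To make the objective of Problem 1 concrete, the following Python sketch counts the non-zero eids $\sum_i \sum_r \gamma_{ir}$ for a given docid assignment. The names `pairs_per_term`, `docs`, `identity` and `swapped` are assumptions made for this illustration. The data are the 12 documents of Table 1 with $B = 4$, and `swapped` is the exchange of Fig. 3.

```python
from collections import Counter


def pairs_per_term(docs, new_id, B):
    """Return a Counter mapping each term t_i to sum_r gamma_ir."""
    nonzero = set()                    # (term, group) pairs with gamma_ir = 1
    for old_id, terms in docs.items():
        r = new_id[old_id] // B        # group R_r that holds the new docid
        for t in terms:
            nonzero.add((t, r))
    return Counter(t for t, _ in nonzero)


# Table 1: docid -> terms
docs = {
    0: {"t1", "t2", "t3"}, 1: {"t0", "t1", "t2", "t3"}, 2: {"t3"}, 3: {"t2"},
    4: {"t0", "t1"}, 5: {"t0"}, 6: {"t3"}, 7: {"t3"},
    8: {"t1", "t3"}, 9: {"t2", "t3"}, 10: {"t2"}, 11: {"t3"},
}

identity = {j: j for j in docs}
swapped = dict(identity)
swapped.update({0: 4, 4: 0, 2: 5, 5: 2})   # exchange d0/d4 and d2/d5

before = pairs_per_term(docs, identity, 4)
after = pairs_per_term(docs, swapped, 4)
print(sorted(before.items()))  # [('t0', 2), ('t1', 3), ('t2', 2), ('t3', 3)]
print(sorted(after.items()))   # [('t0', 1), ('t1', 3), ('t2', 3), ('t3', 3)]
print(sum(before.values()), sum(after.values()))  # 10 10
```

On this data the exchange halves the cost of row $t_0$ but the total stays at 10 non-zero eids, which illustrates why a re-assignment tuned for one row does not by itself reduce the global space.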
In the rest of this section, we first define a new metric, the accumulation similarity of documents, to measure the goodness of the selected document $d_j$ (Section 3.3.1). Next, we formulate the problem of finding the best document as a classic top-$k$ ($k = 1$) query problem [1], and finally propose an efficient algorithm to solve the top-$k$ problem (Section 3.3.2). Please note that the accumulation similarity differs from the traditional pairwise document similarity [19, 5, 6]: the former defines the similarity of ($\ge 2$) documents, whereas the latter involves the similarity of a pair of documents.

3.3.1 Measuring the goodness of a document $d_j$

The goodness of a candidate document $d_j$ depends on the additional space incurred by $d_j$. For the document $d_j$, we have a bit $\theta_{ij} = 1$ for every $t_i \in d_j$ and $\theta_{ij} = 0$ for every $t_i \notin d_j$. When $d_j$ is selected, we append these 0/1 bits to the tail cell of each pair list (suppose that we have re-assigned new docids to $D'$ documents; the tail cell is then the $\lfloor D'/B \rfloor$-th one, holding the new docids from $B \cdot \lfloor D'/B \rfloor$ to $D'$). That is, when the selected $d_j$ is re-assigned a new docid, a 1-bit is appended to the tail cell in the pair list of each $t_i \in d_j$, and a 0-bit is appended to the tail cells in the pair lists of all other terms $t_i \notin d_j$.

Example 5. In Fig. 4, we need to select one of the 4 remaining documents $d_j \ldots d_{j+3}$ to re-assign a new docid. First, if $d_j$ is selected, a 1-bit is appended to the two pair lists of $t_0 \in d_j$ and $t_1 \in d_j$, respectively. Because the tail cell in each of the two pair lists consists of one 1-bit and one 0-bit, appending the 1-bit does not incur an extra non-zero eid. Next, appending a 0-bit to the pair lists of $t_2$, $t_3$ and $t_4$ does not affect their original eids.

Figure 4: Goodness of selecting documents. Base number $B = 4$; two documents $d_0$ and $d_1$ have been re-assigned consecutive docids, and four documents $d_j \ldots d_{j+3}$ still remain to be re-assigned new docids.

Second, if $d_{j+1}$ is selected, the same situation occurs for appending a 1-bit to the pair lists of $t_1$ and $t_2$, without incurring an extra non-zero eid. Next, if $d_{j+2}$ is selected, appending a 1-bit to the pair lists of $t_2$ and $t_3$ similarly does not incur an extra non-zero eid. Note that in the pair list of $t_3$, the tail cell consists of two 1-bits. We consider that appending a 1-bit to the pair list of $t_3$ saves more space. Generally, if a tail cell contains more 1-bits, appending a 1-bit to the cell could save more space.

Finally, if $d_{j+3}$ is selected, a 1-bit is appended to the pair lists of $t_3$ and $t_4$, and a 0-bit to the other three pair lists. As before, appending a 1-bit to the pair list of $t_3$ does not incur an extra eid. However, in the pair list of $t_4$, appending a 1-bit incurs an extra non-zero eid, because the current cell in the pair list of $t_4$ contains two 0-bits. Thus, the selection of $d_{j+3}$ incurs one extra non-zero eid.

So far, the goodness of $d_j$ depends on the following two benefits. (i) Column benefit: among the $\lvert d_j \rvert$ tail cells associated with $t_i \in d_j$, some of the appended 1-bits incur extra non-zero eids (e.g., the 1-bit appended to the pair list of $t_4 \in d_{j+3}$) and others do not (e.g., the 1-bits appended to the pair list of $t_0 \in d_j$ and to the pair list of $t_3 \in d_{j+3}$). Thus, we are interested in the number of extra non-zero eids caused by the appended 1-bits. (ii) Row benefit: for the 1-bits appended to the pair list of $t_0 \in d_j$ and to the pair list of $t_3 \in d_{j+3}$, the latter case saves more space than the former. Intuitively, when the tail cell contains more 1-bits, we have the chance to save more space.

Based on the two benefits, we define the following formula to quantify the goodness score of $d_j$:

$$GS(d_j) = \sum_{i=1}^{\lvert d_j \rvert} \vartheta_{t_i} \cdot bs(t_i) \qquad (1)$$

In the above formula, $\vartheta_{t_i}$ is a binary coefficient decided by whether the tail cell of $t_i \in d_j$ contains at least one 1-bit. It is related to the column benefit. $bs(t_i)$ is the number of 1-bits inside the tail cell of $t_i$. It is related to the row benefit. Based on the above equation, in Example 5, $d_{j+2}$ has the largest goodness score, 3. We thus select the document $d_{j+2}$ as the best document.

Note that if the widely used pairwise similarity [19, 5, 6] is adopted, we compute the similarity of a pair of documents. Following Fig. 4, the pairwise similarity between the document $d_1$ and any of the four remaining documents is 1, so any remaining document can be selected as the best document. By referring back to Example 5, we easily verify that selecting $d_{j+2}$, instead of any other document, offers more benefit in terms of the space cost of bitlist.

Essentially, the above $GS(d_j)$ accumulates the pairwise similarity and is more informative than the pairwise similarity. In detail, suppose that $D'$ documents have already been re-assigned new docids. Then $D' \,\%\, B$ documents (i.e., those assigned the new docids ranging from $B \cdot \lfloor D'/B \rfloor$ to $D'$) are inside the tail cell. We denote these $D' \,\%\, B$ documents by $\mathcal{D}'$. $GS(d_j)$ is exactly the sum of the pairwise similarities between $d_j$ and each of the documents in $\mathcal{D}'$. Thus, we call $GS(d_j)$ the accumulation similarity of $d_j$. By incorporating both the row benefit and the column benefit, $GS(d_j)$ intuitively ensures that all 1-bits are densely encoded by a small number of eids, which leads to low space cost and fast keyword search.

Due to the above difference between the pairwise similarity and the accumulation similarity, the previous re-assignment schemes [19, 5, 6], which are based on pairwise similarity, are inapplicable to our problem (see the related work section for the reason). We propose an accumulation similarity-based algorithm as follows.

3.3.2 Algorithm Details

Top-$k$ query formulation: Intuitively, the accumulation similarity $GS(d_j)$ is an aggregation score function between a document $d_j$ and the documents $\mathcal{D}'$. We treat the selection of the best document $d_j$ as a top-$k$ ($k = 1$) query problem, illustrated by the following example.

Example 6. In Fig. 4, the documents $\mathcal{D}'$ (i.e., $d_0$ and $d_1$) contain four terms $t_0$, $t_1$, $t_2$ and $t_3$, with the non-zero weights 1, 1, 1 and 2, respectively (note that the weight $w_i$ of a term $t_i$ is equal to $bs(t_i)$ in Eq. 1, and $bs(t_i)$ is the number of documents inside $\mathcal{D}'$ containing $t_i$). These terms and weights together form the top-$k$ query condition $q$. The top-1 query problem is then to find a document $d_j$ among the remaining documents such that the goodness score between $q$ and $d_j$ is the largest. Since the goodness score between $q$ and $d_{j+2}$ is 3 and all other scores are 2, we select $d_{j+2}$ as the best document.

We might follow the classic TA/NRA algorithm [1] to find the top-$k$ ($k = 1$) document. To enable the TA/NRA algorithm, we assume that the remaining $(D - D')$ documents $d_j$ are indexed as a traditional inverted list with space $O\!\left(\sum_{j=1}^{D - D'} \lvert d_j \rvert\right)$, where $\lvert d_j \rvert$ is the number of terms inside $d_j$.
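The goodness score of Eq. (1) and the top-1 selection of Example 6 can be checked with a short Python sketch. The function name `goodness` and the string labels of the documents are assumptions made for this illustration. The weights are those stated in Example 6 and the term sets of the four candidates are those described in Example 5. Since $\vartheta_{t_i} = 1$ exactly when $bs(t_i) \ge 1$, the product $\vartheta_{t_i} \cdot bs(t_i)$ reduces to $bs(t_i)$, which is what the sketch sums.

```python
def goodness(doc_terms, weights):
    """Eq. (1): sum of bs(t_i) over the terms of the document.

    weights[t] holds bs(t) = w_t, the number of 1-bits in the tail cell of t;
    a term whose tail cell has no 1-bit contributes 0.
    """
    return sum(weights.get(t, 0) for t in doc_terms)


# Query condition q of Example 6 (weights of the terms in the tail cell).
weights = {"t0": 1, "t1": 1, "t2": 1, "t3": 2}

# Remaining candidates of Example 5.
remaining = {
    "d_j": {"t0", "t1"},
    "d_j+1": {"t1", "t2"},
    "d_j+2": {"t2", "t3"},
    "d_j+3": {"t3", "t4"},
}

scores = {name: goodness(terms, weights) for name, terms in remaining.items()}
best = max(scores, key=scores.get)
print(scores)  # {'d_j': 2, 'd_j+1': 2, 'd_j+2': 3, 'd_j+3': 2}
print(best)    # d_j+2
```

This is an exhaustive scan over all candidates; it reproduces the scores of Example 6 but does not include any early termination.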
In the inverted list, each posting list maintains docids sorted in ascending order (note that these docids are the old ones before the re-assignment of new docids; we denote the old docids by oids). Next, TA/NRA continuously scans the posting lists associated with the query terms $t_i \in q$ to select candidates of the final result, until the TA/NRA stopping condition is met. Once the final top-1 document $d_j$ is selected, we append the bits of the terms $t_i \in d_j$ to the tail cell for the construction of bitlist. When more documents are appended to the tail cell, $\mathcal{D}'$ contains more documents. Then the number $\lvert q \rvert$ of terms in $q$ becomes larger, and the TA/NRA algorithm correspondingly scans more posting lists, leading to higher overhead.

Basic Idea: We extend the TA/NRA algorithm for less processing overhead. The basic idea is as follows. We denote by $d_c$ the current candidate, i.e., the one with the largest goodness score $GS(d_c)$ among the already known candidates. If the score $GS(d_c)$ is larger than the goodness upper bound $GS_u(d_j)$ of any remaining document $d_j$, then $d_c$ must be the final top-$k$ ($k = 1$) document. Thus, the key is to measure the goodness upper bound $GS_u(d_j)$.

To this end, we sort the query terms $t_i \in q$ in descending order of the weights $w_i$, with $w_1 \ge \ldots \ge w_{\lvert q \rvert}$. After that, we process the associated posting lists in the order $t_1 \ldots t_{\lvert q \rvert}$. That is, the posting lists $I_i$ with higher weights $w_i$ are processed first. If all the documents inside the posting list of a term $t_i$ (associated with the weight $w_i$) have been processed, we mark the term $t_i$ as processed. We now leverage this processing order and derive the following theorem on the upper bound $GS_u(d_j)$.

Theorem 2. Assume that we begin to scan the posting list of a term $t_i$ associated with the weight $w_i$. For any document $d_j$ containing only unprocessed terms (i.e., terms from $t_i$ to $t_{\lvert q \rvert}$), the associated goodness score is at most $\sum_{i'=i}^{i+\bar{d}-1} w_{i'}$.

In the above theorem, the parameter $\bar{d}$ denotes the number of terms inside a document, and we use the average value of the test data in our experiments. Once the above upper bound $GS_u(d_j)$ is smaller than the goodness $GS(d_c)$ of the current top-$k$ ($k = 1$) candidate $d_c$, we return $d_c$ as the final result.

Improvement: We improve Theorem 2 by setting a tighter upper bound $GS_u(d_j)$ as follows. We note that Theorem 2 implicitly assumes that the terms from $t_i$ to $t_{i+\bar{d}-1}$ all appear inside the same document. In case no document contains all such terms, we correspondingly reduce $GS_{ub}$ to a smaller value. Thus, we propose the following technique to determine whether there exists a document containing the terms $t_i \ldots t_{i+\bar{d}-1}$.

Inside a posting list $I_i$ of $t_i$, we denote by $oid_i^L$ and $oid_i^U$ the oids of the currently processed document and of the final to-be-processed document, respectively. Since the member oids in $I_i$ are sorted in ascending order, the unprocessed docids in $I_i$ must be inside the interval $[oid_i^L, oid_i^U]$. For a term $t_{i'}$ with $i+1 \le i' \le i+\bar{d}-1$, we similarly have the interval $[oid_{i'}^L, oid_{i'}^U]$. If the overlap $[oid_i^L, oid_i^U] \cap [oid_{i'}^L, oid_{i'}^U]$ is empty, then no unprocessed document contains both terms $t_i$ and $t_{i'}$.

Now, if the terms $t_i$ and $t_{i'}$ do not appear in the same document, we consider the following two scenarios to set a smaller $GS_{ub}$:

- We assume that the $\bar{d}$ terms $t_{i+1} \ldots t_{i'} \ldots t_{i+\bar{d}}$ could appear inside the same document, and then have the new upper bound $\sum_{l=i+1}^{i+\bar{d}} w_l$. Intuitively, we slide the original terms $t_i \ldots t_{i'} \ldots t_{i+\bar{d}-1}$ down to the terms $t_{i+1} \ldots t_{i'} \ldots t_{i+\bar{d}}$.
- Alternatively, we assume that the $(i' - i)$ terms $t_i \ldots t_{i'-1}$ and the $(\bar{d} + i - i')$ terms $t_{i'+1} \ldots t_{i+\bar{d}}$ could appear inside the same document, and have the upper bound $\sum_{l=i}^{i+\bar{d}-1} w_l - w_{i'} + w_{i+\bar{d}}$.

Given the above two bounds, we use the larger one as $GS_u(d_j)$, which does not introduce false negatives.
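The early termination based on Theorem 2 can be sketched in Python as follows. This is a simplified illustration, not the authors' implementation: it uses only the basic bound of Theorem 2 and omits the interval-based tightening. The names `select_best`, `postings`, `doc_terms` and `d_bar` are assumptions made for this example, and the oids 10 to 13 are hypothetical labels for the four candidates of Example 5. The sketch sets `d_bar` to the maximum number of terms per document so that the bound is always safe; with an average value, as used in the paper's experiments, the bound becomes a heuristic.

```python
def select_best(weights, postings, doc_terms, d_bar):
    """Return (best oid, its goodness score, number of posting lists scanned).

    weights:   term -> w_i > 0 (the query condition q)
    postings:  term -> ascending list of oids of unprocessed documents
    doc_terms: oid -> set of terms of that document
    d_bar:     assumed cap on the number of terms per document
    """
    terms = sorted(weights, key=weights.get, reverse=True)  # w_1 >= ... >= w_|q|
    ws = [weights[t] for t in terms]
    best, best_score, seen, scanned = None, -1, set(), 0
    for i, term in enumerate(terms):
        # Unseen documents contain none of the already processed terms, so
        # their score is at most the sum of the next d_bar weights.
        upper = sum(ws[i:i + d_bar])
        if best_score > upper:
            break
        scanned += 1
        for oid in postings.get(term, []):
            if oid in seen:
                continue
            seen.add(oid)
            score = sum(weights.get(t, 0) for t in doc_terms[oid])
            if score > best_score:
                best, best_score = oid, score
    return best, best_score, scanned


weights = {"t0": 1, "t1": 1, "t2": 1, "t3": 2}
doc_terms = {10: {"t0", "t1"}, 11: {"t1", "t2"}, 12: {"t2", "t3"}, 13: {"t3", "t4"}}

postings = {}
for oid in sorted(doc_terms):            # keeps each posting list ascending
    for t in doc_terms[oid]:
        postings.setdefault(t, []).append(oid)

print(select_best(weights, postings, doc_terms, d_bar=2))  # (12, 3, 1)
```

In this run only the posting list of $t_3$ is scanned: the candidate found there scores 3, while any document not yet seen can score at most $1 + 1 = 2$.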
Algorithm 1: Select best candidate (input: bitlist and inverted list)

1. build the document selection query $q$ consisting of $\lvert q \rvert$ pairs $\langle t_i, w_i \rangle$ with $w_i > 0$;
2. sort the pairs $\langle t_i, w_i \rangle$ such that $w_1 \ge \ldots \ge w_{\lvert q \rvert}$;
3. denote $GS_{ub}$ to be the goodness score upper bound of an unprocessed doc;
4. initiate a variable $d_c$ to be the candidate result;
5. for $i = 1 \ldots \lvert q \rvert$ do
6. │ $GS_{ub} = \sum_{i'=i+1}^{i+\bar{d}} w_{i'}$, i.e., the sum of the top-$\bar{d}$ weights of unprocessed terms;
7. │ foreach unprocessed doc $d_j$ in the inverted list of $t_i$ do
8. │ │ if $GS(d_j) > GS(d_c)$ then $d_c = d_j$;
9. │ foreach term $t_{i'}$ from $t_{i+1}$ to $t_{i+\bar{d}}$ do
10. │ │ if $[oid_i^L, oid_i^U] \cap [oid_{i'}^L, oid_{i'}^U] ==$ null then
11. │ │ │ $GS_{ub}$ = the larger one of $\sum_{l=i}^{i+\bar{d}-1} w_l - w_{i'} + w_{i+\bar{d}}$ and $\sum_{l=i+1}^{i+\bar{d}} w_l$;
12. │ if $GS(d_c) > GS_{ub}$ then break;
13. post-process the input bitlist and inverted lists, and return $d_c$;

Pseudocode: Based on the above improvement, we report the pseudocode in Alg. 1. First, we build the query condition $q$ based on the tail cells. Since only the terms with weights $w_i > 0$ contribute to the goodness score, the input query $q$ contains only the terms with $w_i > 0$. Next, we sort the weights $w_i$ in descending order. After that, we define two variables $GS_{ub}$ and $d_c$ to be the goodness score upper bound of any unprocessed document and a candidate for the final chosen document, respectively. Next, for each document $d_j$ inside the currently processed $I_i$, we determine whether $GS(d_j) > GS(d_c)$ holds. If so, we let $d_c = d_j$. After that, among the terms $t_{i+1} \ldots t_{i+\bar{d}-1}$ that contribute to $GS_{ub}$, we verify whether there exists a document containing all such terms from $t_i \ldots t_{i+\bar{d}-1}$ by the for loop in lines 9-11. In the loop, we follow the aforementioned idea to determine whether the

9 TREC AP TREC WT Blog data Space Intersection Union Space Intersection Union Space Intersection Union Inv. List MB.99 ms.962 ms 39.3 MB.231 ms.194 ms MB.511 ms.225 ms IntGroup MB.553 ms MB.937 ms MB.612 ms RanGroupScan MB.792 ms MB.195 ms MB.255 ms Kamikaze MB.168 ms.269 ms 1.66 MB.512 ms.1957 ms MB.626 ms.325 ms Zip 4.61 MB ms ms 5.56 MB ms ms 38.8 MB ms ms bitlist-fix 6.25 MB.816 ms.836 ms 7.25 MB.15 ms.116 ms MB.45 ms.169 ms bitlist-dyn 4.82 MB.14 ms.112 ms 5.95 MB.452 ms.348 ms MB.593 ms.38 ms Table 2: Space and query processing time on three real data sets and generate 111,772 files with on average unique terms per file. The above three data sets were stemmed with the Porter algorithm and common stop words e.g., the and and were removed. Query log: We conduct keyword search based on a query log file (with 8, queries) of a search engine. For each query, we conduct intersection and union over the above three data sets. In the query log, the average number of terms per query is The largest number of terms per query is 11. We note that though around 38.13% queries contain only 1 term, the remaining queries contain at least 2 terms. As a result, the optimal order of processing input terms is still useful to improve the efficiency of those queries containing multiple terms Synthetic Data Set Documents: We generate documents to measure how bitlist performs well on three key parameters (similarities s of documents, number d j of terms in documents d j, and number D of documents to be indexed). Specifically, to generate a document, we first follow the Zipf distribution to generate the number d j with a given maximal number of terms. Next, depending on the similarity s, we select the document term by probability s among 1, available subject words, and the remaining terms among a very huge amount (2 32 1, ) of words. In this way, a larger s indicates that these generated documents more similarly contain the subject words. 
Queries: Depending on the query length, we randomly select the average three query terms per query among the 1, subject words Metrics and Counterparts We use the following counterparts (See Section 6 for the introduction of [23, 22, 8]). (i) Kamikaze: We use the LinkedIn s open source implementation of [23, 22], namely Kamikaze version 3..6 [1, 2] released on Jan. 4th, 212. Kamikaze supports the optimized PForDelta compression [23, 22] over sorted arrays of docids and docid set operations (intersection and union). (ii) Inverted list: We use Kamikaze s Integer arraylist to implement the uncompressed inverted list and then conduct the boolean-based keyword search (intersection and union). (iii) Zip: For comparison, we use the Zip compression package provided by JDK to compress the sorted arrays of docids (offered by Kamikaze) and measure the space. Next, we then decompress the compressed docids before we conduct the docid set operations. (iv) IntGroup and Ran- GroupScan [8]: Following the previous work [8], we implement the the fixed-width partition algorithm (called by IntGroup) and the efficient and simple algorithm based on randomized partitions (called RanGroupScan). Since the work [8] only supports intersection searches, we ignore the evaluation of IntGroup and RanGroup- Scan for union searches. Since Kamikaze is implemented by Java, for fairness, we implement IntGroup and RanGroupScan and the proposed bitlist-fix and bitlist-dyn also by JDK 1.6. (64 bits). We measure the space and average query time of the above schemes, and are particularly interested in (i) how the two bitlist versions are comparable with Kamikaze and Zip in terms of the space, and (ii) how bitlist-fix and bitlist-dyn are comparable with Int- Group and RanGroupScan in terms of the query processing time. 5.2 Experiments on Real Data In this section, the baseline test first compares the bitlist-fix and bitlist-dyn (setting B = 32) with the four counterparts. 
Next, the sensitivity test varies the parameters (such as B and the term processing order) to study how the performance of bitlist responds to the changes of such parameters. All results in this section is based on the three real document sets. Baseline test: We report the results of the baseline test in Table 2. First for the TREC AP data, the Zip approach uses the least space and RanGroupScan has the fastest running time. Compared with the Zip approach, bitlist-dyn uses the slightly more space, but achieves a significantly smaller running time than Zip. It is because the running time of Zip is dominated by the decompression time (around 159 ms). Next compared with the Kamikaze approach, bitlist-fix uses 47.6% of Kamikaze s space, and meanwhile achieves 51.43% and 68.91% less running time for the intersection and union queries than Kamikaze. Meanwhile bitlist-dyn uses 22.89% less space than bitlist-fix, but with 21.53% and 25.36% more running time for the intersection and union queries than bitlist-fix. In terms of IntGroup and RanGroupScan, the associated high space is caused by the reverse mapping from hash bits to the original docids. Moreover, due to the existence of hash collusion, one hash bit could be associated with multiple docids, which harm the running time of IntGroup and RanGroupScan. When the number of documents ia larger, the accumulated chance of such collusion becomes higher, incurring higher running time. This is verified by the results using the large data set TREC WT and blog data sets, and bitlist-fix uses less running time than RanGroupScan. Next, for the TREC WT data, the two bitlist implementations consistently follow the similar results as the TREC AP data. Note that though TREC WT contains much more documents than TREC AP (with more than 1 folds of the number of documents), the used space of uncompressed inverted list is only 1.62 folds. 
This is because the average number of terms per document in TREC WT is much smaller than in TREC AP and, in particular, the overall number of distinct terms in TREC WT is even smaller than in TREC AP. In addition, for all approaches, the query processing time on TREC WT is larger than on TREC AP, because the average size of a posting list (i.e., the average number of docids per posting list) in TREC WT is much larger than in TREC AP.

Third, for the social blog data, the results are roughly consistent with those of the two TREC data sets. Nevertheless, the compression rates of Kamikaze and the two bitlist versions are relatively smaller than those on the two TREC data sets. By carefully analyzing the original inverted lists of the three data sets, we find that the average size per posting list in the social blog set, only 8.47, is much smaller than the average sizes of the TREC AP and WT data sets with and respectively, and meanwhile the number of unique terms in the social blog set, 1,31,15,
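The hash-collision effect noted in the baseline test can be illustrated with a simplified sketch. It is not the IntGroup or RanGroupScan algorithm of [8]: the hash function (`docid % 64`), the 64-bit word width and the helper names are assumptions, chosen only to show why a reverse mapping from hash bits to docids is needed and why collisions add verification work.

```python
WORD_BITS = 64  # assumed machine word width


def build_group(docids):
    """Hash a small group of docids into one word plus a reverse map.

    Returns (mask, reverse), where bit h of mask is set if some docid
    hashes to h, and reverse[h] lists every docid that hashed to h.
    """
    mask = 0
    reverse = {}
    for d in docids:
        h = d % WORD_BITS  # assumed hash function, for illustration only
        mask |= 1 << h
        reverse.setdefault(h, []).append(d)
    return mask, reverse


def intersect_groups(g1, g2):
    """AND the two words, then verify candidates through the reverse maps."""
    mask1, rev1 = g1
    mask2, rev2 = g2
    common = mask1 & mask2
    result = []
    h = 0
    while common:
        if common & 1:
            # A shared hash bit only signals a *possible* match; with
            # collisions, several docids per bit must be compared.
            candidates = set(rev2[h])
            result.extend(d for d in rev1[h] if d in candidates)
        common >>= 1
        h += 1
    return sorted(result)


g1 = build_group([3, 10, 67])    # 3 and 67 collide on bit 3
g2 = build_group([11, 67, 131])  # 67 and 131 collide on bit 3
matches = intersect_groups(g1, g2)
```

Here a single shared bit forces four docids to be inspected to find one true match; as more documents fall into each group, such collisions become more likely, which is consistent with the higher running time reported for the larger data sets.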
