Approximate String Membership Checking: A Multiple Filter, Optimization-Based Approach
Chong Sun 1, Jeffrey Naughton 2, Siddharth Barman 3
Computer Science Department, University of Wisconsin-Madison
1210 W. Dayton St., Madison, WI 53715, USA
1 [email protected] 2 [email protected] 3 [email protected]

Abstract—We consider the approximate string membership checking (ASMC) problem of extracting all the strings or substrings in a document that approximately match some string in a given dictionary. To solve this problem, the current state-of-the-art approach involves first applying an approximate, fast filter, then applying a more expensive exact verification algorithm to the strings that pass the filter. Correspondingly, many string filters have been proposed. We note that different filters are good at eliminating different strings, depending on the characteristics of the strings in both the documents and the dictionary. We suspect that no single filter will dominate all other filters everywhere. Given an ASMC problem instance and a set of string filters, we need to select the optimal filter to maximize the performance. Furthermore, in our experiments we found that in some cases a sequence of filters dominates any of the filters of the sequence in isolation, and that the best set of filters and their ordering depend upon the specific problem instance encountered. Accordingly, we propose that the approximate match problem be viewed as an optimization problem, and evaluate a number of techniques for solving this optimization problem.

I. INTRODUCTION

We consider a common and ubiquitous string membership checking problem: given a dictionary of strings and a collection of documents, find all occurrences of strings or substrings in the documents that approximately match some string in the dictionary.
String membership checking is used in many applications, including collecting customer feedback on products by scanning thousands of e-mails, aggregating evaluations of a movie from many reviews, and identifying mentions of entities to extract information from Web pages. For example, in DBLife [], the mentions of entities, e.g., researchers, publications and conferences, are located in many Web pages in the database community, and further structured information is collected to build the web portal. There has been a great deal of research on how to efficiently conduct both exact string membership checking [2], [3] and approximate string membership checking [4], [5], [6], [7], [8], [9]. In this paper, we focus on approximate membership checking. A naive approach to the string membership problem is to iteratively consider each document string (or substring), checking with the dictionary for matches. There is a growing consensus that a better way to do this is to adopt a filter-verification approach [9], in which some quick approximate filter is constructed based upon the dictionary strings; candidate strings in the documents are first passed through the filter, and only the strings that pass the filter are subjected to a more expensive verification against the actual dictionary strings. Many filters, including the length filter, count filter, position filter [8], prefix filter ([7], [9]), LSH filter [], token distribution filter (TDF) [], ISH filter [6], and others, have been proposed. These filters use only partial information about the dictionary strings (e.g., the length filter [8] uses string lengths while the prefix filter [7] uses prefixes of string tokens). Since different filters use different information, different filters are good at eliminating strings with different properties. We suspect that no single filter will dominate everywhere, as the properties of the strings to check affect the effectiveness and efficiency of the string filters.
For example, if each document string has far fewer tokens than every dictionary string, then the length filter may be the best filter to use; on the other hand, the length filter may not be efficient if each string to check has nearly the same number of tokens as some dictionary string. If it is indeed the case that no single filter dominates all other filters for all problem instances, then given an approximate string membership checking problem instance and a set of available string filters built over the dictionary strings, we need to select the optimal filter for this problem instance to maximize the overall membership checking performance. A further observation is that since filters take sets of strings as input and produce sets of strings as output, we can concatenate filters, so that, for example, a string membership checking algorithm could involve applying a filter f_1 to the document strings, then a filter f_2 to the strings that pass f_1, and then performing a final, exact check on only the strings that pass both f_1 and f_2. We found in our experiments that in many cases a sequence of filters dominates any of the filters in isolation, and that the choice of filters and their ordering in a pipeline affects performance. Because the choice of filters and their ordering in a pipeline affects performance, the multiple filter approach to string membership checking is an optimization problem. To the best of our knowledge, this optimization problem has not been studied in the literature; studying the problem is the goal of this paper. We describe our extended filter-verification framework with an optimizer for approximate string membership checking in Figure 1. The extended framework consists of two phases: the
build phase and the query phase.

Fig. 1. ASMC Framework with Optimizer

The build phase is offline, in which we build all the potentially useful string filters and verification operators based on the dictionary strings. Each string filter or verification operator is considered as a tool in the toolkit. More tools can be incrementally added into the toolkit as new filters and verification operators are developed. The query phase is an online phase, in which we conduct dictionary membership checking for the incoming document strings. The optimizer analyzes the characteristics of the document strings, the string filters, and the verification operators. As a result of this analysis, the optimizer determines which string filters to use, and how to order those filters. This optimization problem is similar to the expensive predicate placement problem in relational database optimization [2], [3], and the pipelined filter ordering problem in stream data processing [4], [5]. It is different, however, in that there is a final, exact verification step, so all of the filters are optional. We summarize our contributions in this paper as follows. We propose that the approximate string membership checking problem be viewed as an optimization problem. We study this optimization problem, and prove that in general it is NP-hard. We also study special cases of the problem, and develop several approximation algorithms for its solution. We conduct experiments with both synthetic and real data to evaluate (a) the premise that no single filter dominates, that multiple filters can dominate single filters, and that the order of filters affects performance, and (b) the efficacy of our proposed algorithms to solve the optimization problem. In the rest of the paper, we start with a survey of the related work in Section II. In Section III, we discuss the basics of the approximate membership checking problem and the filter-verification framework. We consider the approximate string membership checking problem as an optimization problem of selecting the optimal single filter and building the optimal filter pipeline using multiple filters in Section IV and Section V, respectively. Finally, we conduct a performance study in Section VI and conclude in Section VII.

II. RELATED WORK

Approximate string matching is a classic problem in computer science and many algorithms have been proposed for its solution [6]. As an extension to many-many comparisons, the approximate string join (or string similarity join) assumes that we have two string collections and identifies all the approximately matching string pairs, with one string from each collection. In [8], Gravano et al. exploit the existing facilities in commercial databases to support approximate string joins based on tokenizing each string into a set of q-grams. In [7], a database primitive query operator ssjoin (set similarity join) is implemented to handle string joins. To efficiently evaluate approximate string joins, most recent work adopts the filter-verification framework [9], in which we first apply a rough, fast filter, and only do a detailed check for similarity if a string in one collection passes the filter built on the signatures of strings in the other collection. Many signature schemes have been proposed. The count filter, length filter and position filter, presented in [8], use the edit distance similarity. The prefix filter was proposed in [7] and later extended in [9]. A novel signature scheme, PartEnum, based on partitioning and enumeration, was presented in [7]; by controlling the number of string partitions, it guarantees that any two approximately matching strings must be the same in at least one or more partitions. The token distribution filter in [] compares the token distributions of two strings to determine whether they can approximately match or not.
LSH (locality-sensitive hashing) [] exploits the property that similar strings have similar hash values, but it may not return all the approximately matching strings (the complete result). We do not consider filters that return an incomplete result in this work. Approximate string membership checking ([6], [8], [9], [2]) is another variation on the approximate string matching problem, in which we view one string collection as the dictionary and process a second collection of strings (or documents) to check if each string in the second collection approximately matches some dictionary string. In [9], efficient exact algorithms are proposed to conduct approximate string checking based on merging token inverted lists. In [6], the ISH (Inverted Signature-based Hashtable) structure is presented with the focus on reducing the filter checking time. In contrast to the inverted list, the ISH filter stores a list of string signatures, rather than the string ids, for each string token. In [2], the authors check whether a document string can approximately match a dictionary string by merging the inverted lists for all the tokens in the document string. They focus on progressive computation, discussed in [6], to avoid the redundant computation of merging the same set of inverted lists in checking all the substrings of one document string. Progressive computation is orthogonal to the filtering techniques and we do not consider progressive computation in comparing different filters in our work. Different filters have been used in an ad hoc fashion to accelerate the membership checking, e.g., the length
filter is used in [9], [2]. However, none of the previous work exploits the many existing string filters and uses them systematically. The work in [2], [22] focuses on using cosine similarity metrics based on TF/IDF to approximately match one string against the strings in a large database. As mentioned previously, we view the approximate membership checking problem as an optimization problem that is closely related to some work in query optimization [23], especially the predicate placement problem [2], [3] studied in relational database query optimization, and the pipelined set cover problem (or the filter ordering problem) studied in stream data processing [4], [5]. The pipelined set cover problem (or the min-sum set cover problem [24]) has been proven to be NP-hard. However, our problem differs from the predicate placement problem and the pipelined set cover problem in that the final verification step means that all filters are optional; the problem is not to apply all these filters in the optimal way, but rather to decide which of these filters to apply and how to order them so that the entire membership checking time is minimized.

III. PRELIMINARIES

In this section, we introduce the approximate string membership checking problem and describe the filter-verification approach.

A. Approximate String Membership Checking (ASMC)

Assume we have a string dictionary R. An input string s is a member of the dictionary R if and only if there exists a string s_r ∈ R such that sim(s, s_r) ≥ τ, where sim(s, s_r) is a string similarity function and τ is a similarity threshold. Many similarity metrics have been proposed, e.g., edit similarity and Jaccard similarity. Formally, we define the approximate string membership checking (ASMC) problem as in [6]:

Definition 1: Given a string dictionary R, a string similarity function sim() and a similarity threshold τ, find every string or substring s in a string document S such that there exists s_r ∈ R with sim(s, s_r) ≥ τ, in which case we say s approximately matches s_r.

B.
Filter-Verification Framework

A naive approach to the ASMC problem is to iteratively take each string or substring s in an input document and compute the similarity between s and every dictionary string, outputting s as a dictionary member if s approximately matches some dictionary string. Generally this approach is expensive. The filter-verification approach eliminates many candidate strings using filters before computing the string similarity function. As the name suggests, the approach includes two phases: filtering and verification. The filtering phase itself consists of two steps: building a filter over the dictionary strings, and checking the document strings against the filter. To conduct membership checking for a string document S and a dictionary R, we first build a filter f with R and then check each string in S with f. We say a string s passes a filter f if and only if f cannot guarantee that no string in R matches s. In the verification step, we compute sim(s, s_r) for each string s passing the filter f and each string s_r ∈ R. If sim(s, s_r) ≥ τ, we say s approximately matches s_r and s is a member of R. With an effective filter, many fewer candidate strings are checked than would be checked with the naive (no filter) approach.

C. The String Filters

Most string filters are built based on different string signature schemes, such as using the string length or the prefix [7] as the signature. A filter f built over a dictionary of strings consists of the signatures of all the dictionary strings, and typically also uses some index structure (e.g., the ISH structure [6]) to speed the application of the filter to the strings. We use two properties to characterize a filter: its filtering cost f_c and its filtering ratio f_r. Suppose we have a filter f, a set of n_in strings as input to f, that it takes time t to apply the filter to these strings, and that n_out strings pass f. Then the filtering ratio is (1 − n_out/n_in) and the filtering cost is t/n_in.
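As a concrete illustration, both properties can be measured directly by timing a filter over a batch of strings. The sketch below is not from the paper; `profile_filter` and the predicate-style filter interface are hypothetical, assuming a filter is exposed as a boolean pass/fail function:

```python
import time

def profile_filter(filter_fn, strings):
    """Measure the filtering ratio f_r and per-string filtering cost f_c.

    filter_fn(s) returns True when s PASSES the filter, i.e. the filter
    cannot rule out a dictionary match for s.
    """
    n_in = len(strings)
    start = time.perf_counter()
    survivors = [s for s in strings if filter_fn(s)]
    elapsed = time.perf_counter() - start
    f_r = 1 - len(survivors) / n_in   # filtering ratio: fraction eliminated
    f_c = elapsed / n_in              # filtering cost per input string
    return f_r, f_c, survivors
```

With a toy filter that only passes strings of length at least three, half of the inputs `["a", "ab", "abc", "abcd"]` are eliminated, so f_r = 0.5.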
Note that the same filter can have different f_c and f_r on different datasets, and different filters can have different f_r and f_c on the same dataset.

IV. USING A SINGLE STRING FILTER

In this section, we consider using a single string filter to solve the ASMC problem. As mentioned previously, many string filters have been proposed, and different filters have been designed based on different string information. As a result, the performance (filtering ratio and filtering cost) of a string filter depends on the characteristics of both the document strings and the dictionary strings. In the following, we analyze different properties of three string filters, the length filter, prefix filter and token distribution filter, to motivate our work on the optimization problem of selecting the optimal filter for a given ASMC problem instance.

Length Filter [8]: Assume we have two strings s_i = t_1^si, t_2^si, ..., t_l^si and s_j = t_1^sj, t_2^sj, ..., t_m^sj. We represent each string s as a sequence of tokens (e.g., words, phrases or q-grams) t_1, t_2, ..., t_l. Assume we use Jaccard similarity with a matching threshold τ to check strings. String s_i approximately matches s_j only if |s_i ∩ s_j| / |s_i ∪ s_j| ≥ τ, in which |s_i ∩ s_j| refers to the number of common tokens in s_i and s_j, and |s_i ∪ s_j| refers to the total number of distinct tokens in s_i and s_j. Correspondingly, we get that s_i matches s_j only if τ·|s_i| ≤ |s_j| ≤ |s_i|/τ. Checking two strings using the length filter means comparing the string lengths, which is very efficient. However, the length filter cannot differentiate strings with similar lengths. For example, two strings s_1 = t_1, t_2, ..., t_10 and s_2 = t_1, t_2, ..., t_12 approximately match based on the length filter.

Prefix Filter [7]: Suppose strings s_i and s_j have l_c common tokens. Based on Jaccard similarity, s_i matches s_j only if l_c/(l + m − l_c) ≥ τ. Accordingly, s_i and s_j must have at least l_c = (l + m)·τ/(1+τ) tokens in common to approximately match each other. Therefore, presuming we set the string
prefix length to be l_p, s_i approximately matches s_j only if there are at least (l_c + l_p − l) common tokens in the prefixes of s_i and s_j. Here, we assume all the tokens in each string follow a universal ordering, and the string prefix covers the first l_p tokens in the string. Checking two strings using the prefix filter means measuring the common tokens of the two strings in the string prefix. Therefore, the prefix filter cannot differentiate two strings which have many common tokens in the prefix but few common tokens in other parts of the strings. For example, two strings s_1 = t_1, t_2, ..., t_10 and s_3 = t_1, t_2, t_3, t_11, t_12, ..., t_17 approximately match according to the prefix filter, assuming we set the prefix length to be 3.

Token Distribution Filter []: Suppose there is a partitioning scheme φ on a string token space that splits all the tokens in the space into u partitions (P = {P_1, P_2, ..., P_u}). We use the number of tokens of a string in each partition as the token distribution of the string. The intuition is that any two approximately matching strings must have similar token distributions. Suppose a token space is split into u partitions (P = {P_1, P_2, ..., P_u}) and a string s_i has P_x^si tokens in partition P_x ∈ P. Any string s_j approximately matches s_i based on Jaccard similarity only if s_j has at least ((|s_i| + |s_j|)·τ/(1+τ) − M_out) tokens and at most (P_x^si + (|s_j| − |s_i|·τ)/(1+τ)) tokens in partition P_x, in which M_out = min(|s_i| − P_x^si, |s_j|). For example, suppose we have an ordered string token space P = [t_1, t_2, ..., t_20] split into four partitions: P_1 = [t_1, ..., t_5], P_2 = [t_6, ..., t_10], P_3 = [t_11, ..., t_15] and P_4 = [t_16, ..., t_20]. The token distribution of string s_4 = t_1, t_2, t_11, t_12, t_16, t_17 is P_1{2}, P_3{2}, P_4{2}, which says s_4 has 2 tokens in partition P_1, 2 tokens in P_3 and 2 tokens in P_4. String s_5 = t_3, t_4, t_13, t_14, t_18, t_19 has the same token distribution as s_4, even though many tokens are different in the two strings.
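The pass conditions of the first two filters are easy to state in code. The sketch below is illustrative only: the function names are invented, Jaccard is taken over token sets, and the prefix length follows the standard prefix-filtering scheme (|s| − ⌈τ·|s|⌉ + 1 with a shared-token test), which may differ in detail from the l_p convention used above:

```python
import math

def passes_length_filter(s_i, s_j, tau):
    # Necessary condition for Jaccard(s_i, s_j) >= tau:
    # tau * |s_i| <= |s_j| <= |s_i| / tau.
    return tau * len(s_i) <= len(s_j) <= len(s_i) / tau

def prefix(tokens, tau):
    # Prefix under a fixed global token ordering; a common choice of
    # prefix length for Jaccard threshold tau is |s| - ceil(tau*|s|) + 1.
    ordered = sorted(set(tokens))
    k = len(ordered) - math.ceil(tau * len(ordered)) + 1
    return set(ordered[:k])

def passes_prefix_filter(s_i, s_j, tau):
    # Necessary condition: the two prefixes share at least one token.
    return bool(prefix(s_i, tau) & prefix(s_j, tau))
```

Both checks are one-sided: a pair that fails cannot match, but a pair that passes (such as s_1 and s_3 above, which share only a prefix) still needs verification.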
In the example for the prefix filter, strings s_1 and s_3 have the same prefix but different token distributions.

A. The Single Filter Optimization Problem

We see that different string filters have been designed based on comparing different string information, such as string length, prefix or token distribution. The performance of a string filter depends on the characteristics of the dictionary strings and document strings. Accordingly, we define the optimization problem in the filter-verification approach to the ASMC problem using a single filter as follows. Given an ASMC problem instance (including a collection of documents and a dictionary of strings), and a set of available string filters and verification operators, the optimization problem of using a single filter is to select the optimal string filter and verification operator such that the overall checking time cost of first applying the filter and then the verification operator is minimized. Each filter affects the document strings that need to be checked by a verification operator, thus affecting the verification cost. Assume we have n string filters and m verification operators; to search for the optimal filter and verification operator, we explore a space of n × m execution plans.

B. Cost Estimation

To search for the optimal string filter and verification operator, we need to calculate the cost of using each possible plan over all the document strings. Assuming we use a filter f followed by a verification operator v in the plan, the time cost of checking m document strings is m·f_c + m·(1 − f_r)·c_v, in which f_c is the filtering cost, f_r is the filtering ratio, and c_v is the verification cost of checking each string passing the filter f. The problem is that we do not know f_c and f_r, and cannot afford to compute them exactly by testing every string with every filter. Accordingly, we need some inexpensive technique to estimate f_c and f_r.
We consider two main broadly used estimation approaches: estimation based on default values and system statistics ([25], [23]), and estimation based on sampling techniques ([26], [27], [28]). Unlike the case with relational database systems, in string matching problems statistics about the collections of strings and how they interact with filters may not be available. Furthermore, the performance of a filter depends on both the document and dictionary strings. We may have a fixed dictionary, but various documents to check. Accordingly, in this paper, we explore estimation based on sampling techniques [26], specifically, stratified sequential sampling. The basic idea of sequential sampling-based parameter estimation is as follows. Take estimating the filtering cost f_c of a single filter f, for example. We split the data (document strings) into m disjoint partitions and we divide the m partitions into k disjoint sets (strata). Then we sequentially sample over the strata. At each sampling step (the j-th step), we randomly and uniformly select one data partition (observation x_i^j) from each (the i-th) of the k strata, and use x_i^j as the input to the filter to compute the filtering cost f_c(x_i^j). We repeat this sampling for n steps until the following stopping condition is satisfied [26]:

Ṽ_n^c > 0 and ε·max( (1/(k·n))·Σ_{i=1}^{k} Σ_{j=1}^{n} f_c(x_i^j), d ) ≥ z_p·(Ṽ_n^c/(n·k))^{1/2}

where Ṽ_n^c is an unbiased and strongly consistent estimator of σ² = (1/k)·Σ_{i=1}^{k} σ_i² for the estimated filtering costs over the random observations; d is a parameter to control the number of steps of the sampling; z_p refers to Φ^{-1}((1+p)/2), where Φ is the cumulative distribution function for a standardized normal random variable and p is the confidence level for the estimation error bounded by ε_m·µ_d, with µ_d = max(|D|/m, d). We get the estimate of f_c as

f̂_c = (1/(k·n))·Σ_{i=1}^{k} Σ_{j=1}^{n} f_c(x_i^j)

with probability p for the error to be within ±ε_m·µ_d.
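A much-simplified version of this sequential estimator can be sketched as follows. This is not the exact stopping rule above: hypothetical names are used, stratification is dropped, and sampling stops once the normal-approximation confidence half-width at level p falls below ε times the running mean:

```python
import random
from statistics import NormalDist, mean, variance

def sequential_estimate(measure, partitions, eps=0.2, p=0.95, min_steps=5):
    """Sample partitions one at a time, averaging `measure` over them.

    Stops when the z_p-scaled standard error drops below eps times the
    running mean (a simplified analogue of the stopping rule above).
    """
    z_p = NormalDist().inv_cdf((1 + p) / 2)   # z_p = Phi^{-1}((1+p)/2)
    pool = list(partitions)
    random.shuffle(pool)                       # uniform random order
    obs = []
    for part in pool:
        obs.append(measure(part))
        n = len(obs)
        if n >= min_steps:
            half_width = z_p * (variance(obs) / n) ** 0.5
            if half_width <= eps * abs(mean(obs)):
                break
    return mean(obs)
```

When the per-partition measurements vary little, the estimator stops after only `min_steps` partitions, which is exactly the behavior that makes sampling cheap relative to scanning all document strings.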
Achieving small errors (small ε) and high confidence (high p) may require a lot of sampling; in some cases it may be
preferable to do less sampling at the expense of larger errors and less confidence. Unless specified otherwise, we use default values ε = 0.2 and p = 0.95.

C. Searching for the Optimal Filter

The brute force approach to searching for the optimal filter and verification operator considers every combination of each string filter and each verification operator, as the verification cost may depend on the strings passing the filter. For each combination of a string filter and a verification operator, we use the sequential sampling technique to estimate the cost of checking all the document strings. Then we output the string filter and verification operator with the lowest estimated cost. Assume we have n string filters and m verification operators; the time complexity of the brute force approach is O(n·m·s_cost), in which s_cost is the sampling cost of estimating the cost of a plan; s_cost varies with the parameter ε, the confidence level p in the sampling, and the cost of using the filters. As the major cost in the optimization is the sampling and we may need several samplings in the optimization, we use the number of samplings required to measure the cost of an optimization approach. An additional complication to the optimization problem is that the cost of applying the verification operator to a string may vary from string to string. In our work we make the simplifying approximation that the cost per string is uniform. In our experiments we found this to be substantially the case for the data we used; if this is not the case, it will introduce another source of error into the optimization process. Whether or not this error is substantial enough to cause optimization errors is an interesting area for future work. We design the algorithm for searching for the optimal filter and verification operator as follows. First, we estimate the cost per string for each verification operator, and keep the verification operator with the lowest estimated cost c_v.
Then, we estimate the filtering ratio f_r and filtering cost f_c of each string filter f. We select the filter that minimizes f_c + (1 − f_r)·c_v. The time complexity of this algorithm is O(n·s_cost + m·s_cost).

V. THE MULTIPLE FILTER OPTIMIZATION PROBLEM

As mentioned previously, we adopt the filter-verification framework to handle the ASMC problem. More importantly, since different filters are good at eliminating different document strings, and filters accept strings as input and produce strings as output, we can concatenate a sequence of filters in a pipeline to function as one composite filter. If we regard the final verification phase as a final, accurate filter, we can consider the problem to be one of finding the most efficient sequence of filters, ending with an accurate filter. Correspondingly, we extend the filter-verification framework for approximate string membership checking by adding an optimizer component. The optimizer selects the optimal sequence of filters, including a verification operator, based on the properties of the input strings and the available string filters. Then we use the selected string filters and the verification operator to conduct the membership checking for the input strings. In the rest of this section, we describe the optimization problem in detail and how we solve it.

A. Overview of Multiple Filter Optimization

The optimization problem presented by our multiple filter approach to the approximate string matching problem can be stated as follows. Given a dictionary of strings, a set of n filters F = {f_1, f_2, ..., f_n} and a set of m verification operators V = {v_1, v_2, ..., v_m} built on the dictionary strings, and a similarity function sim with a corresponding matching threshold τ, the optimization problem is to build a pipeline of k filters and a single verification operator, f̌_1, f̌_2, ..., f̌_k, v̌_p (f̌_i ∈ F and v̌_p ∈ V), to check the document strings with the least time cost.
We note that each filter pipeline should contain one and only one accurate filter (or verification operator), and it is only meaningful to have it at the end of the pipeline. By having one accurate filter in the pipeline, we guarantee that any document string passing the pipeline is a dictionary member. Suppose we have d candidate strings, and let f̌_r^i and f̌_c^i be the filtering ratio and cost of filter f̌_i in the filter pipeline, and v̌_c^p be the verification cost. We represent the total cost of checking the d strings using the filter pipeline as

Σ_{i=1}^{k} f̌_c^i·W_i + v̌_c^p·W_{k+1}

where W_1 = d and W_i = d·Π_{j=1}^{i−1} (1 − f̌_r^j) for i > 1.

To build the optimal filter pipeline, we search a space of (Σ_{k=1}^{n} k!·C(n,k))·m pipeline plans, in which n is the number of filters and m is the number of verification operators.

B. Cost Estimation

To search for the optimal pipeline, we need to estimate the time cost of each filter pipeline plan over all the document strings. As in our approach to estimating the cost of single filter plans, we use sequential sampling to estimate the filtering ratio and the filtering cost of each filter in a pipeline.

C. Searching for the Optimal Pipeline

1) Basic Approach: The brute force approach to searching for the optimal pipeline is to estimate the time cost of using every possible plan for the document strings and output the plan with the least time cost. The time cost of searching for the optimal pipeline in a space of n filters and m verification operators is O((Σ_{k=1}^{n} k!·C(n,k))·m·s_cost). We improve the search performance by iteratively exploring the plan space using an improved approach, as in Algorithm 1. We first build filter pipelines without considering the verification operators and then we select one verification operator for each pipeline. Finally, we output the pipeline, including one verification operator, with the least estimated cost as the optimal filter pipeline.
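The pipeline cost model above is straightforward to evaluate for a candidate plan. A minimal sketch (the function name is hypothetical; each filter is given as an estimated (f_c, f_r) pair, and the running survivor count plays the role of W_i):

```python
def pipeline_cost(d, filters, verify_cost):
    """Estimated cost of checking d strings through a filter pipeline.

    filters: list of (f_c, f_r) pairs, in pipeline order; f_r is the
    fraction of input strings each filter eliminates.  Implements
        sum_i f_c^i * W_i + c_v * W_{k+1},
    with W_1 = d and W_i = d * prod_{j<i} (1 - f_r^j).
    """
    cost, w = 0.0, float(d)
    for f_c, f_r in filters:
        cost += f_c * w       # pay the filter on all surviving strings
        w *= (1 - f_r)        # only the non-eliminated strings continue
    cost += verify_cost * w   # exact verification on the final survivors
    return cost
```

For example, with d = 1000, a single filter with f_c = 0.001 and f_r = 0.9, and a per-string verification cost of 0.1, the plan costs 1 + 10 = 11 time units, versus 100 units for verification alone, which is the quantitative sense in which a cheap, selective filter pays for itself.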
Algorithm 1: Search for the optimal pipeline
  input : n filters F, m verification operators V, and the document strings doc
  output: filter pipeline P
  set P_1 = {1-filter pipelines};
  for k = 2, ..., n do
      set G_k = {all k-filter pipelines};
      foreach g_k ∈ G_k do
          g_{k−1} = g_k − last filter;            /* remove the last filter */
          if g_{k−1} ∉ P_{k−1} then
              remove g_k from G_k;
          else
              c(g_k) = estimate_cost(g_k, doc);   /* estimate the cost of g_k by sampling doc */
      foreach g_k ∈ G_k do
          if c(g_k) = min{c(q_k) | q_k ∈ G_k and q_k contains the same set of filters as g_k} then
              add g_k into P_k;
  set cost = ∞;
  foreach pipeline P' ∈ ∪_{k=1}^{n} P_k do
      foreach verification operator v ∈ V do
          c = estimate_cost(P' + v, doc);
          if c < cost then
              cost = c and P = P' + v;
  return P;

To build the pipeline with the set of n filters, in the first pass of Algorithm 1, we estimate the cost of using every 1-filter pipeline and keep them in a set P_1. A subsequent pass, say pass k, consists of two phases. First, we generate all the candidate optimal k-filter pipelines from the estimated optimal (k−1)-filter pipelines, based on the property that the first (k−1)-filter sub-pipeline of each optimal k-filter pipeline is also optimal. The correctness is guaranteed by the fact that the filter ordering in a pipeline with a fixed set of filters does not affect the strings passing the pipeline. Therefore, the sub-pipeline of the first (k−1) filters in each optimal k-filter pipeline must be an optimal (k−1)-filter pipeline for its set of filters. Next, we estimate the time cost of each candidate k-filter pipeline and keep the pipeline with the least cost for each distinct set of k filters in P_k. Then, we select verification operators to add to the pipelines. For each pipeline, we estimate the cost of the pipeline plus every verification operator. Finally, we output the pipeline plan (including a verification operator) having the least estimated cost as the optimal filter pipeline plan for the n filters and m verification operators.
To build the filter pipeline, in the first pass, we estimate the costs of C(n,1)·(n−1) 2-filter pipelines. In the k-th pass, we estimate the costs of C(n,k)·(n−k) (k+1)-filter pipelines. Therefore, the total number of samplings is O(Σ_{k=1}^{n} C(n,k)·(n−k)). To add the verification operators, we estimate the cost of O(Σ_{k=1}^{n} C(n,k)·m) pipelines. The total time cost of the algorithm is O((Σ_{k=1}^{n} C(n,k)·m + Σ_{k=1}^{n} C(n,k)·(n−k))·s_cost). Even though the improved approach is more efficient in searching for the optimal filter pipeline than the brute force approach, it is still an exponential algorithm. Unfortunately, it is unlikely we can do better with an optimal algorithm, for as we prove next, the problem of searching for an optimal filter pipeline is NP-hard. To prove that this general optimal plan searching problem is NP-hard, we first consider a simpler version of the searching problem, in which we assume that any filter spends the same time checking each string. We call this the uniform cost assumption.

Theorem 1: Suppose we have a document of strings, n candidate filters and m verification operators. If the uniform cost assumption holds for the filters and verification operators, it is NP-hard to determine the optimal filter pipeline.

Proof: In the appendix we show that the so-called Filter Coverage problem is NP-hard. Filter Coverage is a formalization of the problem at hand, and hence we get the desired result.

As we have shown that the simpler problem of searching for the optimal filter pipeline under the uniform cost assumption is NP-hard, the general searching problem is also NP-hard.

2) Approximation Algorithms: To solve the optimization problem efficiently, we design an approximation algorithm, Algorithm 2. Algorithm 2 consists of two stages: the building stage and the refining stage. During the building stage, Algorithm 2 takes iterations to build the filter pipeline incrementally based on a greedy selection strategy.
In each iteration, we estimate the filtering ratio and the filtering cost of each filter or verification operator after applying the filters already in the pipeline. Then we select the filter or verification operator with the largest ratio of filtering ratio to filtering cost and add it to the filter pipeline. We continue to add filters into the pipeline until a verification operator is selected. In the refining stage, we start from the last filter in the pipeline and estimate whether we can get better performance by removing that filter from the pipeline. If removing it improves the estimated cost of the pipeline, we remove it. We continue checking and removing until we can no longer improve the estimated performance. During the building stage, in the worst case, we need (n+1) iterations to construct the filter pipeline using all the filters and one verification operator. In iteration 1, we make (n+m) samplings to estimate the cost of using each filter and verification operator to check the document strings. During iteration k, we estimate the cost of extending the pipeline with each of the (n-k+1) remaining filters and the m verification operators; in this iteration, we conduct (n-k+1+m) samplings. Accordingly, the total number of samplings in the building stage is (n^2+n)/2 + (n+1)·m. In the refining stage, we make at most n sampling estimations. Therefore, the time complexity of Algorithm 2 is O((n^2 + n·m) · s_cost).
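The greedy choice at the heart of the building stage picks, in each iteration, the operator with the largest filtering-ratio-to-cost score. A sketch of just that selection step, assuming the cost and ratio estimates have already been obtained by sampling (names hypothetical):

```java
import java.util.Map;

public class GreedySelect {
    // Pick the operator with the largest score r(o)/c(o) among candidates.
    // ratios: estimated fraction of strings each operator eliminates
    // costs:  estimated per-string checking cost of each operator
    public static String pickBest(Map<String, Double> ratios,
                                  Map<String, Double> costs) {
        String best = null;
        double bestScore = 0.0;
        for (String op : ratios.keySet()) {
            double score = ratios.get(op) / costs.get(op);
            if (score > bestScore) {
                bestScore = score;
                best = op;
            }
        }
        return best;
    }
}
```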
Algorithm 2: search for optimal pipeline
    input : n filters F, m verification operators V, and document strings doc
    output: a filter pipeline P

    set P = {};
    while P contains no verification operator do
        set best = 0; set op = null;
        foreach o in (F union V) with o not in P do
            c(o), r(o) = estimate_cost_ratio(o, doc);
                /* estimate the filtering cost c(o) and ratio r(o) of o after applying P over doc */
            if r(o)/c(o) > best then
                set best = r(o)/c(o) and op = o;
        if op in V then
            for (i = 0; i < P.size; i++) do          /* refining stage */
                P' = P - last filter;
                c  = estimate_cost(P + op, doc);
                c' = estimate_cost(P' + op, doc);
                if c' < c then set P = P';
            add op to P and return P;
        else
            add op to P;

When the uniform cost assumption holds for all the filters and verification operators, we have the following theorem on the optimality of the filter pipeline generated by the greedy selection used in Algorithm 2.

Theorem 2: Assume the uniform cost assumption holds for all the filters and verification operators, and assume that after each filter application the number of dictionary strings in the document is no more than the number of non-dictionary strings. Then the greedy selection in Algorithm 2 achieves a 10-approximation for the optimal filter pipeline problem.

Proof: Under the above assumptions, the algorithm given in the appendix is the same as Algorithm 2, and hence we can apply Theorem 5 (see Appendix). Hence we get the desired approximation guarantee.

The assumption that the number of dictionary strings in the document is less than the number of non-dictionary strings is in fact true in most practical scenarios. We note that this approximation algorithm still takes a quadratic number of sampling estimations, which could be expensive for an optimization step of the string membership checking problem. Correspondingly, we present another approximation algorithm in Algorithm 3.
The motivation for Algorithm 3 is that a good pipeline should have filters with high filtering ratio and low filtering cost at the front.

Algorithm 3: search for optimal pipeline
    input : n filters F, m verification operators V, and document strings doc
    output: a filter pipeline P

    set P = {}; set L = {};
    foreach o in (F union V) do
        if o in F then
            c(o), r(o) = estimate_cost_ratio(o, doc);
        else
            c(o) = estimate_cost(o, doc);
        add o to L;
    sort L in decreasing order of theta(o), where
        theta(o) = r(o)/c(o) if o in F, and theta(o) = 1/c(o) if o in V;
    while L is not empty do
        select o in L with the largest theta(o);
        if o in F then
            remove o from L; add o to P;
        else
            add o to P and return P;

To be specific, in Algorithm 3 we first estimate the filtering ratio and filtering cost of each filter independently. Then we incrementally add string filters and verification operators to the pipeline in decreasing order of theta, where theta is the ratio of filtering ratio to filtering cost for a filter, and the ratio of 1 to the filtering cost for a verification operator. We stop when we add a verification operator to the pipeline. In Algorithm 3, we need (n+m) samplings to estimate the filtering ratio and filtering cost of each filter and verification operator. The cost of ordering the filters and verification operators based on the sampling is O((n+m)·log(n+m)). Then we select at most (n+1) filters and verification operators to build the pipeline. Therefore, the total cost of the algorithm is O((n+m)·s_cost + (n+m)·log(n+m)).

Suppose the uniform cost assumption holds for all the filters, and the string data distribution before applying a filter f is the same as that after applying f. In other words, no filter in a pipeline affects the filtering ratio or the filtering cost of any filter appearing later in the pipeline. We then say the filters are independent of each other.
In this case, we can accurately estimate the filtering ratio and the filtering cost of each filter independently, incrementally add filters to the pipeline, and stop when we meet a verification operator. Theorem 3 shows that Algorithm 3 finds the optimal filter pipeline when the filters are independent and the uniform cost assumption holds for the verification operators.

Theorem 3: Assume we have a set F of n independent filters and a set V of m verification operators. The optimal time cost of checking a set of document strings via a filter pipeline built from these filters and verification operators is achieved by placing the operators in decreasing order of theta(o), where

    theta(o) = f_r(o)/f_c(o) if o in F, and theta(o) = 1/f_c(o) if o in V.

Proof: Observe that interchanging the order of multiple filters can be decomposed into pairwise interchanges of adjacent filters. Without loss of generality, assume we have the optimal filter pipeline P built using the n filters and m verification operators, and that filter o_i appears immediately before filter o_j in P. If f_r(o_i)/f_c(o_i) < f_r(o_j)/f_c(o_j), then we can interchange the order of o_i and o_j and decrease the time cost of using the filter pipeline P, contradicting the optimality of P. Therefore, all the filters in the optimal filter pipeline must be ordered by decreasing ratio of filtering ratio to filtering cost. Similarly, for any filter o_i appearing before a verification operator o_j in the optimal filter pipeline, we must have f_r(o_i)/f_c(o_i) > 1/f_c(o_j); otherwise, interchanging o_i and o_j would not increase the cost, again contradicting optimality.

As we have proven, Algorithm 2 is an approximation algorithm with a provable bound under its assumptions, and under other conditions Algorithm 3 is optimal. When these conditions do not hold, both algorithms become heuristics, and we evaluate their performance in the next section.

VI. PERFORMANCE STUDY

In this section we conduct experiments to explore the optimization approach to the approximate string membership problem. We first explore the performance of various filters and filter pipelines (to provide evidence of the need for an optimization-based approach), and then explore the performance of our estimation and optimization algorithms. We do this with both synthetic and real data sets.

A. Experiment Setup

We ran all the experiments on a 2.4 GHz Intel Core Duo PC with 3GB of memory.
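Under the independence assumption of Theorem 3, pipeline construction reduces to sorting by theta. A sketch of that ordering rule, with filters and verification operators modeled uniformly (the Op record and its fields are illustrative assumptions, not the paper's code):

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

public class ThetaOrder {
    // An operator: a filter has an estimated filtering ratio and cost;
    // a verification operator contributes theta = 1/cost and ends the pipeline.
    record Op(String name, boolean isFilter, double ratio, double cost) {
        double theta() { return isFilter ? ratio / cost : 1.0 / cost; }
    }

    // Order operators by decreasing theta and cut the list at the first
    // verification operator, mirroring the stopping rule of Algorithm 3.
    static List<String> buildPipeline(List<Op> ops) {
        List<Op> sorted = new ArrayList<>(ops);
        sorted.sort(Comparator.comparingDouble(Op::theta).reversed());
        List<String> pipeline = new ArrayList<>();
        for (Op o : sorted) {
            pipeline.add(o.name());
            if (!o.isFilter()) break;   // stop once a verification operator is added
        }
        return pipeline;
    }
}
```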
All the string filters and the membership checking were implemented in Java. We use time cost to measure ASMC performance. The cost of ASMC may consist of three components: optimization cost (only if we use an optimization-based approach), filtering cost (only if string filters are used), and verification cost.

1) Data Sets: To investigate the performance of the various filters and pipelines of filters, we used both synthetic and real data sets in the experiments.

Synthetic Data: We designed a data generator consisting of a dictionary generator and a document generator, which can be customized using two types of parameters: shared parameters, which specify correlations between the generated dictionary and documents; and exclusive parameters, which specify the individual properties of the dictionary generator and document generator.

    Exclusive parameters: Token Space Coverage; Token Distribution; Number of Strings; Average (Max, Min) Length
    Shared parameters: Overall Token Space; Token Space Overlap; Overlap Strings

Token Space Coverage refers to the set of tokens that a dictionary or document generator can use from the Overall Token Space, the universal token space used in the data generator. The Token Distribution refers to the token frequency distribution in the document and dictionary strings. Token Space Overlap refers to the overlap ratio of the dictionary and document token spaces. We refer to the dictionary strings appearing in the documents as Overlap Strings.

We set default values for the synthetic data generator parameters as follows. For the dictionary generator: Token Space Coverage (K); Token Distribution (normal); Number of Strings (M); Average, Min and Max Length (10, 5, 15). This configuration generates a dictionary of M strings composed of K distinct tokens, where the frequencies of all the tokens in the dictionary follow a standard normal distribution. The average, minimum and maximum lengths of the dictionary strings are 10, 5 and 15.
Similarly, we specify the default values for the document generator as: Token Space Coverage (K); Token Distribution (normal); Number of Strings (K); Average, Min and Max Length (20, 5, 25). Furthermore, we set the Overall Token Space to contain K tokens, the Token Space Overlap to be 0.5, and the number of Overlap Strings to be 5 by default.

Real Data: We used a snapshot of the crawled DBLife [1] data (web pages) as the documents and a list of paper titles extracted from the DBLP Bibliography [29] as the dictionary. The DBLife data consists of K web pages, and the total data size is approximately 48MB. Many web pages, e.g., researchers' homepages, contain publication mentions, which include paper titles. The dictionary consists of 2K paper titles with a total size of .9MB, and it contains about 86K distinct tokens.

2) Filters and Filter Pipeline: We implemented four individual string filters and one verification operator; each is assigned an ID for ease of notation, as listed below. We also implemented code to glue them together into pipelines of filters.

Length Filter (1): We implemented the length filter according to [8]. With the length filter, a string extracted from the documents needs to be verified only if the length of the string is in a range determined by the dictionary strings.

Compressed Inverted List Filter (2): We built inverted lists for the dictionary string tokens, and we used a fixed n-bit array for each token to store the hash values of the string ids, rather than the id list itself. By adjusting the bit array size n, we can trade filtering cost for accuracy: a larger n leads to a higher cost and higher accuracy, while a smaller n leads to a lower cost and lower accuracy.

TDF Filter (3): We implemented the token distribution filter from [11]. We set the default token partitioning number b to be a fixed fraction of |D|, where |D| is the number of tokens in the Overall
Token Space. We adjust n according to the memory usage, so that the TDF consumes a similar amount of memory to the other filters in the experiments.

ISH Filter (4): We implemented the ISH filter [6] using prefix signatures. Similar to the inverted lists, ISH stores a list of string signatures for each token rather than string ids. We use k (the number of tokens in a string prefix) and n (the bit-array size for each token) to control the performance and memory usage. We experimented with different values for k and n, and picked k=3 and n=24 since those values worked well in our experiments.

Inverted List Filter (0): This is the only verification operator we considered. We implemented the DivideSkip algorithm [19] to merge the inverted lists of tokens for candidate strings.

Filter Pipeline: We cascade different filters to form a filter pipeline used in the multiple filter approach; e.g., the pipeline plan 3,4,0 refers to applying the TDF filter and the ISH filter, in that order, before the verification operator.

B. Experimental Results

1) Using a Single String Filter: First we explore the performance of the individual filters, to ascertain whether or not they indeed show different performance. The results are shown in Figure 2(a). That graph has a lot of information on it. The three groups of bars correspond to varying numbers of document strings. Each group contains a bar for each of the filters (plus the verification operator; recall that these filters are only approximate, so they cannot be used without being followed by the verification operator). For comparison, we also include the performance of the verification operator alone, indicated by 0. Finally, the bar labeled "opt" corresponds to our optimization procedure described in Section IV-A. The results demonstrate that selecting the correct filter is important, and that our optimization approach was effective, finding the optimal filter with only about 2% overhead. Figure 2(b) shows the same information while varying the token space overlap.
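As an illustration of how lightweight a filter can be, the length filter of [8] reduces to a precomputed range check. A sketch, assuming a fixed edit-distance threshold k (the class and its threshold handling are illustrative, not the paper's implementation):

```java
import java.util.List;

public class LengthFilter {
    private final int min, max;

    // Precompute the admissible length range from the dictionary, widened by
    // the edit-distance threshold k: a string within edit distance k of a
    // dictionary string differs from it in length by at most k.
    public LengthFilter(List<String> dictionary, int k) {
        int lo = Integer.MAX_VALUE, hi = 0;
        for (String d : dictionary) {
            lo = Math.min(lo, d.length());
            hi = Math.max(hi, d.length());
        }
        this.min = Math.max(0, lo - k);
        this.max = hi + k;
    }

    // True means the candidate survives and must still be verified.
    public boolean pass(String candidate) {
        int len = candidate.length();
        return len >= min && len <= max;
    }
}
```

Because the check is a pair of integer comparisons, its per-string filtering cost is tiny; its filtering ratio, however, depends heavily on how varied the dictionary lengths are, which is exactly the kind of data dependence the optimizer must weigh.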
The point made by Figure 2(b) is that the relative performance of the filters changes with different data sets: the relative effectiveness of the filters differs between Figure 2(a) and Figure 2(b), and also differs across the groups within Figure 2(b). Also note that our optimization method is close to optimal in all cases. This lends credence to our belief that no one filter dominates everywhere, and that our optimization-based approach is an effective way to choose filters. Figure 2(c) makes the same point by varying the dictionary token frequency, the average number of occurrences of each dictionary token.

In addition to the synthetic data sets, we also evaluated the ASMC performance using single filters on real data sets, with the crawled web pages in DBLife as the documents and the paper titles extracted from the DBLP Bibliography as the dictionary, in Figures 2(d) and 2(e). The results in Figures 2(d) and 2(e) over our real data set confirm the conclusions we drew from the synthetic data sets.

To summarize, using different string filters for the ASMC problem may result in different performance. There may be no single filter that is superior to the others in all scenarios. Our optimization approach found the optimal filter for the various problem instances we studied on both synthetic and real data sets. More importantly, the optimization overhead is relatively small.

2) Using Multiple Filters: In Figure 3 we move on to consider the case where we use multiple filters in a pipeline (rather than a single filter plus a verification operator). We wish to explore (a) whether a pipeline of filters can indeed perform better than a single filter, (b) whether the order of filters in a pipeline affects performance, (c) whether the best order for a set of filters is constant across different problem instances, and finally (d) how well our various optimization algorithms perform in this environment. In Figure 3(a), we explore different pipelines on different document sizes.
We see that on these data sets, adding filters improves performance, and that using all of the filters is fastest. Figure 3(b) considers different orders of filters; it shows that performance is indeed sensitive to order. These two points suggest that an optimization-based approach might be effective. Hence, in Figures 3(c) and 3(d), we evaluate the performance of the optimization approach in searching for the optimal filter pipeline. The basic approach (Algorithm 1) to our optimization problem is guaranteed to find the optimal filter pipeline when the sampling-based estimation is accurate. However, it requires an exponential number of sampling estimations. For example, even with 4 filters, we need to make more than 30 estimations, and each estimation may take 1%-2% of the overall cost. Making 30 estimations may take longer than just running a reasonable plan. Therefore, we did not consider the basic approach in our performance study. We only evaluated the performance of the two approximation algorithms (Algorithm 2 and Algorithm 3) for solving the optimization problem, represented as Optimization_A1 and Optimization_A2 in Figures 3(c) and 3(d). For comparison, we also show the best performance of the pipelines with 1, 2, or 4 filters before verification and the worst performance of the 4-filter pipeline. In Figure 3(c), we vary the number of document strings from 1K to 10K. We find that both approximation algorithms can find the optimal filter pipeline. However, the optimization overhead of Optimization_A1 is large, i.e., more than 30% of the overall cost. Due to this large overhead, Optimization_A1 only achieves performance similar to using the best 2-filter pipeline; still, its performance is much better than that of the worst 4-filter pipeline. Optimization_A2 not only finds the optimal filter pipeline, but also adds only a small amount of optimization cost, i.e., about 1% of the overall cost.
We get similar results when we vary the token space overlap from 0.2 to 0.8 in Figure 3(d). Both approximation algorithms in the optimization approach can find the optimal filter pipeline for each problem instance, and Optimization_A2 requires much less cost in searching for the optimal filter pipeline than Optimization_A1. Besides the synthetic data sets, we evaluated the effectiveness of the optimization approach on real data sets, with the
crawled web pages in DBLife as the documents and the paper titles extracted from the DBLP Bibliography as the dictionary. In Figure 3(e), we show the best performance of using 1, 2, and 4 filters before verification, and also the worst performance of using 4 filters, to compare with the performance achieved by the optimization approach using the two approximation algorithms. We vary the number of documents from 1K to 10K, while keeping the dictionary size at K. We get similar results on the real data sets as in Figure 3(c): the two approximation algorithms indeed find the optimal filter pipeline in each case; Optimization_A2 achieves the second best performance over all the approaches; and the overhead of the optimization is relatively small compared with the overall time cost.

[Fig. 2. ASMC Performance Using Single Filters: (a) Varying Document Size; (b) Varying Token Space Overlap; (c) Varying Dictionary Token Frequency; (d) Varying Document Size; (e) Varying Dictionary Size]

To summarize, our experiments show that not only the set of filters but also their ordering in the pipeline is critical to ASMC performance. The two approximation algorithms we proposed for the optimization problem work well on the many problem instances we studied on both the synthetic and real data sets. In particular, in our experiments Algorithm 3 found the optimal pipeline with acceptable overhead.

VII.
CONCLUSION

The filter-verify approach to the ASMC problem relies upon a powerful but simple observation: it is often possible to find a relatively simple characterization of a set of strings D that can be used to efficiently determine that a candidate string s cannot possibly be an approximate match for any string in D. A wide variety of such characterizations are possible; each leads to a different filter. Furthermore, since different characterizations capitalize on different aspects of the strings, different filters are effective for different instances of the string membership problem.

The multi-filter framework we advocate in this paper exploits this observation to arrive at what in some sense amounts to a toolkit approach: when faced with an instance of the string membership problem, one looks into one's toolkit of filters and tries to assemble a pipeline of these filters that will work well on the specific problem at hand. Of course, choosing the appropriate filters and assembling them in the proper order is in general a non-trivial (NP-hard) problem; doing so automatically is an optimization problem that requires some mechanism for estimating the cost and effectiveness of possibly many different potential filtering pipelines.

In this paper we have explored and evaluated this new multi-filter, optimization-based approach. Considering the different estimation costs in the optimization approach, we designed two different approximation algorithms to solve the optimization problem. We have found through experiments with synthetic and real data sets that our optimization approach over a toolkit of filters performs better overall than any of the filters individually, or even any statically chosen combination of filters. We regard this as promising evidence that the multi-filter, optimization-based approach has merit. A great deal of room for future work remains.
Certainly the set of filters we considered does not exhaust the space of all possible filters; characterizing what constitutes a complete set of filters for the toolkit, and possibly devising new filters to achieve this completeness, is an interesting and challenging task. Also, while our optimization techniques were effective in our experiments, it is possible that more dynamic approaches
that route candidate strings through networks of filters may outperform single pipelines of filters in interesting cases. Investigating such approaches is also an interesting direction for exploration.

[Fig. 3. ASMC Performance Using Multiple Filters: (a) Varying Document Size; (b) Varying Document Size; (c) Varying Document Size; (d) Varying Token Space Overlap; (e) Varying Document Size]

REFERENCES

[1] DBLife.
[2] A. V. Aho and M. J. Corasick, Efficient string matching: an aid to bibliographic search, in Commun. ACM, 1975.
[3] B. H. Bloom, Space/time trade-offs in hash coding with allowable errors, in Commun. ACM, 1970.
[4] A. Amir, D. Keselman, G. M. Landau, M. Lewenstein, N. Lewenstein, and M. Rodeh, Text indexing and dictionary matching with one error, in J. Algorithms, 2000.
[5] A. N. Arslan and O. Egecioglu, Dictionary look-up within small edit distance, in COCOON, 2002.
[6] K. Chakrabarti, S. Chaudhuri, V. Ganti, and D. Xin, An efficient filter for approximate membership checking, in SIGMOD, 2008.
[7] S. Chaudhuri, V. Ganti, and R. Kaushik, A primitive operator for similarity joins in data cleaning, in ICDE, 2006.
[8] L. Gravano, P. G. Ipeirotis, H. V. Jagadish, N. Koudas, S. Muthukrishnan, and D. Srivastava, Approximate string joins in a database (almost) for free, in VLDB, 2001.
[9] C. Xiao, W. Wang, and X.
Lin, Ed-join: an efficient algorithm for similarity joins with edit distance constraints, in PVLDB, 2008.
[10] A. Gionis, P. Indyk, and R. Motwani, Similarity search in high dimensions via hashing, in VLDB, 1999.
[11] C. Sun and J. Naughton, The token distribution filter for approximate string membership checking, in WebDB, 2011.
[12] J. M. Hellerstein, Practical predicate placement, in SIGMOD Rec., 1994.
[13] J. M. Hellerstein and M. Stonebraker, Predicate migration: optimizing queries with expensive predicates, in SIGMOD Rec., 1993.
[14] S. Babu, R. Motwani, K. Munagala, I. Nishizawa, and J. Widom, Adaptive ordering of pipelined stream filters, in SIGMOD, 2004.
[15] K. Munagala, S. Babu, R. Motwani, and J. Widom, The pipelined set cover problem, in ICDT, 2005.
[16] G. Navarro, A guided tour to approximate string matching, in ACM Computing Surveys, 2001.
[17] A. Arasu, V. Ganti, and R. Kaushik, Efficient exact set-similarity joins, in VLDB, 2006.
[18] A. Chandel, P. C. Nagesh, and S. Sarawagi, Efficient batch top-k search for dictionary-based entity recognition, in ICDE, 2006.
[19] C. Li, J. Lu, and Y. Lu, Efficient merging and filtering algorithms for approximate string searches, in ICDE, 2008.
[20] G. Li, D. Deng, and J. Feng, Faerie: efficient filtering algorithms for approximate dictionary-based entity extraction, in SIGMOD, 2011.
[21] N. Koudas, A. Marathe, and D. Srivastava, Flexible string matching against large databases in practice, in VLDB, 2004.
[22] M. Hadjieleftheriou, A. Chandel, N. Koudas, and D. Srivastava, Fast indexes and algorithms for set similarity selection queries, in ICDE, 2008.
[23] P. G. Selinger, M. M. Astrahan, D. D. Chamberlin, R. A. Lorie, and T. G. Price, Access path selection in a relational database management system, in SIGMOD, 1979.
[24] U. Feige, L. Lovász, and P. Tetali, Approximating min-sum set cover, 2003.
[25] V. Poosala, P. J. Haas, Y. E. Ioannidis, and E. J. Shekita, Improved histograms for selectivity estimation of range predicates, in SIGMOD Rec., 1996.
[26] P. J. Haas and A. N.
Swami, Sequential sampling procedures for query size estimation, in SIGMOD, 1992.
[27] W.-C. Hou, G. Ozsoyoglu, and B. K. Taneja, Statistical estimators for relational algebra expressions, in PODS, 1988.
[28] R. J. Lipton, J. F. Naughton, and D. A. Schneider, Practical selectivity estimation through adaptive sampling, in SIGMOD Rec., 1990.
[29] DBLP, "ley/db".

APPENDIX

For a formal treatment, we abstract the optimization problem under the uniform cost assumption as the minimum cost filter coverage problem, described as follows. In the Filter Coverage problem we are given a set U along with a subset D of U, which we call the dictionary. We also have a set of filters F and verifiers V. Each f in F has an associated set S_f contained in U\D and a processing cost (per unit element) c_f in R+. Each verifier v in V has an associated processing cost c_v in R+, and the set associated with v is all of U\D. For
ease of notation, in places we will denote the set and the cost associated with f as S(f) and c(f) respectively. Write the filtering cost of applying a set of k filters, F' = {f_i}, in order sigma as

    c(F', sigma) := sum_{i=1}^{k} c(f_sigma(i)) · |U \ union_{j=1}^{i-1} S(f_sigma(j))|.

Also, the filtering cost of applying F' in order sigma followed by verifier v, denoted c(F', sigma, v), is defined to be

    c(F', sigma, v) := c(F', sigma) + c_v · |U \ union_{i=1}^{k} S(f_i)|.

The goal of the minimum cost filter coverage problem is to select a subset of filters F' of F, an ordering sigma over F', and a verifier v in V such that the filtering cost c(F', sigma, v) is minimized. Overall, an instance of the Filter Coverage problem is specified as (U, D, F, V). Next we show that the problem is in fact NP-hard.

Theorem 4: The Filter Coverage problem is NP-hard.

Proof: We reduce the Pipelined Set Cover problem [15] to Filter Coverage. An instance of the Pipelined Set Cover problem consists of a set of elements U' and a collection of subsets of U', A' = {S_1, ..., S_k}. For each i in [k], a processing cost c_i is associated with set S_i. In the decision version of the problem, the objective is to determine whether there exists an ordering pi over the sets such that the pipelined cost is no more than T. Here the pipelined cost is defined to be sum_{i=1}^{k} c_pi(i) · |U' \ union_{j=1}^{i-1} S_pi(j)|.

Given an instance (U', A') of the Pipelined Set Cover problem, we construct an instance (U, D, F, V) of the Filter Coverage problem as follows. Set U = U' and D = {}, and construct the set of filters from A': in particular, for each set S_i we construct a filter f_i by setting S(f_i) = S_i and c(f_i) = c_i. We construct a single verifier v with c_v = T+1; by definition we have S(v) = U\D = U. In the decision version we ask whether there exists a solution (F', sigma, v) with filtering cost no more than T. Next we show that a solution (F', sigma, v) of filtering cost at most T exists iff there is an ordering pi of pipelined cost at most T.

In the forward direction, say there exists a solution (F', sigma, v) of cost no more than T. Note that we must have union_{f in F'} S(f) = U.
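The filtering cost c(F', sigma) defined above can be computed mechanically from the filter sets. A sketch under the same abstraction, with elements represented as integers (names hypothetical):

```java
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class FilterCoverageCost {
    // c(F', sigma): each filter pays its unit cost for every element of U
    // not yet covered by the sets of the earlier filters in the order.
    static double filteringCost(List<String> order,
                                Map<String, Set<Integer>> filterSets,
                                Map<String, Double> unitCosts,
                                Set<Integer> universe) {
        Set<Integer> covered = new HashSet<>();
        double cost = 0.0;
        for (String f : order) {
            cost += unitCosts.get(f) * (universe.size() - covered.size());
            covered.addAll(filterSets.get(f));
        }
        return cost;
    }
}
```

For example, with U = {1,2,3,4}, S(f1) = {1,2}, S(f2) = {3}, and unit costs of 1, the order (f1, f2) costs 1·4 + 1·2 = 6.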
Otherwise, U \ union_{f in F'} S(f) is nonempty and the processing cost of the verifier by itself would be at least T+1, contradicting the fact that the overall filtering cost is no more than T. This gives us a pipelined solution pi as follows: initially set pi to be the sets corresponding to the filters in F', ordered as in sigma, and then augment it with the sets of the remaining filters in F \ F', in any order. We have union_{f in F'} S(f) = U and the filtering cost is no more than T. The sets appended after those of F' have no elements left to cover, hence their processing cost in pi is zero; overall, this implies that the pipelined cost of pi is no more than T.

To prove the other direction, given pi with cost no more than T, we consider (F, pi, v). In particular, union_{f in F} S(f) = U, hence the verifier processes no elements. Since the pipelined cost is no more than T, we have c(F, pi) <= T, which in turn implies that the filtering cost of (F, pi, v) is no more than T. Hence we have the desired claim.

Next we present a greedy approximation algorithm for the Filter Coverage problem and show that it achieves an approximation factor of 10. Given an instance of the Filter Coverage problem (U, D, F, V), say |D| = n and |U\D| = m. Note that the dictionary D is disjoint from the sets associated with the given filters and verifiers, and hence will be processed by all of them. We essentially apply a set-cover-like greedy algorithm as long as the number of uncovered elements not in the dictionary is at least n; after that, we pick a verifier with minimum processing cost c_v. In particular, say at step i we have M_i uncovered elements which are not in the dictionary (initially M_1 = m). If M_i >= n, we pick the filter (or verifier) which minimizes the cost ratio c_j/m_j, where m_j is the number of uncovered elements in the set associated with the filter and c_j is the processing cost of the filter. On the other hand, if M_i < n, we select a verifier with the smallest c_v value. We have the following theorem stating the approximation factor achieved by the algorithm.
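The greedy step of this algorithm picks the filter minimizing c_j/m_j against the current uncovered set. A sketch of that step (illustrative names; the verifier fallback for M_i < n is omitted):

```java
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class GreedyFilterCoverage {
    // One greedy step: among the filters, pick the one minimizing c_j / m_j,
    // where m_j is the number of still-uncovered elements in the filter's
    // associated set (filters covering nothing new are skipped).
    static String pickFilter(Map<String, Set<Integer>> filterSets,
                             Map<String, Double> costs,
                             Set<Integer> uncovered) {
        String best = null;
        double bestRatio = Double.POSITIVE_INFINITY;
        for (String f : filterSets.keySet()) {
            Set<Integer> fresh = new HashSet<>(filterSets.get(f));
            fresh.retainAll(uncovered);           // m_j: elements newly covered by f
            if (fresh.isEmpty()) continue;
            double ratio = costs.get(f) / fresh.size();
            if (ratio < bestRatio) {
                bestRatio = ratio;
                best = f;
            }
        }
        return best;
    }
}
```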
Theorem 5: The above algorithm achieves an approximation factor of 10 for the Filter Coverage problem.

Proof: Let (F', sigma, v) be the solution generated by the algorithm. Say |F'| = k and, without loss of generality, assume that the filters are selected by the algorithm in order, that is, f_1 through f_k. Note that the set being processed at step i has cardinality M_i + n, and hence we get c(F', sigma) = sum_{i=1}^{k} c_i·(M_i + n), where c_i is the processing cost of filter f_i. Moreover, whenever we select a filter we have M_i >= n, and hence c(F', sigma) <= 2·sum_{i=1}^{k} c_i·M_i.

Now consider the Pipelined Set Cover [15] instance (U\D, A'), where A' is the collection of sets associated with the filters and verifiers. Say the optimal pipelined cost of this instance is O_p. Note that an optimal solution (F*, sigma*, v*) of the Filter Coverage problem can be transformed into a solution for the Pipelined Set Cover problem: we simply extend sigma* by augmenting it with the sets of the filters and verifiers in (F union V) \ (F* + v*), in any order. Also, S(v*) = U\D, hence the processing cost of the augmented sets is zero. We note that the pipelined cost of (F*, sigma*, v*) is no more than the filtering cost of (F*, sigma*, v*): the filtering cost includes processing all of U, whereas the pipelined cost includes processing only U\D. Write O_f = c(F*, sigma*, v*); we have O_f >= O_p.

The relevant observation is that F', the initial set of filters selected by the algorithm, would be the same if the greedy 4-approximation algorithm for Pipelined Set Cover [15] were applied to the instance (U\D, A'). The processing cost of U\D under F' is sum_{i=1}^{k} c_i·M_i; since this is part of the greedy 4-approximate solution for the Pipelined Set Cover problem, we have sum_{i=1}^{k} c_i·M_i <= 4·O_p, and therefore sum_{i=1}^{k} c_i·M_i <= 4·O_f. As stated before, c(F', sigma) <= 2·sum_{i=1}^{k} c_i·M_i, and hence c(F', sigma) <= 8·O_f.

Finally, we account for the cost incurred by the verifier. Say it is applied after k filters; at that time we have M_{k+1} < n. Hence its processing cost, c_v·(M_{k+1} + n), is no more than 2·c_v·n. By selection we have c_v <= c_{v*}.
The dictionary D is processed by the verifier in the optimal solution, hence O_f ≥ c_{v*} n. Overall we get that c_v (M_{k+1} + n) ≤ 2 O_f. Since the filtering cost of the solution generated by the algorithm is equal to c(F, σ) plus the processing cost of the verifier, we get that the algorithm achieves a cost of no more than 10 O_f, implying an approximation factor of 10.
