A Partition-Based Efficient Algorithm for Large Scale Multiple-Strings Matching

Ping Liu, Jianlong Tan, Yanbing Liu
Software Division, Institute of Computing Technology, Chinese Academy of Sciences, Beijing

Abstract
Filtering plays an important role in Internet security and information retrieval, and usually employs a multiple-strings matching algorithm as its key component. All the classical matching algorithms, however, perform poorly when the number of keywords exceeds 5000, which makes large scale multiple-strings matching a great challenge. Based on the observation that the speed of the classical algorithms depends on the minimal length of the keywords, we propose a partition strategy that decomposes the keyword set into a series of subsets, on each of which a classical algorithm is run. We prove that in the optimal partition, keywords of the same length fall into one subset, and the keyword lengths of different subsets do not interlace. We then propose a shortest-path model for the optimal partition problem. Experiments on both a random dataset and the ClamAV dataset demonstrate that our algorithm works much better than the classical ones.

Key Words: large scale multiple-strings matching, partition, shortest path

1 Introduction
With the development of the Internet, more and more information, bad along with good, has appeared and congested the network. To secure the Internet and retrieve useful information, filtering systems are designed and deployed on gateways to filter out unwanted content. A filtering system usually employs a string matching procedure as its key part, and always contains a large keyword set to cover various focuses. Hence, designing an efficient multi-strings matching algorithm for a large scale keyword set is a great challenge.

This paper is supported by the NSFC grant No..

1.1 Related Work
The string matching problem has received extensive research. Most algorithms follow a common procedure: compare the keywords with a substring of the text inside a fixed-length window, then shift the window from left to right as far as possible. According to how the patterns are compared with the text in the window, string matching algorithms can be categorized into three classes: prefix searching, suffix searching and factor searching.

In prefix searching methods, window shifting is accomplished by computing the longest common prefix between the text and the patterns. There are two ways to compute the length of the longest prefix. The first is to compute the longest suffix of the text read so far that is also a prefix of the pattern, typified by the famous Knuth-Morris-Pratt (KMP) algorithm [2] and the Aho-Corasick algorithm [3]. The second maintains the set of keyword prefixes that are also suffixes of the text read so far, and updates this set at each character read. This is what the Shift-And and Shift-Or algorithms do [4].

The suffix searching approach searches backward, from right to left within the window, for a suffix of the window that is also a suffix of a keyword. The Boyer-Moore algorithm [5] is of this kind, and so are the Commentz-Walter algorithm [6] and the Wu-Manber algorithm [7].

Factor searching, which yields the most efficient algorithms in practice for long keywords, can be seen as a combination of prefix searching and suffix searching: it searches backward in the window like suffix searching, but looks for the longest suffix of the window that is also a factor of a keyword. The best-known algorithms of this kind are the BDM (Backward DAWG Matching) algorithm [8], the BOM (Backward Oracle Matching) algorithm [9], the SBDM (Set Backward DAWG Matching) algorithm [10] and the SBOM (Set Backward Oracle Matching) algorithm [11].
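To make the second prefix-searching approach concrete, the following is a minimal single-pattern Shift-And sketch in Python. It is an illustration only, not code from this paper: the function name `shift_and` is hypothetical, and a Python integer stands in for the machine-word bit vector. Bit i of `state` is set exactly when the first i+1 characters of the pattern form a suffix of the text read so far, which is the "set of prefixes" updated at each character.

```python
def shift_and(pattern, text):
    """Return the start positions of all occurrences of pattern in text."""
    m = len(pattern)
    if m == 0:
        return []
    # masks[c] has bit i set when pattern[i] == c.
    masks = {}
    for i, c in enumerate(pattern):
        masks[c] = masks.get(c, 0) | (1 << i)
    accept = 1 << (m - 1)  # bit that signals a full match
    state = 0              # bit i set <=> pattern[:i+1] is a suffix of the text read
    hits = []
    for pos, c in enumerate(text):
        # Extend every live prefix by one character, start a new one,
        # then keep only those consistent with the character just read.
        state = ((state << 1) | 1) & masks.get(c, 0)
        if state & accept:
            hits.append(pos - m + 1)
    return hits


print(shift_and("abab", "abababab"))
```

Each text character costs one shift, one OR and one AND, so the scan is linear in the text length as long as the pattern fits in the bit vector.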
The performance of the classical multi-strings matching algorithms is determined mainly by three factors: the number of keywords, the minimal length of the keywords and the size of the alphabet. In addition, the distribution of the keywords in the text also affects the performance. None of the classical algorithms is suitable for a large scale keyword set. Experiments on real data demonstrate that these algorithms perform badly when the number of patterns exceeds 5,000 or the minimal length of the patterns is 4 bytes. In this paper, we propose a partition-based algorithm suitable for filtering with a large scale keyword set.

1.2 Our Contribution
Our contributions in this paper are as follows:

1). We analyze the average time complexity of three representative algorithms, i.e., SBOM, WuManber and advanced Aho-Corasick, and survey the relationship between their speed and the minimal length of the patterns.

2). We propose a partition-based strategy to bound the influence of the shortest keywords on performance. Because the speed of the classical algorithms is tightly related to the minimal length of the keywords, partitioning the keyword set into smaller subsets bounds the influence of the shortest keywords and thus improves performance.

3). We propose a shortest-path model of the optimal partition. Two properties of the optimal partition are proved: keywords of the same length lie in one subset of the optimal partition, and the keyword lengths of different subsets do not interlace. Based on these theorems, a shortest-path model is proposed for the optimal partition.

We implemented our partition-based algorithm as a C program and ran experiments on real data, which demonstrate its efficiency on large scale keyword sets compared with the classical algorithms.

The rest of this paper is organized as follows: Section 2 describes the average time complexity and properties of three classical algorithms; Section 3 proposes the partition-based strategy and proves two theorems on optimal partitions; Section 4 reports the implementation and the experimental results on real data; Section 5 mentions further work.

2 Properties about Performance of Classical Algorithms
In this section, we analyze the average-case time complexity of three representative multi-strings matching algorithms: SBOM, WuManber and advanced Aho-Corasick. Let $\Sigma$ denote the alphabet, $n$ the length of the text, $r$ the number of patterns, and $b$ the block size in the WuManber algorithm. Let $m(S)$, or simply $m$, denote the minimal length of a keyword set $S$, and $M(S)$, or $M$, the maximal one.
Assuming a uniform distribution of text and keywords over the alphabet $\Sigma$, and that $n$ is large enough, it is easy to estimate that the average-case time complexity of the advanced Aho-Corasick algorithm is $O(n)$, that of the WuManber algorithm is

$$O\left(\frac{n}{(m-b+1)\left(1-\frac{(m-b+1)\,r}{2|\Sigma|^{b}}\right)}\right),$$

and that of the SBOM algorithm is

$$O\left(\frac{n\,\log_{|\Sigma|} mr}{m-\log_{|\Sigma|} mr}\right).$$

The above analysis implies the following properties about the speed of the classical algorithms, which are confirmed by experimental results on real data:

Property 1 The main factors affecting the speed of multi-strings matching algorithms are the size of the alphabet, the number of patterns and the minimal length of the patterns. Hence, the matching time can be denoted as $T(r, m)$ if the size of the alphabet is fixed.

Property 2 The matching time of a multi-strings matching algorithm increases monotonically as the number of patterns increases, i.e., $T_r(r, m) > 0$. (See Fig. 1)

Property 3 The matching time of a multi-strings matching algorithm decreases monotonically as the minimal length of the patterns increases, i.e., $T_m(r, m) < 0$. (See Fig. 2)

Property 4 The rate of increase of the matching time with the number of patterns is independent of the number of patterns, and increases when the minimal length decreases, i.e., $T_r(r, m) = H(m) > 0$ and $H'(m) < 0$.

3 A Partition-Based Matching Algorithm
The shortest keywords, though very few in number, have a great effect on the matching time. To bound their influence, an intuitive idea is to decompose the keyword set into a series of smaller subsets, and then choose an appropriate classical matching algorithm to run on each subset. Since the influence of the shortest keywords is confined to a smaller subset rather than the entire set, the sum of the matching times on the individual subsets can be smaller than the time needed to run on the entire set directly.

For a given keyword set $P = \{p_1, p_2, p_3, \ldots, p_n\}$, there are many feasible partitions, among which the optimal one yields the minimal matching time. Here, we assume that the keywords are already sorted by length, i.e., $|p_1| \le |p_2| \le \cdots \le |p_n|$. The optimal partition finding problem can then be defined as follows:

Optimal Partition Finding Problem Given a sorted keyword set $P = \{p_1, p_2, \ldots, p_n\}$, construct a partition $S_1, S_2, \ldots, S_k$ such that $\bigcup_{i=1}^{k} S_i = P$, $S_i \cap S_j = \emptyset$ for all $1 \le i, j \le k$ with $i \ne j$, and $\sum_{i=1}^{k} T(m(S_i), |S_i|)$ is minimized. (From here on, the arguments of $T$ are written in the order minimal length, number of keywords.)

3.1 Properties about Optimal Partition
In the following, two properties of the optimal partition are proved to provide a solid foundation for finding it.
Theorem 1 There exists an optimal partition $S_1, S_2, \ldots, S_k$ of the sorted keyword set $P = \{p_1, p_2, \ldots, p_n\}$ such that for any $i \ne j$, either $\forall a \in S_i, \forall b \in S_j, |a| \le |b|$; or $\forall a \in S_i, \forall b \in S_j, |a| \ge |b|$.

This theorem demonstrates the continuity of the optimal partition, that is, the intervals formed by the subsets, $[m(S_i), M(S_i)]$, do not cross each other. For the sake of simplicity, we use $m_i$ and $m_j$ to denote the minimal lengths of $S_i$ and $S_j$, and $n_i$ and $n_j$ the numbers of keywords in them, respectively.

Proof: Suppose that in an optimal partition $S_1, S_2, \ldots, S_k$ there exist two subsets $S_i$ and $S_j$ with crossing intervals, i.e., $[m(S_i), M(S_i)]$ crosses $[m(S_j), M(S_j)]$. We give the proof for the case $m_i \le m_j$; the case $m_i \ge m_j$ is similar and thus omitted. Two new subsets $S_i'$ and $S_j'$ are constructed by exchanging the longest keyword of $S_i$ with the shortest one in $S_j$. Thus, we have:

$$m_i' = m_i,\quad m_j' \ge m_j;\qquad n_i' = n_i,\quad n_j' = n_j.$$

Hence,

$$\bigl(T(m_i, n_i) + T(m_j, n_j)\bigr) - \bigl(T(m_i', n_i') + T(m_j', n_j')\bigr) = \bigl(T(m_i, n_i) - T(m_i', n_i')\bigr) + \bigl(T(m_j, n_j) - T(m_j', n_j')\bigr) \ge 0$$

(by Property 3 in Section 2, and $m(S_j') > m(S_j)$). Thus, the new partition is better than the original one, which contradicts the assumption. Hence, the theorem holds.

Theorem 2 In the optimal partition of the sorted keyword set $P = \{p_1, p_2, \ldots, p_n\}$, keywords with the same length are not dispersed into different subsets.

Theorem 1 assures that the keyword lengths of different subsets do not interlace; hence keywords of the same length cannot be dispersed into more than two subsets. So it suffices to complete the proof for the case of two subsets.

Proof. In an optimal partition $S_1, S_2, \ldots, S_k$, suppose the $C$ keywords with the same length $l$ are split into two subsets, i.e., $C_i$ keywords in $S_i$ and $C_j$ in $S_j$ ($C_i > 0$, $C_j > 0$, $C_i + C_j = C$). From $S_i$ and $S_j$, two new subsets $S_i'$ and $S_j'$ can be constructed by moving the $C_i$ keywords from $S_i$ to $S_j$. For the sake of simplicity, we use $m_i'$ and $m_j'$ to denote the minimal lengths of $S_i'$ and $S_j'$, and $n_i'$ and $n_j'$ the numbers of keywords in them, respectively. We give the proof for the case $m(S_i) \le m(S_j)$; the case $m(S_i) \ge m(S_j)$ is similar and thus omitted.
In this case, we have

$$m_i' = m_i,\quad m_j' = m_j;\qquad n_i' = n_i - C_i,\quad n_j' = n_j + C_i.$$

Hence,

$$\begin{aligned}
&\bigl(T(m_i, n_i) + T(m_j, n_j)\bigr) - \bigl(T(m_i', n_i') + T(m_j', n_j')\bigr) \\
&\quad= \bigl(T(m_i, n_i) - T(m_i, n_i - C_i)\bigr) + \bigl(T(m_j, n_j) - T(m_j, n_j + C_i)\bigr) \\
&\quad= C_i \frac{\partial T}{\partial n}(m_i, n_i - \delta_1 C_i) - C_i \frac{\partial T}{\partial n}(m_j, n_j + \delta_2 C_i) \qquad (0 \le \delta_1 \le 1,\ 0 \le \delta_2 \le 1) \\
&\quad\ge 0
\end{aligned}$$

(by Property 4 in Section 2, and $m_i < m_j$). Thus, the new partition is better than the original one, which conflicts with the assumption. Therefore, in an optimal partition, keywords with the same length are in one subset.

The above two properties imply that keywords with the same length work as a block, that is, they are not separated in an optimal partition. Moreover, a subset $S_i$ of an optimal partition contains all the blocks with length in the interval $[m(S_i), M(S_i)]$.
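A consequence of the two theorems is that a candidate partition is fully described by cut points between consecutive length blocks, so the search can be organised as a dynamic program over those cut points, which is the same computation as a shortest path in a directed acyclic graph. The Python sketch below illustrates this under stated assumptions: the names `length_blocks`, `best_partition` and `toy_cost` are hypothetical, and `toy_cost` is an invented stand-in for $T(m, n)$ chosen only because it satisfies Properties 2 to 4; it is not a measured matching time.

```python
from collections import Counter


def length_blocks(keywords):
    """Group keywords into blocks: a sorted list of (length, count) pairs."""
    return sorted(Counter(len(k) for k in keywords).items())


def best_partition(blocks, cost):
    """Choose cut points between blocks that minimise the summed cost.

    blocks: sorted (length, count) pairs.
    cost(min_len, num_keywords): estimated matching time of one subset.
    Returns (total_cost, subsets), each subset being a list of lengths.
    """
    d = len(blocks)
    prefix = [0]  # prefix[j] = number of keywords in blocks[0:j]
    for _, count in blocks:
        prefix.append(prefix[-1] + count)
    best = [float("inf")] * (d + 1)  # best[j]: minimal cost covering blocks[0:j]
    best[0] = 0.0
    cut = [0] * (d + 1)              # cut[j]: start of the last subset in best[j]
    for j in range(1, d + 1):
        for i in range(j):
            # Subset made of blocks[i:j]; its minimal length is blocks[i][0].
            c = best[i] + cost(blocks[i][0], prefix[j] - prefix[i])
            if c < best[j]:
                best[j], cut[j] = c, i
    subsets, j = [], d
    while j > 0:                     # walk the cut points back to the start
        i = cut[j]
        subsets.append([length for length, _ in blocks[i:j]])
        j = i
    subsets.reverse()
    return best[d], subsets


def toy_cost(min_len, num_keywords):
    # Assumed cost model: a fixed per-subset scan overhead plus a per-keyword
    # term, both shrinking as the minimal length grows.
    return (50 + num_keywords) / min_len


keywords = ["ab"] + [f"{i:08d}" for i in range(100)]
print(best_partition(length_blocks(keywords), toy_cost))
```

With this toy cost, one 2-byte keyword mixed with a hundred 8-byte keywords is cheaper to handle as two subsets (44.25) than as one (75.5), whereas two equally sized blocks of lengths 4 and 5 stay together.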

3.2 Algorithm to Find the Optimal Partition
In this section, we model the optimal partition problem as a shortest-path problem in a weighted graph. Given a sorted keyword set $P = \{p_1, p_2, \ldots, p_n\}$, we create a partition graph $G$ as follows. For each block of length $i$ in $P$, a node $N_i$ is created to represent it, and an auxiliary node $N_{M(P)+1}$ is created to represent the end of $P$. Let $V = \{N_{m(P)}, N_{m(P)+1}, \ldots, N_{M(P)}, N_{M(P)+1}\}$. The edges of $G$ are specified as follows. For $N_i, N_j \in V$, there is an edge from $N_i$ to $N_j$, denoted $(N_i, N_j)$, if $i < j$. An edge $(N_i, N_j)$ represents a subset containing the blocks with length greater than or equal to $i$ but less than $j$. For each edge $(N_i, N_j)$, a weight $W(N_i, N_j)$ is assigned to measure the cost of making the corresponding blocks one subset. The matching time of a representative text on the subset is used as an estimate of $W(N_i, N_j)$. Therefore, the optimal partition corresponds to the shortest path in the partition graph $G$. In the example shown in the figure, the shortest path in the graph is 2 -> 6 -> 8; hence the optimal partition has two subsets, one containing the keywords with length 2, 3, 4, 5, and the other containing the keywords with length 6, 7.

The algorithm to find the optimal partition is given as follows:

Algorithm to Find the Optimal Partition
Input: a sorted keyword set $P$, a representative text $T$;
Output: the optimal partition of $P$;
1. Construct the partition graph $G = \langle V, E \rangle$, where $V = \{N_{m(P)}, N_{m(P)+1}, \ldots, N_{M(P)}, N_{M(P)+1}\}$, $E = \{(N_i, N_j) \mid N_i, N_j \in V, i < j\}$, and $W(N_i, N_j)$ is set as above;
2. Find the shortest path $(e_{i_1}, e_{i_2}, \ldots, e_{i_k})$ from $N_{m(P)}$ to $N_{M(P)+1}$, where each $e_{i_j}$ is an edge in $G$;
3. For each $e_{i_j}$, output a subset containing the corresponding blocks.

4 Experimental Result

4.1 Results on Random Data Set
In the test on random data, the size of the alphabet is 32. A program was written to build the patterns randomly. We build two groups of patterns, with 5000 patterns and lengths ranging from 4 bytes to 40 bytes. The random patterns are evenly distributed over the lengths. The search text is also built randomly. Its size is about 200M.
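A generator in the spirit of this setup might look like the Python sketch below. It is not the program used for the experiments: the names `random_patterns` and `random_text`, the choice of byte values 0 to 31 as the 32-symbol alphabet, the fixed seed, and the round-robin assignment of lengths (so that every length receives nearly the same number of patterns) are all assumptions made for illustration.

```python
import random
from collections import Counter


def random_patterns(count, min_len, max_len, alphabet, rng):
    """Build `count` random patterns whose lengths cycle over the range."""
    lengths = range(min_len, max_len + 1)
    patterns = []
    for k in range(count):
        # Round-robin over lengths: each length gets count/len(lengths)
        # patterns, give or take one.
        length = lengths[k % len(lengths)]
        patterns.append(bytes(rng.choice(alphabet) for _ in range(length)))
    return patterns


def random_text(size, alphabet, rng):
    """Build a random text of `size` bytes over the same alphabet."""
    return bytes(rng.choices(alphabet, k=size))


rng = random.Random(0)            # fixed seed so a run can be repeated
alphabet = bytes(range(32))       # assumed 32-symbol alphabet
patterns = random_patterns(5000, 4, 40, alphabet, rng)
text = random_text(1_000_000, alphabet, rng)   # scaled down from about 200M
per_length = Counter(len(p) for p in patterns)
print(len(patterns), min(per_length), max(per_length))
```

The text size is reduced here only to keep the sketch quick to run; the `size` argument can be raised to match the figure quoted above.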

The results show that the WuManber algorithm is much faster than the other two among the three classical algorithms. Yet when the number of patterns increases rapidly, WuManber's speed decreases more quickly than that of the other two algorithms. In both groups of experiments, COM, our partition-based algorithm, is the fastest. In the COM algorithm, the positions of the partition and the basic algorithms chosen differ between the test pattern sets. Overall, the COM algorithm is on average 1 to 3 times faster than the classical algorithms. The larger the number of patterns, the larger the speedup.

4.2 Results on Real Data Set
The best test is to use real patterns from real systems. In this test we use two groups of data: one extracted from Snort, the other extracted from the signatures of ClamAV. Snort is an open source IDS whose latest version is available for download. From the version we used, we extracted 2086 patterns whose lengths are larger than 1 byte. ClamAntiVirus is an open source antivirus system. Its virus database is updated every day, and the latest version is available for download. We use version 0.83 and extracted the patterns that contain no wildcards.

The training text is from MIT: a set of real network data used to evaluate the capability of IDSs. The data set is available for download. We use the file mit 1999 training week1 Friday inside.dat. We cut the file down from 64M to 16M for quick training. The matched text is mit 1999 training week1 Friday inside.dat, about 64M.

In the following table, the left part gives the distribution of pattern lengths and the right part gives the test results. It can be seen that in COM the optimal partition on the Snort patterns is a single group, using the WuManber algorithm. This means that COM is not necessarily better than the classical algorithms on some special pattern sets. On the other hand, it also means that COM is never slower than the classical algorithms: its speed is at least equal to that of the fastest classical algorithm, and generally higher. The superiority of COM is distinct on the ClamAV patterns. When the range of pattern lengths is large and the length distribution is very uneven, the partition strategy can increase the matching speed. The speedup becomes more pronounced as the number of patterns increases.

5 Conclusions
Conclusion is here.

Acknowledgment: The author expresses his deep appreciation to his advisors for help on the subject of this paper.

References
[1] G. Navarro, M. Raffinot, Flexible Pattern Matching in Strings: Practical On-line Search Algorithms for Texts and Biological Sequences, Cambridge University Press, 2002, pp. 74-76.
[2] D.E. Knuth, J.H. Morris, V.R. Pratt, Fast Pattern Matching in Strings, SIAM Journal on Computing, 1977.
[3] A.V. Aho, M.J. Corasick, Efficient string matching: an aid to bibliographic search, Communications of the ACM, 18(6), 1975.
[4] S. Wu, U. Manber, Fast text searching allowing errors, Communications of the ACM, 35(10): 83-91, 1992.
[5] R.S. Boyer, J.S. Moore, A fast string searching algorithm, Communications of the ACM, 20(10), 1977.
[6] B. Commentz-Walter, A string matching algorithm fast on the average, In Proceedings of the 6th International Colloquium on Automata, Languages and Programming, number 71 in Lecture Notes in Computer Science, 1979.
[7] S. Wu, U. Manber, A fast algorithm for multi-pattern searching, Report TR-94-17, Department of Computer Science, University of Arizona, Tucson, AZ, 1994.
[8] M. Crochemore, A. Czumaj, L. Gasieniec, S. Jarominek, T. Lecroq, W. Plandowski, W. Rytter, Speeding up two string matching algorithms, Algorithmica, 12(4/5), 1994.
[9] C. Allauzen, M. Crochemore, M. Raffinot, Efficient experimental string matching by weak factor recognition, In Proceedings of the 12th Annual Symposium on Combinatorial Pattern Matching, number 2089 in Lecture Notes in Computer Science, Springer-Verlag, 2001.
[10] A. Blumer, J. Blumer, A. Ehrenfeucht, D. Haussler, R. McConnell, Complete inverted files for efficient text retrieval and analysis, Journal of the ACM, 34(3), 1987.
[11] C. Allauzen, M. Raffinot, Factor oracle of a set of words, Technical report 99-11, Institut Gaspard-Monge, Université de Marne-la-Vallée, 1999.
[12] Xiaodong Wang, The Design and Analysis of Computer Algorithms, Publishing House of Electronic Industry, Beijing.