A Partition-Based Efficient Algorithm for Large Scale Multiple-Strings Matching
Ping Liu, Jianlong Tan, Yanbing Liu
Software Division, Institute of Computing Technology, Chinese Academy of Sciences, Beijing

Abstract

Filtering plays an important role in Internet security and information retrieval, and usually employs a multiple-strings matching algorithm as its key part. All the classical matching algorithms, however, perform poorly when the number of keywords exceeds 5000, which makes large-scale multiple-strings matching a great challenge. Based on the observation that the speed of the classical algorithms depends on the minimal length of the keywords, we propose a partition strategy that decomposes the keywords set into a series of subsets, on each of which a classical algorithm is run. We prove that, in the optimal partition, keywords with the same length are placed in the same subset, and that the lengths of keywords in different subsets do not interlace. In this paper, we propose a shortest-path model for the optimal partition problem. Experiments on both a random dataset and the ClamAV dataset demonstrate that our algorithm works much better than the classical ones.

Key Words: large scale multiple-strings matching, partition, shortest path

1 Introduction

With the development of the Internet, more and more information, bad along with good, has appeared and congested the network. To secure the Internet and retrieve useful information, filtering systems are designed and deployed on gateways to filter out unwanted content. A filtering system usually employs a string matching procedure as its key part, and always contains a large-scale keywords set to cover its various focuses. Hence, designing an efficient multi-strings matching algorithm for a large-scale keywords set is a great challenge.

This paper is supported by the NSFC grant No..
1.1 Related Work

The string matching problem has received extensive research, most of which follows a common procedure: compare the keywords with a substring of the text within a fixed-length window, and then shift the window from left to right as far as possible. According to the way the patterns are compared with the text in the window, string matching algorithms can be categorized into three classes: prefix searching, suffix searching and factor searching.

In prefix searching methods, window shifting is accomplished by computing the longest common prefix between the text and the patterns. There are two ways to compute the length of the longest prefix. The first is to compute the longest suffix of the text read so far that is also a prefix of the string, typified by the famous Knuth-Morris-Pratt (KMP) algorithm [2] and the Aho-Corasick algorithm [3]. The second maintains the set of prefixes of the keywords that are also suffixes of the text, and updates the set at each character read. This is what the Shift-And and Shift-Or algorithms do [4].

The suffix searching approach consists of searching backward for a suffix of the window that is also a suffix of the keyword, and is characterized by searching from right to left within the window. The Boyer-Moore algorithm [5] is of this kind, and so are the Commentz-Walter algorithm [6] and the Wu-Manber algorithm [7].

Factor searching, the most efficient approach in practice for long keywords, can be treated as an integration of prefix searching and suffix searching: it searches backward in the window like suffix searching, but looks for the longest suffix of the window that is also a factor of the keyword. The most famous algorithms of this kind are the BDM (Backward DAWG Matching) algorithm [8], the BOM (Backward Oracle Matching) algorithm [9], the SBDM (Set Backward DAWG Matching) algorithm [10] and the SBOM (Set Backward Oracle Matching) algorithm [11].
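To make the second prefix-searching idea concrete, the following is a minimal Python sketch of Shift-And for a single keyword. It is an illustration only, not the implementation evaluated in this paper; the function name shift_and_search is hypothetical. Bit i of the state is set exactly when the first i+1 characters of the pattern are a suffix of the text read so far, so the state encodes the set of pattern prefixes that are also suffixes of the text.

```python
def shift_and_search(pattern: str, text: str) -> list[int]:
    """Return the start index of every occurrence of pattern in text.

    Illustrative single-pattern Shift-And; the name and signature are
    assumptions made for this sketch.
    """
    if not pattern:
        return []
    m = len(pattern)

    # masks[c] has bit i set iff pattern[i] == c.
    masks: dict[str, int] = {}
    for i, c in enumerate(pattern):
        masks[c] = masks.get(c, 0) | (1 << i)

    accept = 1 << (m - 1)   # bit that signals a complete match
    state = 0               # set of matched prefixes, one bit per prefix length
    hits: list[int] = []
    for pos, c in enumerate(text):
        # Extend every live prefix by one character and start a new one,
        # then keep only the prefixes whose next character really is c.
        state = ((state << 1) | 1) & masks.get(c, 0)
        if state & accept:
            hits.append(pos - m + 1)
    return hits
```

Each text character costs a constant number of word operations as long as the pattern fits in a machine word; Python integers are unbounded, so the sketch does not enforce that limit.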
The performance of the classical multi-strings matching algorithms is determined mainly by three factors: the number of keywords, the minimal length of the keywords and the size of the alphabet. In addition, the distribution of the keywords in the text also affects performance. All of the classical algorithms are inapplicable when the keywords set is large. Experiments on real data demonstrate that these algorithms perform badly when the number of patterns exceeds 5,000 or the minimal length of the patterns is 4 bytes. In this paper, we propose a partition-based algorithm suitable for filtering with a large-scale keywords set.

1.2 Our Contribution

Our contributions in this paper are as follows:
1). We analyzed the average time complexity of three representative algorithms, i.e., SBOM, WuManber and advanced Aho-Corasick, and surveyed the relationship between their speed and the minimal length of the patterns.

2). We proposed a partition-based strategy to bound the influence of the shortest keywords on performance. Because the speed of the classical algorithms is closely tied to the minimal length of the keywords, partitioning the keywords set into smaller subsets bounds the influence of the shortest keywords and thus improves performance.

3). We proposed a shortest-path model of the optimal partition. Two properties of the optimal partition were proved: keywords with the same length lie in one subset of the optimal partition, and the lengths of the keywords in different subsets do not interlace. Based on these theorems, a shortest-path model was proposed for the optimal partition.

We implemented our partition-based algorithm as a C program and ran experiments on real data, which demonstrated its efficiency on large-scale keywords sets compared with the classical algorithms.

The rest of this paper is organized as follows: Section 2 describes the average time complexity and properties of three classical algorithms; Section 3 proposes the partition-based strategy and proves two theorems about optimal partitions; Section 4 reports the implementation and the experimental results on real data; Section 5 mentions further work.

2 Properties about Performance of Classical Algorithms

In this section, we analyze the average-case time complexity of three representative multi-strings matching algorithms: SBOM, WuManber and advanced Aho-Corasick. Let $\Sigma$ denote the alphabet, $n$ the length of the text, $r$ the number of patterns, and $b$ the block size in the WuManber algorithm. Let $m(S)$, or simply $m$, denote the minimal length of a keywords set $S$, and $M(S)$, or simply $M$, the maximal one.
Assuming a uniform distribution of text and keywords over the alphabet $\Sigma$, and that $n$ is large enough, the average-case time complexities can be estimated as follows. The advanced Aho-Corasick algorithm takes $O(n)$; the WuManber algorithm takes

\[ O\!\left( \frac{n}{(m-b+1)\left(1 - \dfrac{(m-b+1)\,r}{2\,|\Sigma|^{b}}\right)} \right); \]

and the SBOM algorithm takes

\[ O\!\left( \frac{n \log_{|\Sigma|}(mr)}{m - \log_{|\Sigma|}(mr)} \right). \]

The above analysis implies the following properties about the speed of the classical algorithms, which are confirmed by experimental results on real data:

Property 1. The main factors affecting the speed of multi-strings matching algorithms are the size of the alphabet, the number of patterns and the minimal length of the patterns. Hence, the matching time can be denoted as $T(r, m)$ if the size of the alphabet is fixed.

Property 2. The matching time of a multi-strings matching algorithm increases monotonically as the number of patterns increases, i.e., $T_r(r, m) > 0$. (See Fig. 1)

Property 3. The matching time of a multi-strings matching algorithm decreases monotonically as the minimal length of the patterns increases, i.e., $T_m(r, m) < 0$. (See Fig. 2)

Property 4. The rate at which the matching time grows with the number of patterns is independent of the number of patterns, and increases as the minimal length decreases. Writing the matching time as $F(x, y)$, with $x$ the number of patterns and $y$ the minimal length, this reads $F_x(x, y) = H(y) > 0$ and $H'(y) < 0$.

3 A Partition-Based Matching Algorithm

The shortest keywords, though few in number, have a great effect on the matching time. To bound their influence, an intuitive idea is to decompose the keywords set into a series of smaller subsets, and then choose an appropriate classical matching algorithm to run on each subset. Since the influence of the shortest keywords is confined to a smaller subset rather than the entire set, the sum of the matching times on the individual subsets can be smaller than the time needed to run on the entire set directly.

For a given keywords set $P = \{p_1, p_2, p_3, \ldots, p_n\}$, there are many feasible partitions, among which the optimal one yields the minimal matching time. Here, we assume that the keywords are already sorted by length, i.e., $|p_1| \le |p_2| \le \cdots \le |p_n|$. The optimal partition finding problem can then be defined as follows:

Optimal Partition Finding Problem. Given a sorted keywords set $P = \{p_1, p_2, \ldots, p_n\}$, construct a partition $S_1, S_2, \ldots, S_k$ such that $\bigcup_{i=1}^{k} S_i = P$, $S_i \cap S_j = \emptyset$ for all $1 \le i, j \le k$ with $i \ne j$, and $\sum_{i=1}^{k} T(m(S_i), |S_i|)$ is minimized.

3.1 Properties about Optimal Partition

In the following, two properties of the optimal partition are proved to provide a solid foundation for finding it.
Theorem 1. There exists an optimal partition $S_1, S_2, \ldots, S_k$ of the sorted keywords set $P = \{p_1, p_2, \ldots, p_n\}$ such that, for $i \ne j$, either $\forall a \in S_i, \forall b \in S_j, |a| \le |b|$; or $\forall a \in S_i, \forall b \in S_j, |a| \ge |b|$.

This theorem establishes the contiguity of the optimal partition: the intervals formed by the subsets, $[m(S_i), M(S_i)]$, do not cross each other. For the sake of simplicity, we use $m_i$ and $m_j$ to denote the minimal lengths of $S_i$ and $S_j$, and $n_i$ and $n_j$ the numbers of keywords in them, respectively.
Proof: Suppose that in an optimal partition $S_1, S_2, \ldots, S_k$ there exist two subsets $S_i$ and $S_j$ with crossing intervals, i.e., $[m(S_i), M(S_i)]$ crosses $[m(S_j), M(S_j)]$. We give the proof for the case $m_i \le m_j$; the case $m_i \ge m_j$ is similar and thus omitted. Two new subsets $S_i'$ and $S_j'$ are constructed by exchanging the longest keyword of $S_i$ with the shortest one in $S_j$. Thus, we have:

\[ m_i' = m_i, \quad m_j' \ge m_j; \qquad n_i' = n_i, \quad n_j' = n_j. \]

Hence,

\[ \bigl(T(m_i, n_i) + T(m_j, n_j)\bigr) - \bigl(T(m_i', n_i') + T(m_j', n_j')\bigr) = \bigl(T(m_i, n_i) - T(m_i', n_i')\bigr) + \bigl(T(m_j, n_j) - T(m_j', n_j')\bigr) \ge 0 \]

(by Property 3 in Section 2, and $m(S_j') > m(S_j)$). Thus, the new partition is better than the original one, which contradicts the assumption. Hence, the theorem holds.

Theorem 2. In the optimal partition of the sorted keywords set $P = \{p_1, p_2, \ldots, p_n\}$, keywords with the same length are not dispersed into different subsets.

Theorem 1 ensures that the lengths of the keywords in different subsets do not interlace, hence keywords with the same length cannot be dispersed into more than two subsets. So it suffices to prove the case of two subsets.

Proof. In an optimal partition $S_1, S_2, \ldots, S_k$, suppose the $C$ keywords with the same length $l$ are split into two subsets, i.e., $C_i$ keywords in $S_i$ and $C_j$ in $S_j$ ($C_i > 0$, $C_j > 0$, $C_i + C_j = C$). From $S_i$ and $S_j$, two new subsets $S_i'$ and $S_j'$ can be constructed by moving the $C_i$ keywords from $S_i$ to $S_j$. For the sake of simplicity, we use $m_i'$ and $m_j'$ to denote the minimal lengths of $S_i'$ and $S_j'$, and $n_i'$ and $n_j'$ the numbers of keywords in them, respectively. We give the proof for the case $m(S_i) \le m(S_j)$; the case $m(S_i) \ge m(S_j)$ is similar and thus omitted.

In this case, we have

\[ m_i' = m_i, \quad m_j' = m_j; \qquad n_i' = n_i - C_i, \quad n_j' = n_j + C_i. \]

Hence,

\[ \bigl(T(m_i, n_i) + T(m_j, n_j)\bigr) - \bigl(T(m_i', n_i') + T(m_j', n_j')\bigr) \]
\[ = \bigl(T(m_i, n_i) - T(m_i, n_i - C_i)\bigr) + \bigl(T(m_j, n_j) - T(m_j, n_j + C_i)\bigr) \]
\[ = C_i \frac{\partial}{\partial n} T(m_i, n_i - \delta_1 C_i) - C_i \frac{\partial}{\partial n} T(m_j, n_j + \delta_2 C_i) \qquad (0 \le \delta_1 \le 1,\ 0 \le \delta_2 \le 1) \]
\[ \ge 0 \]

(by Property 4 in Section 2, and $m_i < m_j$). Thus, the new partition is better than the original one, which conflicts with the assumption. Therefore, in an optimal partition, keywords with the same length are in one subset.

The above two properties imply that keywords with the same length work as a block, that is, they are never separated in an optimal partition. Moreover, a subset $S_i$ of an optimal partition contains all the blocks with length in the interval $[m(S_i), M(S_i)]$.
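Because an optimal partition only cuts between blocks of equal-length keywords, the search reduces to choosing cut points among the distinct keyword lengths, which is the shortest-path formulation developed in Section 3.2. The Python sketch below is illustrative only and is not the C implementation evaluated in this paper. The function optimal_partition and its cost callback are hypothetical names: cost(min_len, count) stands in for the matching time of a subset, which the paper estimates by running a matcher on a representative text. The sketch creates nodes only for lengths that actually occur, plus the auxiliary end node.

```python
from collections import Counter
from typing import Callable, Sequence


def optimal_partition(
    keywords: Sequence[str],
    cost: Callable[[int, int], float],
) -> list[list[str]]:
    """Split keywords into length-contiguous subsets of minimal total cost.

    cost(min_len, count) is a hypothetical stand-in for the measured
    matching time of a subset with the given minimal length and size.
    """
    counts = Counter(len(k) for k in keywords)
    lengths = sorted(counts)              # one node per block (distinct length)
    if not lengths:
        return []
    nodes = lengths + [lengths[-1] + 1]   # auxiliary end node M(P)+1

    # prefix[i] = number of keywords strictly shorter than nodes[i]
    prefix = [0]
    for length in lengths:
        prefix.append(prefix[-1] + counts[length])

    # Nodes are already in topological order, so the shortest path in this
    # DAG is a simple dynamic programme over edges (i, j) with i < j.
    best = [0.0] + [float("inf")] * (len(nodes) - 1)
    prev = [0] * len(nodes)
    for j in range(1, len(nodes)):
        best[j], prev[j] = min(
            (best[i] + cost(nodes[i], prefix[j] - prefix[i]), i)
            for i in range(j)
        )

    # Walk back from the end node to recover the chosen edges.
    cuts: list[tuple[int, int]] = []
    j = len(nodes) - 1
    while j > 0:
        i = prev[j]
        cuts.append((nodes[i], nodes[j]))
        j = i
    cuts.reverse()

    # Edge (lo, hi) corresponds to the subset of keywords with lo <= length < hi.
    return [[k for k in keywords if lo <= len(k) < hi] for lo, hi in cuts]
```

With d distinct lengths the sketch evaluates O(d^2) edge weights. In the setting of this paper each weight would come from a timed matching run, so weight estimation, not the path search, is likely to dominate the cost.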
3.2 Algorithm to Find the Optimal Partition

In this section, we model the optimal partition problem as a shortest-path problem in a weighted graph. Given a sorted keywords set $P = \{p_1, p_2, \ldots, p_n\}$, we create a partition graph $G$ as follows. For each block with length $i$ in $P$, a node $N_i$ is created to represent it, and an auxiliary node $N_{M(P)+1}$ is created to represent the end of $P$. Let $V = \{N_{m(P)}, N_{m(P)+1}, \ldots, N_{M(P)}, N_{M(P)+1}\}$. The edges of $G$ are specified as follows. For $N_i, N_j \in V$, there is an edge from $N_i$ to $N_j$, denoted $(N_i, N_j)$, if $i < j$. Indeed, an edge $(N_i, N_j)$ represents a subset containing the blocks with length greater than or equal to $i$ but less than $j$. For each edge $(N_i, N_j)$, a weight $W(N_i, N_j)$ is assigned to measure the cost of making the corresponding blocks a subset. The matching time of a representative text on the subset is used as an estimate of $W(N_i, N_j)$. Therefore, the optimal partition corresponds to the shortest path in the partition graph $G$.

An example is shown in the figure. The shortest path in the graph is 2 -> 6 -> 8, hence the optimal partition has two subsets, one containing the keywords with lengths 2, 3, 4, 5, and the other containing the keywords with lengths 6, 7.

The algorithm to find the optimal partition is given as follows:

Algorithm to Find the Optimal Partition
Input: a sorted keywords set $P$, a representative text $T$;
Output: the optimal partition of $P$;
1. Construct the partition graph $G = \langle V, E \rangle$, where $V = \{N_{m(P)}, N_{m(P)+1}, \ldots, N_{M(P)}, N_{M(P)+1}\}$ and $E = \{(N_i, N_j) \mid N_i, N_j \in V, i < j\}$. $W(N_i, N_j)$ is set as above;
2. Find the shortest path $(e_{i_1}, e_{i_2}, \ldots, e_{i_k})$ from $N_{m(P)}$ to $N_{M(P)+1}$, where each $e_{i_j}$ is an edge in $G$;
3. For each $e_{i_j}$, output a subset containing the corresponding blocks.

4 Experimental Result

4.1 Results on Random Data Set

In the test on random data, the size of the alphabet is 32. A program was written to build the patterns randomly. We built two groups of patterns, 5000 in number, with lengths from 4 bytes to 40 bytes. The random patterns are evenly distributed over the lengths. The search text was also built randomly; its size is about 200M.
The results show that, among the three classical algorithms, WuManber is much faster than the other two. Yet when the number of patterns increases rapidly, the speed of WuManber decreases more quickly than that of the other two. In both groups of experiments, the COM algorithm is the fastest. In the COM algorithm, the partition positions and the base algorithms chosen differ across the test pattern sets. Overall, the speed of COM improves on that of the classical algorithms by 1 to 3 times on average. The larger the number of patterns, the larger the improvement.

4.2 Results on Real Data Set

The best test is to use real patterns from real systems. In this test we use two groups of data: one extracted from Snort, the other extracted from the signatures of ClamAV. Snort is an open-source IDS. Its latest version can be downloaded from [...]. We used version [...] and extracted 2086 patterns whose lengths are larger than 1 byte. ClamAntiVirus is an open-source antivirus system. Its virus database is updated every day and the latest version can be downloaded from [...]. We used version 0.83 and extracted [...] patterns that contain no wildcards.

The training text comes from MIT: a data set of real network traffic used to evaluate the capability of IDSs. The data set can be downloaded from [...]. We used the file mit 1999 training week1 Friday inside.dat, cut down from 64M to 16M for quick training. The matched text is mit 1999 training week1 Friday inside.dat, about 64M.

In the following table, the left part gives the distribution of pattern lengths and the right part gives the test results.

The results show that, for the Snort patterns, the optimal partition found by COM is a single group using the WuManber algorithm. This means that COM is not necessarily better than the classical algorithms on some special pattern sets. On the other hand, it also means that COM is never slower than the classical algorithms: its speed is at least equal to that of the fastest classical algorithm, and generally higher. The superiority of COM is distinct on the ClamAV patterns. When the range of pattern lengths is large and the length distribution is very uneven, the partition strategy increases the matching speed. The gain in matching speed becomes more pronounced as the number of patterns increases.
5 Conclusions

Conclusion is here.

Acknowledgment: The author expresses his deep appreciation to his advisors for help on the subject of this paper.

References

[1] Gonzalo Navarro and Mathieu Raffinot, Flexible Pattern Matching in Strings: Practical On-line Search Algorithms for Texts and Biological Sequences, Cambridge University Press, 2002, ISBN [...], pp. 74-76.
[2] D. E. Knuth, J. H. Morris, V. R. Pratt, Fast pattern matching in strings, SIAM Journal on Computing, pages [...], 1977.
[3] A. V. Aho and M. J. Corasick, Efficient string matching: an aid to bibliographic search, Communications of the ACM, 18(6): [...], 1975.
[4] S. Wu, U. Manber, Fast text searching allowing errors, Communications of the ACM, 35(10): 83-91, 1992.
[5] R. S. Boyer, J. S. Moore, A fast string searching algorithm, Communications of the ACM, 20(10): [...], 1977.
[6] B. Commentz-Walter, A string matching algorithm fast on the average, In Proceedings of the 6th International Colloquium on Automata, Languages and Programming, number 71 in Lecture Notes in Computer Science, pages [...], 1979.
[7] S. Wu, U. Manber, A fast algorithm for multi-pattern searching, Report TR-94-17, Department of Computer Science, University of Arizona, Tucson, AZ, 1994.
[8] M. Crochemore, A. Czumaj, L. Gasieniec, S. Jarominek, T. Lecroq, W. Plandowski, W. Rytter, Speeding up two string matching algorithms, Algorithmica, 12(4/5): [...], 1994.
[9] C. Allauzen, M. Crochemore, M. Raffinot, Efficient experimental string matching by weak factor recognition, In Proceedings of the 12th Annual Symposium on Combinatorial Pattern Matching, number 2089 in Lecture Notes in Computer Science, pages [...], Springer-Verlag, 2001.
[10] A. Blumer, J. Blumer, A. Ehrenfeucht, D. Haussler, R. McConnell, Complete inverted files for efficient text retrieval and analysis, Journal of the ACM, 34(3): [...], 1987.
[11] C. Allauzen, M. Raffinot, Factor oracle of a set of words, Technical report 99-11, Institut Gaspard-Monge, Université de Marne-la-Vallée, 1999.
[12] Xiaodong Wang, The Design and Analysis of Computer Algorithms, Publishing House of Electronic Industry, Beijing, ISBN P
More informationLoad Balancing and Switch Scheduling
EE384Y Project Final Report Load Balancing and Switch Scheduling Xiangheng Liu Department of Electrical Engineering Stanford University, Stanford CA 94305 Email: liuxh@systems.stanford.edu Abstract Load
More informationLongest Common Extensions via Fingerprinting
Longest Common Extensions via Fingerprinting Philip Bille, Inge Li Gørtz, and Jesper Kristensen Technical University of Denmark, DTU Informatics, Copenhagen, Denmark Abstract. The longest common extension
More informationNotes on Complexity Theory Last updated: August, 2011. Lecture 1
Notes on Complexity Theory Last updated: August, 2011 Jonathan Katz Lecture 1 1 Turing Machines I assume that most students have encountered Turing machines before. (Students who have not may want to look
More informationPrivate Approximation of Clustering and Vertex Cover
Private Approximation of Clustering and Vertex Cover Amos Beimel, Renen Hallak, and Kobbi Nissim Department of Computer Science, Ben-Gurion University of the Negev Abstract. Private approximation of search
More informationEvery tree contains a large induced subgraph with all degrees odd
Every tree contains a large induced subgraph with all degrees odd A.J. Radcliffe Carnegie Mellon University, Pittsburgh, PA A.D. Scott Department of Pure Mathematics and Mathematical Statistics University
More informationCacti with minimum, second-minimum, and third-minimum Kirchhoff indices
MATHEMATICAL COMMUNICATIONS 47 Math. Commun., Vol. 15, No. 2, pp. 47-58 (2010) Cacti with minimum, second-minimum, and third-minimum Kirchhoff indices Hongzhuan Wang 1, Hongbo Hua 1, and Dongdong Wang
More informationSerial and Parallel Bayesian Spam Filtering using Aho- Corasick and PFAC
Serial and Parallel Bayesian Spam Filtering using Aho- Corasick and PFAC Saima Haseeb TIEIT Bhopal Mahak Motwani TIEIT Bhopal Amit Saxena TIEIT Bhopal ABSTRACT With the rapid growth of Internet, E-mail,
More informationPart 2: Community Detection
Chapter 8: Graph Data Part 2: Community Detection Based on Leskovec, Rajaraman, Ullman 2014: Mining of Massive Datasets Big Data Management and Analytics Outline Community Detection - Social networks -
More informationP versus NP, and More
1 P versus NP, and More Great Ideas in Theoretical Computer Science Saarland University, Summer 2014 If you have tried to solve a crossword puzzle, you know that it is much harder to solve it than to verify
More informationOutline. NP-completeness. When is a problem easy? When is a problem hard? Today. Euler Circuits
Outline NP-completeness Examples of Easy vs. Hard problems Euler circuit vs. Hamiltonian circuit Shortest Path vs. Longest Path 2-pairs sum vs. general Subset Sum Reducing one problem to another Clique
More informationTopic: Greedy Approximations: Set Cover and Min Makespan Date: 1/30/06
CS880: Approximations Algorithms Scribe: Matt Elder Lecturer: Shuchi Chawla Topic: Greedy Approximations: Set Cover and Min Makespan Date: 1/30/06 3.1 Set Cover The Set Cover problem is: Given a set of
More informationApproximation Algorithms
Approximation Algorithms or: How I Learned to Stop Worrying and Deal with NP-Completeness Ong Jit Sheng, Jonathan (A0073924B) March, 2012 Overview Key Results (I) General techniques: Greedy algorithms
More informationHandout #Ch7 San Skulrattanakulchai Gustavus Adolphus College Dec 6, 2010. Chapter 7: Digraphs
MCS-236: Graph Theory Handout #Ch7 San Skulrattanakulchai Gustavus Adolphus College Dec 6, 2010 Chapter 7: Digraphs Strong Digraphs Definitions. A digraph is an ordered pair (V, E), where V is the set
More informationFault Analysis in Software with the Data Interaction of Classes
, pp.189-196 http://dx.doi.org/10.14257/ijsia.2015.9.9.17 Fault Analysis in Software with the Data Interaction of Classes Yan Xiaobo 1 and Wang Yichen 2 1 Science & Technology on Reliability & Environmental
More informationDistributed Computing over Communication Networks: Maximal Independent Set
Distributed Computing over Communication Networks: Maximal Independent Set What is a MIS? MIS An independent set (IS) of an undirected graph is a subset U of nodes such that no two nodes in U are adjacent.
More informationOHJ-2306 Introduction to Theoretical Computer Science, Fall 2012 8.11.2012
276 The P vs. NP problem is a major unsolved problem in computer science It is one of the seven Millennium Prize Problems selected by the Clay Mathematics Institute to carry a $ 1,000,000 prize for the
More informationUNIVERSIDADE DE SÃO PAULO
UNIVERSIDADE DE SÃO PAULO Instituto de Ciências Matemáticas e de Computação ISSN 0103-2569 Comments on On minimizing the lengths of checking sequences Adenilso da Silva Simão N ō 307 RELATÓRIOS TÉCNICOS
More informationCompletion Time Scheduling and the WSRPT Algorithm
Completion Time Scheduling and the WSRPT Algorithm Bo Xiong, Christine Chung Department of Computer Science, Connecticut College, New London, CT {bxiong,cchung}@conncoll.edu Abstract. We consider the online
More information1 Introduction. Dr. T. Srinivas Department of Mathematics Kakatiya University Warangal 506009, AP, INDIA tsrinivasku@gmail.com
A New Allgoriitthm for Miiniimum Costt Liinkiing M. Sreenivas Alluri Institute of Management Sciences Hanamkonda 506001, AP, INDIA allurimaster@gmail.com Dr. T. Srinivas Department of Mathematics Kakatiya
More informationMapReduce and Distributed Data Analysis. Sergei Vassilvitskii Google Research
MapReduce and Distributed Data Analysis Google Research 1 Dealing With Massive Data 2 2 Dealing With Massive Data Polynomial Memory Sublinear RAM Sketches External Memory Property Testing 3 3 Dealing With
More informationThe Basics of Graphical Models
The Basics of Graphical Models David M. Blei Columbia University October 3, 2015 Introduction These notes follow Chapter 2 of An Introduction to Probabilistic Graphical Models by Michael Jordan. Many figures
More informationSemi-Automatic Index Tuning: Keeping DBAs in the Loop
Semi-Automatic Index Tuning: Keeping DBAs in the Loop Karl Schnaitter Teradata Aster karl.schnaitter@teradata.com Neoklis Polyzotis UC Santa Cruz alkis@ucsc.edu ABSTRACT To obtain good system performance,
More information14.1 Rent-or-buy problem
CS787: Advanced Algorithms Lecture 14: Online algorithms We now shift focus to a different kind of algorithmic problem where we need to perform some optimization without knowing the input in advance. Algorithms
More information2.1 Complexity Classes
15-859(M): Randomized Algorithms Lecturer: Shuchi Chawla Topic: Complexity classes, Identity checking Date: September 15, 2004 Scribe: Andrew Gilpin 2.1 Complexity Classes In this lecture we will look
More informationFinding and counting given length cycles
Finding and counting given length cycles Noga Alon Raphael Yuster Uri Zwick Abstract We present an assortment of methods for finding and counting simple cycles of a given length in directed and undirected
More informationEfficient Multi-Feature Index Structures for Music Data Retrieval
header for SPIE use Efficient Multi-Feature Index Structures for Music Data Retrieval Wegin Lee and Arbee L.P. Chen 1 Department of Computer Science, National Tsing Hua University, Hsinchu, Taiwan 300,
More informationCost Model: Work, Span and Parallelism. 1 The RAM model for sequential computation:
CSE341T 08/31/2015 Lecture 3 Cost Model: Work, Span and Parallelism In this lecture, we will look at how one analyze a parallel program written using Cilk Plus. When we analyze the cost of an algorithm
More informationAnalysis of Algorithms, I
Analysis of Algorithms, I CSOR W4231.002 Eleni Drinea Computer Science Department Columbia University Thursday, February 26, 2015 Outline 1 Recap 2 Representing graphs 3 Breadth-first search (BFS) 4 Applications
More informationDistributed Computing over Communication Networks: Topology. (with an excursion to P2P)
Distributed Computing over Communication Networks: Topology (with an excursion to P2P) Some administrative comments... There will be a Skript for this part of the lecture. (Same as slides, except for today...
More informationReminder: Complexity (1) Parallel Complexity Theory. Reminder: Complexity (2) Complexity-new
Reminder: Complexity (1) Parallel Complexity Theory Lecture 6 Number of steps or memory units required to compute some result In terms of input size Using a single processor O(1) says that regardless of
More informationReminder: Complexity (1) Parallel Complexity Theory. Reminder: Complexity (2) Complexity-new GAP (2) Graph Accessibility Problem (GAP) (1)
Reminder: Complexity (1) Parallel Complexity Theory Lecture 6 Number of steps or memory units required to compute some result In terms of input size Using a single processor O(1) says that regardless of
More informationShift-based Pattern Matching for Compressed Web Traffic
Shift-based Pattern Matching for Compressed Web Traffic Anat Bremler-Barr Computer Science Dept. Interdisciplinary Center, Herzliya, Israel Email: bremler@idc.ac.il Yaron Koral Blavatnik School of Computer
More information17.6.1 Introduction to Auction Design
CS787: Advanced Algorithms Topic: Sponsored Search Auction Design Presenter(s): Nilay, Srikrishna, Taedong 17.6.1 Introduction to Auction Design The Internet, which started of as a research project in
More informationStationary random graphs on Z with prescribed iid degrees and finite mean connections
Stationary random graphs on Z with prescribed iid degrees and finite mean connections Maria Deijfen Johan Jonasson February 2006 Abstract Let F be a probability distribution with support on the non-negative
More informationrespectively. Section IX concludes the paper.
1 Fast and Scalable Pattern Matching for Network Intrusion Detection Systems Sarang Dharmapurikar, and John Lockwood, Member IEEE Abstract High-speed packet content inspection and filtering devices rely
More information6.207/14.15: Networks Lecture 15: Repeated Games and Cooperation
6.207/14.15: Networks Lecture 15: Repeated Games and Cooperation Daron Acemoglu and Asu Ozdaglar MIT November 2, 2009 1 Introduction Outline The problem of cooperation Finitely-repeated prisoner s dilemma
More information2.3 Convex Constrained Optimization Problems
42 CHAPTER 2. FUNDAMENTAL CONCEPTS IN CONVEX OPTIMIZATION Theorem 15 Let f : R n R and h : R R. Consider g(x) = h(f(x)) for all x R n. The function g is convex if either of the following two conditions
More informationCycles and clique-minors in expanders
Cycles and clique-minors in expanders Benny Sudakov UCLA and Princeton University Expanders Definition: The vertex boundary of a subset X of a graph G: X = { all vertices in G\X with at least one neighbor
More informationIntegrating Benders decomposition within Constraint Programming
Integrating Benders decomposition within Constraint Programming Hadrien Cambazard, Narendra Jussien email: {hcambaza,jussien}@emn.fr École des Mines de Nantes, LINA CNRS FRE 2729 4 rue Alfred Kastler BP
More informationSolutions to Homework 6
Solutions to Homework 6 Debasish Das EECS Department, Northwestern University ddas@northwestern.edu 1 Problem 5.24 We want to find light spanning trees with certain special properties. Given is one example
More informationLarge-Scale Data Sets Clustering Based on MapReduce and Hadoop
Journal of Computational Information Systems 7: 16 (2011) 5956-5963 Available at http://www.jofcis.com Large-Scale Data Sets Clustering Based on MapReduce and Hadoop Ping ZHOU, Jingsheng LEI, Wenjun YE
More informationTotal colorings of planar graphs with small maximum degree
Total colorings of planar graphs with small maximum degree Bing Wang 1,, Jian-Liang Wu, Si-Feng Tian 1 Department of Mathematics, Zaozhuang University, Shandong, 77160, China School of Mathematics, Shandong
More informationx a x 2 (1 + x 2 ) n.
Limits and continuity Suppose that we have a function f : R R. Let a R. We say that f(x) tends to the limit l as x tends to a; lim f(x) = l ; x a if, given any real number ɛ > 0, there exists a real number
More informationDepartment of Computer Science. The University of Arizona. Tucson Arizona
A TEXT COMPRESSION SCHEME THAT ALLOWS FAST SEARCHING DIRECTLY IN THE COMPRESSED FILE Udi Manber Technical report #93-07 (March 1993) Department of Computer Science The University of Arizona Tucson Arizona
More informationOutline Introduction Circuits PRGs Uniform Derandomization Refs. Derandomization. A Basic Introduction. Antonis Antonopoulos.
Derandomization A Basic Introduction Antonis Antonopoulos CoReLab Seminar National Technical University of Athens 21/3/2011 1 Introduction History & Frame Basic Results 2 Circuits Definitions Basic Properties
More information