A Partition-Based Efficient Algorithm for Large Scale. Multiple-Strings Matching
|
|
|
- Aleesha Gibbs
- 9 years ago
- Views:
Transcription
1 A Partition-Based Efficient Algorithm for Large Scale Multiple-Strings Matching Ping Liu Jianlong Tan, Yanbing Liu Software Division, Institute of Computing Technology, Chinese Academy of Sciences, Beijing, Abstract Filtering procedure plays an important role in the Internet security and information retrieval fields, and usually employs multiple-strings matching algorithm as its key part. All the classical matching algorithms, however, perform poorly when the number of the keywords exceeds 5000, which made large scale multiple-strings matching problem a great challenge. Based on the observation that the speed of the classical algorithms depends on the minimal length of the keywords, a partition strategy was proposed to decompose the keywords set into a series of subsets on which the classical algorithms was performed. In the optimal partition, it was proved that the keywords with same length would be separated into one subset, and length of keywords in different subsets would not interlace each other. In this paper, we proposed a shortest-path model for the optimal partition problem. Experiments on both random dataset and ClamAV dataset demonstrated our algorithms works much better than the classical ones. Key Words large scale multiple-strings matching, partition, shortest path 1 Introduction With the development of Internet, more and more information, including bad along with good, appeared and congested the network. To secure Internet and retrieve useful information, filtering systems were designed and deployed on the gateways to filter out bad things. A filtering system usually employs a string matching procedure as its key part, and always contains a large scale keywords set to suit to various focuses. Hence, it is really a great challenge to design an efficient multi-strings matching algorithm for a large scale keywords set. This paper is supported by the NSFC grant No..
2 1.1 Related Work String matching problem has been received extensive research, most of which follow a common procedure, i.e., compare keywords with substring of text within a fixed length window, and then shift the window from left to right as far as possible. According to the way that patterns are compared with the text in the window, string matching algorithms can be categorized into three classes: prefix searching, suffix searching and factor searching. In prefix searching methods, window-shifting is accomplished through computing the longest common prefix between the text and the patterns. There are two ways to compute the length of the longest prefix. The first way is to compute the longest suffix of the text read that is also a prefix of the string, typified by the famous KMP Knuth-Morris-Pratt algorithm [2] and the Aho-Corasick algorithm[3]. The second maintain a set of prefix of keywords that are also suffix of text, and update the set at each character read. This is what the Shift-And and the Shift-Or algorithm do[4]. The suffix searching approach consists in searching backward for a suffix of the window that is also a suffix of the keyword, which features the backward searching from right to left within the window. The Boyer-Moore algorithm[5] is of this kind, and so are the Commentz-Walter algorithm [6]and the Wu-Manber algorithm[7]. Factor searching, the most efficient algorithms in practice for long keywords, can be treated as the integration of the prefix searching and the suffix searching, which search backward in a window like suffix searching, but search for the longest suffix of the window that is also a factor of the keyword. The most famous algorithms of this kind are BDM (Backward DAWG Matching) algorithm[8], BOM (Backward Oracle Matching) algorithm[9] SBDM (Set Backward DAWG Matching) algorithm[10], SBOM (Set Backward Oracle Matching) algorithm[11]. The performance of the classical multi-strings matching algorithms are determined mainly by the following three factors: the number of the keywords, the minimal length of the keywords and the size of the alphabet. In addition, the distribution of keywords in the text would also affect the performance. All of the classical algorithms are inapplicable for the case where there are a large scale of keywords set. Experiment on real data demonstrated that these algorithms perform bad when the number of patterns exceeds 5,000 or the minimal length of the patterns is 4 bytes. In this paper, we proposed a partition-based algorithm suitable for filtering with the large-scale keywords set. 1.2 Our Contribution Our contributions within this paper is as follows:
3 1). We analyzed the average time-complexity of three representative algorithms, i.e., SBOM, WuManber and advanced Aho Corasick, and surveyed the relationship between their speed and the the minimal length of the patterns. 2). We proposed a partition-based strategy to bound the influence of the shortest keywords on the performance. For the speed of the classical algorithms are tightly relative to the minimal length of the keywords, partitioning the keywords set into smaller subset would bound the influence of the shortest keywords, and thus improve the performance. 3). We proposed a shortest-path model of the optimal partition. Two properties of the optimal partition were proved, i.e., the keywords with same length would locate in one subset of the optimal partition, and the length of the keywords of different subset would not interlace each other. Based on the above theorems, a shortest-path model was proposed for the optimal partition. We implemented our partition-based algorithm into a C program and did experiment on real data, which demonstrated its efficiency on large-scale keywords set compared with the classical ones. The rest of this paper are organized as follow: Section 2 describes the average time-complexity and property of three classical algorithms; Section 3 proposes the partition-based strategy and prove two theorems for the optimal partitions; Section 4 reports the implementation and the experimental results on real data; Section 5 mentions the further work. 2 Properties about Performance of Classical Algorithms In this section, we analyze the average-case time complexity of three representative multistrings matching algorithms: SBOM, WuManber, advanced Aho Corasick. Let use Σ to denote the alphabet, n to denote the length of the text, r the number of the pattern, and b to denote the block size in WuManber algorithm. Let us denote m(s) or m the minimal length of a keywords set S, and M(S) or M the maximal one. Assuming an uniform distribution of text and keywords over the alphabet Σ, and that n is large enough, it is easy to estimate the average-case time complexity of Advanced Aho Corasick algorithm is O(n), WuManber algorithm is O ( ) n, and SBOM (m b+1) (1 (m b+1) r 2 Σ b ) algorithm is O ( n log Σ mr ). m log Σ mr The above analysis implies the following properties about the speed of the classical algorithms, which are confirmed by experimental result on real data: Property 1 The main factors affecting the speed of multi-strings matching algorithms are the size of alphabet, the number of the patterns and the minimal length of the patterns. Hence,
4 the matching time can be denoted as T (r, m) if the size of alphabet is fixed. Property 2 The matching time of a multi-strings matching algorithm increases monotonously when the number of the patterns increase, i.e., T r(r, m) > 0. (See Fig. 1) Property 3 The matching time of a multi-strings matching algorithms decreases monotonously when minimal length of the patterns increase, i.e., T m(r, m) < 0. (See Fig. 2) Property 4 The increase rate of the matching time with the number of patterns is independent of the number of the patterns, and increase when the minimal length decrease, i.e., F x(x, y) = H(y) > 0andH (y) < 0. 3 A Partition-Based Matching Algorithm The shortest keywords, though very small in quantity, have a great affect on the matching time. To bound their influence, an intuitional idea is to decompose the keywords set into a series of smaller subsets, and then choose an appropriate classical matching algorithms to run on each subset. Since the influence of the shortest keywords is bounded in the smaller subset rather than the entire set, the sum of matching time on individual subsets is even smaller than the time costed to run on an entire set directly. For a given keywords set P = {p 1, p 2, p 3,, p n }, there are many kinds of feasible partitions, among which the optimal one will bring the minimal matching time. Here, we assumed that the keywords was already sorted according to its length, i.e., p 1 p 2 p n. Then the optimal partition finding problem can be defined as follows: Optimal Partition Finding Problem Given a sorted keywords set P = {p 1, p 2,, p n }, k to construct a partition S 1, S 2,, S k, so that S i = P, 1 i, j k, S i Sj =, and k T (m(s i ), S i ) is minimized. i=1 3.1 Properties about Optimal Partition In the following, two properties about the optimal partition are proved to provide solid foundation for finding it. Theorem 1 There exist an optimal partition, S 1, S 2,, S k, of the sorted keywords set P = {p 1, p 2,, p n }, and for i j either a S i, b S j, a b ; or a S i, b S j, a b ; This theorem demonstrates the continuum of the optimal partition, that is, the intervals formed by the subset, [m(s i ), M(S i )], would not intercross each other. For the sake of simplicity, we use m i and m j to denote the minimal length of S i and S j, n i and n j the number of keywords in them, respectively. i=1
5 Proof: Suppose in an optimal partition S 1, S 2,, S k, there exist two subsets S i and S j with intercrossing interval, i.e.,[m(s i ), M(S i )] intercross with [m(s j ), M(S j )]. We give proof for the case m i m j, and the case m i m j is similar thus omitted. Two new subset S i and S j were constructed by exchanging the longest keyword of S i with the shortest one in S j. Thus, we have: m i = m i, m j m j ; n i = n i, n j = n j ; Hence, (T (m i, n i ) + T (m j, n j )) (T (m i, n i ) + T (m j, n j )) = (T (m i, n i ) T (m i, n i )) + (T (m j, n j ) T (m j, n j )) 0 (Property 3 in section 2, and m(s j ) > m(s j )) Thus, the new partition is better than the original one, which contradicts with the assumption. Hence, the theorem holds. Theorem 2 In the optimal partition of the sorted keywords set P = {p 1, p 2, p n }, keywords with the same length would not disperse into different subsets. Theorem 1 assures that the length of the keywords in different subsets would not interlace, hence the keywords with same length would not disperse into more than two subsets. So, it suffices to complete the proof for the case of two subsets. Proof. In an optimal partition S 1, S 2,, S k, suppose the C keywords with same length l were split into two subsets, i.e., C i keywords in S i and C j in S j (C i > 0, C j > 0, C i + C j = C). For S i and S j, two new subsets, S i and S j, could be constructed through removing the C i keywords from S i to S j. For the sake of simplicity, we use m i and m j to denote the minimal length of S i and S j, n i and n j the number of keywords in them, respectively. We give proof for the case m(s i ) m(s j ), and the case m(s i ) m(s j ) is similar thus omitted. In this case, we have m i = m i, m j = m j ; n i = n i C i, n j = n j + C i ; Hence, (T (m i, n i ) + T (m j, n j )) (T (m i, n i ) + T (m j, n j )) = (T (m i, n i ) T (m i, n i C i )) + (T (m j, n j ) T (m j, n j + C i )) = C i n T (m i, n i δ 1 C i ) C i n T (m j, n j + δ 2 C i ) (0 δ 1 1, 0 δ 2 1) 0 (Propety 4 in section 2, and m1 < m2) Thus, the new partition is better than the original one, which conflict with the assumption. Therefore, in an optimal partition, the keywords with same length would be in one subset. The above two properties imply that the keywords with same length work as a block, that is, they would not separate in an optimal partition. Moreover, a subset S i of an optimal partition contains all the blocks with length in the interval [m(s i ), M(S i )].
6 3.2 Algorithm to Find the Optimal Partition In this section, we model the optimal partition problem into finding the shortest-path problem in a weighted graph. Given a sorted keywords set P = p 1, p 2,, p n, we create a partition graph G as follows. For each a block with length i in P, a node N i is created to represent it, and an auxiliary node N M(P )+1 is created to represent the end of P. Let V = {N m(p ), N m(p )+1,, N M(P ), N M(P )+1 }. The edges of G is specified as follows. For N i and N j V, there is an edge from N i to N j, denoted as (N i, N j ) if i < j. In deed, an edge (N i, N j ) is used to represent a subset containing blocks with length greater than or equal with N i, but less than N j. For each edge (N i, N j ), a weight W (N i, N j ) was assigned to measure the benefit of setting the corresponding blocks as a subset. The matching time of a representative text on the subset was used as an estimation of W (N i, N j ). Therefore, the optimal partition correspond to the shortest path in the partition graph G. An example is shown in Fig The short path in the graph is 2 > 6 > 8, hence the optimal partition has two subset, one containing keywords with length 2,3,4,5, and the other having keywords with length 6,7. The algorithm to find the optimal partition is given as follows: Algorithm to Find the Optimal Partition Input: A sorted keywords P, a representative text T ; Output: The optimal partition of P ; 1. Construct the partition graph G =< V, E >, here, V = {N m(p ), N m(p )+1,, N M(P ), N M(P )+1 }; E = {(N i, N j ) N i, N j V, i < j}.. W (N i, N j ) is set as above; 2. Finding the shortest path (e i1, e i2,, e ik ) from N m(p ) to N M(P )+1. Here, e ij is an edge in G; 3. For each e ij, output a subset containing the corresponding blocks; 4 Experimental Result 4.1 Results on Random Data Set In the test on the random data, the size of the alphabet is 32. A program is made to build the pattern randomly. We build two group patterns which number are 5000, and length are from 4 bytes to 40 bytes. The random patterns are symmetrical in the number of the length. The search text is also build randomly. It s size about 200M.
7 It can be seen from above that the speed of the WuManber algorithm is much more quick than the other two algorithms among the three classical algorithms. Yet when the number of the pattern number is increase rapidly, WuManber s speed decrease more quickly than the other two algorithms. In both of the two group experiments, the speed of the COM algorithm is the most best. In the COM algorithm the positions of the partition and the basic algorithms are different in the test patterns. By the all, the speed of the COM algorithm could increase averagely 1-3 times than the classical algorithms. The larger the number of the patterns is, the larger the increase is. 4.2 Results on Real Data Set The best test is to use the real patterns from real systems. In this test we use two groups data: one group is extracted from Snort, the other is extracted from the signatures of ClamAV. Snort is a open source IDS. It s last version can download from We use the version and extracted 2086 patterns which lengths are larger than 1 byte. ClamAntiVirus is a open source AntiVirus system. It s virus database is updated everyday and the last version can download from We use the version 0.83 and extracted patterns which no wildcards in them. The training text is from MIT, a group data of the real network, which are used to evaluate the capability of IDS. The data set can download from We use the file mit 1999 training week1 Friday inside.dat. We cut off the file from 64M to 16M for quickly training. The matched text is mit 1999 training week1 Friday inside.dat, about 64M. In the follow table the left part are the distributing of pattern lengths and the right parts
8 are the test results. It can be seen from above that in COM the optical partition on Snort patterns is only one group, use Wumanber algorithm. This mean that COM is not must better than the classical algorithms on some special pattern sets. On the other side this also mean that the speed of COM must not slower than the speed of the classical algorithms, and the speed at least equal to the most quick one of the classical algorithms, generally more quickly. The superiority of COM is distinct on the ClamAV patterns. When the length range of patterns is large and the length distributing is very asymmetrical, use partition strategy can increase the speed of matching. The increase of the matching speed is more obviously with the increase of the pattern number.
9 5 Conclusions Conclusion is here. Acknowledgment: The author expresses him deep appreciation to his advisors for help on the subject of this paper. References [1] Gonzalo Navarro and Mathieu Raffinot, Flexible Pattern Matching in Strings Practical on-line search algorithms for texts and biological sequences, Camedge University Press,2002,ISBN pp74 76 [2] D.E.Knuth,J.H.Morris,V.R.Pratt, Fast Pattern Matching in Strings,SIAM Journal on Computing,Page ,1977 [3] A.V. Aho and M.J.Corasick, Efficient string matching:an aid to bibliographic search, Communication of the ACM,18(6): ,1975 [4] S. Wu, U. Manber, Fast text searching allowing errors,communications of the ACM, 35(10): 83 91,1992 [5] R.S.Boyer, J.S.Moore, A fast string searching algorithm,communications of the ACM,20(10): ,1977 [6] B.Commentz-Walter, A string matching algorithm fast on the average, In Proceeding s of the 6th International Colloquium on Automata, Language and Programming, number 71 in Lecture Notes in Computer Science,pages ,1979 [7] S.Wu, U.Manber, A fast algorithm for multi-pattern searching, Report TR 94 17, Department of Computer Science, University of Arizona,Tucson, AZ,1994 [8] M.Crochemore, A.Czumaj, L.Gasienniec, S.Jarominek, T.Lecroq, W.Plandowski, W.Rytter, Speeding up two string matching algorithms, Algorithmica,12(4/5): ,1994 [9] C.Allauzen, M.Crochemore, M.Raffinot, Efficient experimental string matching by weak factor recognition, In proceedings of the 12th Annual Symposium on Combinatorial Pattern Matching, number 2089 in Lecture Notes in Computer Science,pages Springer-Verlag,2001 [10] A.Blumer, J.Blumer, A.Ehrenfeucht, D.Haussler, R.McConnel. Complete inverted files for efficient text retrieval and analysis, Jonual of the ACM,34(3): ,1987 [11] C.Allauzen, M.Raffinot Factor oracle of a set of words,technical report 99 11, Institute Gaspard- Monge, University de Marne-la-vallee, 1999 [12] Xiaodong Wang The Design and Analysis of Computer Algorithms Publishing House of Electronic Industry, Beijing ISBN P
Fast string matching
Fast string matching This exposition is based on earlier versions of this lecture and the following sources, which are all recommended reading: Shift-And/Shift-Or 1. Flexible Pattern Matching in Strings,
On line construction of suffix trees 1
(To appear in ALGORITHMICA) On line construction of suffix trees 1 Esko Ukkonen Department of Computer Science, University of Helsinki, P. O. Box 26 (Teollisuuskatu 23), FIN 00014 University of Helsinki,
A Multiple Sliding Windows Approach to Speed Up String Matching Algorithms
A Multiple Sliding Windows Approach to Speed Up String Matching Algorithms Simone Faro and Thierry Lecroq Università di Catania, Viale A.Doria n.6, 95125 Catania, Italy Université de Rouen, LITIS EA 4108,
CMSC 858T: Randomized Algorithms Spring 2003 Handout 8: The Local Lemma
CMSC 858T: Randomized Algorithms Spring 2003 Handout 8: The Local Lemma Please Note: The references at the end are given for extra reading if you are interested in exploring these ideas further. You are
8.1 Min Degree Spanning Tree
CS880: Approximations Algorithms Scribe: Siddharth Barman Lecturer: Shuchi Chawla Topic: Min Degree Spanning Tree Date: 02/15/07 In this lecture we give a local search based algorithm for the Min Degree
A Non-Linear Schema Theorem for Genetic Algorithms
A Non-Linear Schema Theorem for Genetic Algorithms William A Greene Computer Science Department University of New Orleans New Orleans, LA 70148 bill@csunoedu 504-280-6755 Abstract We generalize Holland
Minimizing Probing Cost and Achieving Identifiability in Probe Based Network Link Monitoring
Minimizing Probing Cost and Achieving Identifiability in Probe Based Network Link Monitoring Qiang Zheng, Student Member, IEEE, and Guohong Cao, Fellow, IEEE Department of Computer Science and Engineering
Robust Quick String Matching Algorithm for Network Security
18 IJCSNS International Journal of Computer Science and Network Security, VOL.6 No.7B, July 26 Robust Quick String Matching Algorithm for Network Security Jianming Yu, 1,2 and Yibo Xue, 2,3 1 Department
Approximated Distributed Minimum Vertex Cover Algorithms for Bounded Degree Graphs
Approximated Distributed Minimum Vertex Cover Algorithms for Bounded Degree Graphs Yong Zhang 1.2, Francis Y.L. Chin 2, and Hing-Fung Ting 2 1 College of Mathematics and Computer Science, Hebei University,
The Student-Project Allocation Problem
The Student-Project Allocation Problem David J. Abraham, Robert W. Irving, and David F. Manlove Department of Computing Science, University of Glasgow, Glasgow G12 8QQ, UK Email: {dabraham,rwi,davidm}@dcs.gla.ac.uk.
1 Approximating Set Cover
CS 05: Algorithms (Grad) Feb 2-24, 2005 Approximating Set Cover. Definition An Instance (X, F ) of the set-covering problem consists of a finite set X and a family F of subset of X, such that every elemennt
Fairness in Routing and Load Balancing
Fairness in Routing and Load Balancing Jon Kleinberg Yuval Rabani Éva Tardos Abstract We consider the issue of network routing subject to explicit fairness conditions. The optimization of fairness criteria
A Fast Pattern Matching Algorithm with Two Sliding Windows (TSW)
Journal of Computer Science 4 (5): 393-401, 2008 ISSN 1549-3636 2008 Science Publications A Fast Pattern Matching Algorithm with Two Sliding Windows (TSW) Amjad Hudaib, Rola Al-Khalid, Dima Suleiman, Mariam
Approximability of Two-Machine No-Wait Flowshop Scheduling with Availability Constraints
Approximability of Two-Machine No-Wait Flowshop Scheduling with Availability Constraints T.C. Edwin Cheng 1, and Zhaohui Liu 1,2 1 Department of Management, The Hong Kong Polytechnic University Kowloon,
Improved Single and Multiple Approximate String Matching
Improved Single and Multiple Approximate String Matching Kimmo Fredrisson and Gonzalo Navarro 2 Department of Computer Science, University of Joensuu [email protected] 2 Department of Computer Science,
arxiv:1112.0829v1 [math.pr] 5 Dec 2011
How Not to Win a Million Dollars: A Counterexample to a Conjecture of L. Breiman Thomas P. Hayes arxiv:1112.0829v1 [math.pr] 5 Dec 2011 Abstract Consider a gambling game in which we are allowed to repeatedly
Lecture 4: BK inequality 27th August and 6th September, 2007
CSL866: Percolation and Random Graphs IIT Delhi Amitabha Bagchi Scribe: Arindam Pal Lecture 4: BK inequality 27th August and 6th September, 2007 4. Preliminaries The FKG inequality allows us to lower bound
NP-Completeness and Cook s Theorem
NP-Completeness and Cook s Theorem Lecture notes for COM3412 Logic and Computation 15th January 2002 1 NP decision problems The decision problem D L for a formal language L Σ is the computational task:
Duplicating and its Applications in Batch Scheduling
Duplicating and its Applications in Batch Scheduling Yuzhong Zhang 1 Chunsong Bai 1 Shouyang Wang 2 1 College of Operations Research and Management Sciences Qufu Normal University, Shandong 276826, China
Offline sorting buffers on Line
Offline sorting buffers on Line Rohit Khandekar 1 and Vinayaka Pandit 2 1 University of Waterloo, ON, Canada. email: [email protected] 2 IBM India Research Lab, New Delhi. email: [email protected]
Improved Single and Multiple Approximate String Matching
Improved Single and Multiple Approximate String Matching Kimmo Fredriksson Department of Computer Science, University of Joensuu, Finland Gonzalo Navarro Department of Computer Science, University of Chile
Information Theory and Coding Prof. S. N. Merchant Department of Electrical Engineering Indian Institute of Technology, Bombay
Information Theory and Coding Prof. S. N. Merchant Department of Electrical Engineering Indian Institute of Technology, Bombay Lecture - 17 Shannon-Fano-Elias Coding and Introduction to Arithmetic Coding
Competitive Analysis of On line Randomized Call Control in Cellular Networks
Competitive Analysis of On line Randomized Call Control in Cellular Networks Ioannis Caragiannis Christos Kaklamanis Evi Papaioannou Abstract In this paper we address an important communication issue arising
CS 598CSC: Combinatorial Optimization Lecture date: 2/4/2010
CS 598CSC: Combinatorial Optimization Lecture date: /4/010 Instructor: Chandra Chekuri Scribe: David Morrison Gomory-Hu Trees (The work in this section closely follows [3]) Let G = (V, E) be an undirected
Bicolored Shortest Paths in Graphs with Applications to Network Overlay Design
Bicolored Shortest Paths in Graphs with Applications to Network Overlay Design Hongsik Choi and Hyeong-Ah Choi Department of Electrical Engineering and Computer Science George Washington University Washington,
Polarization codes and the rate of polarization
Polarization codes and the rate of polarization Erdal Arıkan, Emre Telatar Bilkent U., EPFL Sept 10, 2008 Channel Polarization Given a binary input DMC W, i.i.d. uniformly distributed inputs (X 1,...,
Dynamic Programming. Lecture 11. 11.1 Overview. 11.2 Introduction
Lecture 11 Dynamic Programming 11.1 Overview Dynamic Programming is a powerful technique that allows one to solve many different types of problems in time O(n 2 ) or O(n 3 ) for which a naive approach
A Load Balancing Technique for Some Coarse-Grained Multicomputer Algorithms
A Load Balancing Technique for Some Coarse-Grained Multicomputer Algorithms Thierry Garcia and David Semé LaRIA Université de Picardie Jules Verne, CURI, 5, rue du Moulin Neuf 80000 Amiens, France, E-mail:
Big Data Analytics of Multi-Relationship Online Social Network Based on Multi-Subnet Composited Complex Network
, pp.273-284 http://dx.doi.org/10.14257/ijdta.2015.8.5.24 Big Data Analytics of Multi-Relationship Online Social Network Based on Multi-Subnet Composited Complex Network Gengxin Sun 1, Sheng Bin 2 and
An Introduction to Information Theory
An Introduction to Information Theory Carlton Downey November 12, 2013 INTRODUCTION Today s recitation will be an introduction to Information Theory Information theory studies the quantification of Information
6.2 Permutations continued
6.2 Permutations continued Theorem A permutation on a finite set A is either a cycle or can be expressed as a product (composition of disjoint cycles. Proof is by (strong induction on the number, r, of
Definition 11.1. Given a graph G on n vertices, we define the following quantities:
Lecture 11 The Lovász ϑ Function 11.1 Perfect graphs We begin with some background on perfect graphs. graphs. First, we define some quantities on Definition 11.1. Given a graph G on n vertices, we define
The Load-Distance Balancing Problem
The Load-Distance Balancing Problem Edward Bortnikov Samir Khuller Yishay Mansour Joseph (Seffi) Naor Yahoo! Research, Matam Park, Haifa 31905 (Israel) Department of Computer Science, University of Maryland,
Single machine parallel batch scheduling with unbounded capacity
Workshop on Combinatorics and Graph Theory 21th, April, 2006 Nankai University Single machine parallel batch scheduling with unbounded capacity Yuan Jinjiang Department of mathematics, Zhengzhou University
Polynomial Degree and Lower Bounds in Quantum Complexity: Collision and Element Distinctness with Small Range
THEORY OF COMPUTING, Volume 1 (2005), pp. 37 46 http://theoryofcomputing.org Polynomial Degree and Lower Bounds in Quantum Complexity: Collision and Element Distinctness with Small Range Andris Ambainis
CSC2420 Fall 2012: Algorithm Design, Analysis and Theory
CSC2420 Fall 2012: Algorithm Design, Analysis and Theory Allan Borodin November 15, 2012; Lecture 10 1 / 27 Randomized online bipartite matching and the adwords problem. We briefly return to online algorithms
Optimizing Pattern Matching for Intrusion Detection
Optimizing Pattern Matching for Intrusion Detection Marc Norton Abstract This paper presents an optimized version of the Aho-Corasick [1] algorithm. This design represents a significant enhancement to
1 Introductory Comments. 2 Bayesian Probability
Introductory Comments First, I would like to point out that I got this material from two sources: The first was a page from Paul Graham s website at www.paulgraham.com/ffb.html, and the second was a paper
Concrete Security of the Blum-Blum-Shub Pseudorandom Generator
Appears in Cryptography and Coding: 10th IMA International Conference, Lecture Notes in Computer Science 3796 (2005) 355 375. Springer-Verlag. Concrete Security of the Blum-Blum-Shub Pseudorandom Generator
Breaking Generalized Diffie-Hellman Modulo a Composite is no Easier than Factoring
Breaking Generalized Diffie-Hellman Modulo a Composite is no Easier than Factoring Eli Biham Dan Boneh Omer Reingold Abstract The Diffie-Hellman key-exchange protocol may naturally be extended to k > 2
Regular Expressions with Nested Levels of Back Referencing Form a Hierarchy
Regular Expressions with Nested Levels of Back Referencing Form a Hierarchy Kim S. Larsen Odense University Abstract For many years, regular expressions with back referencing have been used in a variety
Classification - Examples
Lecture 2 Scheduling 1 Classification - Examples 1 r j C max given: n jobs with processing times p 1,...,p n and release dates r 1,...,r n jobs have to be scheduled without preemption on one machine taking
On the independence number of graphs with maximum degree 3
On the independence number of graphs with maximum degree 3 Iyad A. Kanj Fenghui Zhang Abstract Let G be an undirected graph with maximum degree at most 3 such that G does not contain any of the three graphs
Reading 13 : Finite State Automata and Regular Expressions
CS/Math 24: Introduction to Discrete Mathematics Fall 25 Reading 3 : Finite State Automata and Regular Expressions Instructors: Beck Hasti, Gautam Prakriya In this reading we study a mathematical model
Single-Link Failure Detection in All-Optical Networks Using Monitoring Cycles and Paths
Single-Link Failure Detection in All-Optical Networks Using Monitoring Cycles and Paths Satyajeet S. Ahuja, Srinivasan Ramasubramanian, and Marwan Krunz Department of ECE, University of Arizona, Tucson,
The Goldberg Rao Algorithm for the Maximum Flow Problem
The Goldberg Rao Algorithm for the Maximum Flow Problem COS 528 class notes October 18, 2006 Scribe: Dávid Papp Main idea: use of the blocking flow paradigm to achieve essentially O(min{m 2/3, n 1/2 }
Triangle deletion. Ernie Croot. February 3, 2010
Triangle deletion Ernie Croot February 3, 2010 1 Introduction The purpose of this note is to give an intuitive outline of the triangle deletion theorem of Ruzsa and Szemerédi, which says that if G = (V,
A Network Flow Approach in Cloud Computing
1 A Network Flow Approach in Cloud Computing Soheil Feizi, Amy Zhang, Muriel Médard RLE at MIT Abstract In this paper, by using network flow principles, we propose algorithms to address various challenges
Load Balancing and Switch Scheduling
EE384Y Project Final Report Load Balancing and Switch Scheduling Xiangheng Liu Department of Electrical Engineering Stanford University, Stanford CA 94305 Email: [email protected] Abstract Load
Notes on Complexity Theory Last updated: August, 2011. Lecture 1
Notes on Complexity Theory Last updated: August, 2011 Jonathan Katz Lecture 1 1 Turing Machines I assume that most students have encountered Turing machines before. (Students who have not may want to look
Private Approximation of Clustering and Vertex Cover
Private Approximation of Clustering and Vertex Cover Amos Beimel, Renen Hallak, and Kobbi Nissim Department of Computer Science, Ben-Gurion University of the Negev Abstract. Private approximation of search
Every tree contains a large induced subgraph with all degrees odd
Every tree contains a large induced subgraph with all degrees odd A.J. Radcliffe Carnegie Mellon University, Pittsburgh, PA A.D. Scott Department of Pure Mathematics and Mathematical Statistics University
Cacti with minimum, second-minimum, and third-minimum Kirchhoff indices
MATHEMATICAL COMMUNICATIONS 47 Math. Commun., Vol. 15, No. 2, pp. 47-58 (2010) Cacti with minimum, second-minimum, and third-minimum Kirchhoff indices Hongzhuan Wang 1, Hongbo Hua 1, and Dongdong Wang
Serial and Parallel Bayesian Spam Filtering using Aho- Corasick and PFAC
Serial and Parallel Bayesian Spam Filtering using Aho- Corasick and PFAC Saima Haseeb TIEIT Bhopal Mahak Motwani TIEIT Bhopal Amit Saxena TIEIT Bhopal ABSTRACT With the rapid growth of Internet, E-mail,
Part 2: Community Detection
Chapter 8: Graph Data Part 2: Community Detection Based on Leskovec, Rajaraman, Ullman 2014: Mining of Massive Datasets Big Data Management and Analytics Outline Community Detection - Social networks -
P versus NP, and More
1 P versus NP, and More Great Ideas in Theoretical Computer Science Saarland University, Summer 2014 If you have tried to solve a crossword puzzle, you know that it is much harder to solve it than to verify
Outline. NP-completeness. When is a problem easy? When is a problem hard? Today. Euler Circuits
Outline NP-completeness Examples of Easy vs. Hard problems Euler circuit vs. Hamiltonian circuit Shortest Path vs. Longest Path 2-pairs sum vs. general Subset Sum Reducing one problem to another Clique
Topic: Greedy Approximations: Set Cover and Min Makespan Date: 1/30/06
CS880: Approximations Algorithms Scribe: Matt Elder Lecturer: Shuchi Chawla Topic: Greedy Approximations: Set Cover and Min Makespan Date: 1/30/06 3.1 Set Cover The Set Cover problem is: Given a set of
Approximation Algorithms
Approximation Algorithms or: How I Learned to Stop Worrying and Deal with NP-Completeness Ong Jit Sheng, Jonathan (A0073924B) March, 2012 Overview Key Results (I) General techniques: Greedy algorithms
Handout #Ch7 San Skulrattanakulchai Gustavus Adolphus College Dec 6, 2010. Chapter 7: Digraphs
MCS-236: Graph Theory Handout #Ch7 San Skulrattanakulchai Gustavus Adolphus College Dec 6, 2010 Chapter 7: Digraphs Strong Digraphs Definitions. A digraph is an ordered pair (V, E), where V is the set
Fault Analysis in Software with the Data Interaction of Classes
, pp.189-196 http://dx.doi.org/10.14257/ijsia.2015.9.9.17 Fault Analysis in Software with the Data Interaction of Classes Yan Xiaobo 1 and Wang Yichen 2 1 Science & Technology on Reliability & Environmental
Distributed Computing over Communication Networks: Maximal Independent Set
Distributed Computing over Communication Networks: Maximal Independent Set What is a MIS? MIS An independent set (IS) of an undirected graph is a subset U of nodes such that no two nodes in U are adjacent.
OHJ-2306 Introduction to Theoretical Computer Science, Fall 2012 8.11.2012
276 The P vs. NP problem is a major unsolved problem in computer science It is one of the seven Millennium Prize Problems selected by the Clay Mathematics Institute to carry a $ 1,000,000 prize for the
Completion Time Scheduling and the WSRPT Algorithm
Completion Time Scheduling and the WSRPT Algorithm Bo Xiong, Christine Chung Department of Computer Science, Connecticut College, New London, CT {bxiong,cchung}@conncoll.edu Abstract. We consider the online
1 Introduction. Dr. T. Srinivas Department of Mathematics Kakatiya University Warangal 506009, AP, INDIA [email protected]
A New Allgoriitthm for Miiniimum Costt Liinkiing M. Sreenivas Alluri Institute of Management Sciences Hanamkonda 506001, AP, INDIA [email protected] Dr. T. Srinivas Department of Mathematics Kakatiya
MapReduce and Distributed Data Analysis. Sergei Vassilvitskii Google Research
MapReduce and Distributed Data Analysis Google Research 1 Dealing With Massive Data 2 2 Dealing With Massive Data Polynomial Memory Sublinear RAM Sketches External Memory Property Testing 3 3 Dealing With
The Basics of Graphical Models
The Basics of Graphical Models David M. Blei Columbia University October 3, 2015 Introduction These notes follow Chapter 2 of An Introduction to Probabilistic Graphical Models by Michael Jordan. Many figures
14.1 Rent-or-buy problem
CS787: Advanced Algorithms Lecture 14: Online algorithms We now shift focus to a different kind of algorithmic problem where we need to perform some optimization without knowing the input in advance. Algorithms
2.1 Complexity Classes
15-859(M): Randomized Algorithms Lecturer: Shuchi Chawla Topic: Complexity classes, Identity checking Date: September 15, 2004 Scribe: Andrew Gilpin 2.1 Complexity Classes In this lecture we will look
Finding and counting given length cycles
Finding and counting given length cycles Noga Alon Raphael Yuster Uri Zwick Abstract We present an assortment of methods for finding and counting simple cycles of a given length in directed and undirected
Cost Model: Work, Span and Parallelism. 1 The RAM model for sequential computation:
CSE341T 08/31/2015 Lecture 3 Cost Model: Work, Span and Parallelism In this lecture, we will look at how one analyze a parallel program written using Cilk Plus. When we analyze the cost of an algorithm
Analysis of Algorithms, I
Analysis of Algorithms, I CSOR W4231.002 Eleni Drinea Computer Science Department Columbia University Thursday, February 26, 2015 Outline 1 Recap 2 Representing graphs 3 Breadth-first search (BFS) 4 Applications
Distributed Computing over Communication Networks: Topology. (with an excursion to P2P)
Distributed Computing over Communication Networks: Topology (with an excursion to P2P) Some administrative comments... There will be a Skript for this part of the lecture. (Same as slides, except for today...
Reminder: Complexity (1) Parallel Complexity Theory. Reminder: Complexity (2) Complexity-new
Reminder: Complexity (1) Parallel Complexity Theory Lecture 6 Number of steps or memory units required to compute some result In terms of input size Using a single processor O(1) says that regardless of
Shift-based Pattern Matching for Compressed Web Traffic
Shift-based Pattern Matching for Compressed Web Traffic Anat Bremler-Barr Computer Science Dept. Interdisciplinary Center, Herzliya, Israel Email: [email protected] Yaron Koral Blavatnik School of Computer
17.6.1 Introduction to Auction Design
CS787: Advanced Algorithms Topic: Sponsored Search Auction Design Presenter(s): Nilay, Srikrishna, Taedong 17.6.1 Introduction to Auction Design The Internet, which started of as a research project in
Stationary random graphs on Z with prescribed iid degrees and finite mean connections
Stationary random graphs on Z with prescribed iid degrees and finite mean connections Maria Deijfen Johan Jonasson February 2006 Abstract Let F be a probability distribution with support on the non-negative
respectively. Section IX concludes the paper.
1 Fast and Scalable Pattern Matching for Network Intrusion Detection Systems Sarang Dharmapurikar, and John Lockwood, Member IEEE Abstract High-speed packet content inspection and filtering devices rely
6.207/14.15: Networks Lecture 15: Repeated Games and Cooperation
6.207/14.15: Networks Lecture 15: Repeated Games and Cooperation Daron Acemoglu and Asu Ozdaglar MIT November 2, 2009 1 Introduction Outline The problem of cooperation Finitely-repeated prisoner s dilemma
2.3 Convex Constrained Optimization Problems
42 CHAPTER 2. FUNDAMENTAL CONCEPTS IN CONVEX OPTIMIZATION Theorem 15 Let f : R n R and h : R R. Consider g(x) = h(f(x)) for all x R n. The function g is convex if either of the following two conditions
Cycles and clique-minors in expanders
Cycles and clique-minors in expanders Benny Sudakov UCLA and Princeton University Expanders Definition: The vertex boundary of a subset X of a graph G: X = { all vertices in G\X with at least one neighbor
Integrating Benders decomposition within Constraint Programming
Integrating Benders decomposition within Constraint Programming Hadrien Cambazard, Narendra Jussien email: {hcambaza,jussien}@emn.fr École des Mines de Nantes, LINA CNRS FRE 2729 4 rue Alfred Kastler BP
Solutions to Homework 6
Solutions to Homework 6 Debasish Das EECS Department, Northwestern University [email protected] 1 Problem 5.24 We want to find light spanning trees with certain special properties. Given is one example
Large-Scale Data Sets Clustering Based on MapReduce and Hadoop
Journal of Computational Information Systems 7: 16 (2011) 5956-5963 Available at http://www.jofcis.com Large-Scale Data Sets Clustering Based on MapReduce and Hadoop Ping ZHOU, Jingsheng LEI, Wenjun YE
Total colorings of planar graphs with small maximum degree
Total colorings of planar graphs with small maximum degree Bing Wang 1,, Jian-Liang Wu, Si-Feng Tian 1 Department of Mathematics, Zaozhuang University, Shandong, 77160, China School of Mathematics, Shandong
x a x 2 (1 + x 2 ) n.
Limits and continuity Suppose that we have a function f : R R. Let a R. We say that f(x) tends to the limit l as x tends to a; lim f(x) = l ; x a if, given any real number ɛ > 0, there exists a real number
Outline Introduction Circuits PRGs Uniform Derandomization Refs. Derandomization. A Basic Introduction. Antonis Antonopoulos.
Derandomization A Basic Introduction Antonis Antonopoulos CoReLab Seminar National Technical University of Athens 21/3/2011 1 Introduction History & Frame Basic Results 2 Circuits Definitions Basic Properties
