Efficient parallel string comparison

Efficient parallel string comparison Peter Krusche and Alexander Tiskin Department of Computer Science ParCo 2007

Outline: 1. Introduction (Bulk-Synchronous Parallelism; String comparison). 2. Sequential LCS algorithms (Sequential semi-local LCS; Divide-and-conquer semi-local LCS). 3. The parallel algorithm (Parallel score-matrix multiplication; Parallel LCS computation).

BSP Algorithms: Bulk-Synchronous Parallelism. Model for parallel computation: L. G. Valiant. A bridging model for parallel computation. Communications of the ACM, 33:103–111, 1990. Main ideas: p processors working asynchronously; communication takes g operations to transmit one element of data; synchronisation takes l operations.

BSP Algorithms: Bulk-Synchronous Parallelism. Computation proceeds in supersteps. Communication takes place at the end of each superstep. Between supersteps, barrier-style synchronisation takes place. (Figure: processors P1, P2, ..., Pp alternate computation and communication phases, separated by barrier synchronisations.)

BSP Algorithms: Bulk-Synchronous Parallelism. Superstep s has computation cost w_s and communication h_s = max(h_s^in, h_s^out). When there are S supersteps: computation work W = Σ_{1 ≤ s ≤ S} w_s, communication H = Σ_{1 ≤ s ≤ S} h_s. Formula for running time: T = W + g·H + l·S.
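To make the cost formula concrete, here is a minimal Python sketch (the function name and the example numbers are invented for illustration) that evaluates T = W + g·H + l·S from a list of per-superstep costs:

```python
# Sketch of the BSP cost formula T = W + g*H + l*S.
# 'supersteps' is a list of (w_s, h_in_s, h_out_s) triples;
# all values here are illustrative, not from the talk.

def bsp_time(supersteps, g, l):
    W = sum(w for w, _, _ in supersteps)                        # total computation
    H = sum(max(h_in, h_out) for _, h_in, h_out in supersteps)  # h_s = max(h_in_s, h_out_s)
    S = len(supersteps)                                         # number of supersteps
    return W + g * H + l * S

# Example: three supersteps on a machine with g = 4 and l = 100.
print(bsp_time([(1000, 10, 20), (500, 5, 5), (2000, 0, 30)], g=4, l=100))
```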

Parallel Prefix in BSP. Definition (Parallel prefix): given n values x_1, x_2, ..., x_n and an associative operator ⊕, compute the values x_1, x_1 ⊕ x_2, x_1 ⊕ x_2 ⊕ x_3, ..., x_1 ⊕ ⋯ ⊕ x_n. Fact: under some natural assumptions, we can carry out a parallel prefix operation over n elements on a BSP computer with p processors using W = O(n/p), H = O(p) and S = O(1).
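A sequential simulation of this BSP parallel prefix, as a sketch: one loop plays the role of the local-scan superstep, a second combines the p block totals. The function name and the flat-list representation are assumptions for illustration, not the authors' code.

```python
# Simulated BSP parallel prefix over n values with p (virtual) processors:
# superstep 1 computes local prefixes (W = O(n/p) per processor),
# superstep 2 combines the p block totals (H = O(p)), then the running
# offsets are applied locally.
from operator import add

def parallel_prefix(xs, p, op=add, identity=0):
    n = len(xs)
    blocks = [xs[q * n // p:(q + 1) * n // p] for q in range(p)]
    local = []
    for block in blocks:                  # superstep 1: local scans
        acc, prefix = identity, []
        for x in block:
            acc = op(acc, x)
            prefix.append(acc)
        local.append(prefix)
    offset, result = identity, []         # superstep 2: combine block totals
    for prefix in local:
        result.extend(op(offset, v) for v in prefix)
        if prefix:
            offset = op(offset, prefix[-1])
    return result

print(parallel_prefix([1, 2, 3, 4, 5, 6, 7, 8], p=3))  # [1, 3, 6, 10, 15, 21, 28, 36]
```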

Outline: 1. Introduction (Bulk-Synchronous Parallelism; String comparison). 2. Sequential LCS algorithms (Sequential semi-local LCS; Divide-and-conquer semi-local LCS). 3. The parallel algorithm (Parallel score-matrix multiplication; Parallel LCS computation).

The LCS Problem. Definition (Input data): let x = x_1 x_2 ... x_m and y = y_1 y_2 ... y_n be two strings over an alphabet Σ. Definition (Subsequence): a subsequence u of x can be obtained by deleting zero or more elements from x. Definition (Longest Common Subsequence): an LCS(x, y) is any string which is a subsequence of both x and y and has maximum possible length; the length of these sequences is LLCS(x, y).
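For reference, the textbook O(mn) dynamic program for LLCS is sketched below; this is the sequential baseline, not the semi-local algorithm of the talk.

```python
# Classic dynamic program: dp[i][j] = LLCS(x[:i], y[:j]).

def llcs(x, y):
    m, n = len(x), len(y)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if x[i - 1] == y[j - 1]:
                dp[i][j] = dp[i - 1][j - 1] + 1
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
    return dp[m][n]

print(llcs("acbc", "abca"))  # 3, e.g. "abc"
```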

The Semi-local LCS Problem. Definition (Substrings): a substring of a string x can be obtained by removing zero or more characters from the beginning and/or the end of x. Definition (Highest-score matrix): the element A(i, j) of the LCS highest-score matrix of two strings x and y gives the LLCS of y_i ... y_j and x. Definition (Semi-local LCS): solutions to the semi-local LCS problem are represented by a (possibly implicit) highest-score matrix A(i, j).
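A brute-force reference for the highest-score matrix, assuming the indexing convention that A(i, j) scores the substring given by the Python slice y[i:j]; this quadratic-times-LLCS sketch is useful only for checking small cases against faster implementations:

```python
# Brute-force string-substring highest-score matrix:
# A[i][j] = LLCS(x, y[i:j]) for 0 <= i <= j <= n (0 otherwise).

def llcs(x, y):
    m, n = len(x), len(y)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            dp[i][j] = (dp[i - 1][j - 1] + 1 if x[i - 1] == y[j - 1]
                        else max(dp[i - 1][j], dp[i][j - 1]))
    return dp[m][n]

def highest_score_matrix(x, y):
    n = len(y)
    return [[llcs(x, y[i:j]) if i <= j else 0 for j in range(n + 1)]
            for i in range(n + 1)]

for row in highest_score_matrix("acbc", "abca"):
    print(row)
```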

Outline: 1. Introduction (Bulk-Synchronous Parallelism; String comparison). 2. Sequential LCS algorithms (Sequential semi-local LCS; Divide-and-conquer semi-local LCS). 3. The parallel algorithm (Parallel score-matrix multiplication; Parallel LCS computation).

LCS grid dags and highest-score matrices. The LCS problem can be represented as a longest path problem in a grid dag. String-substring LCS problem: A(i, j) = length of the longest path from (0, i) to (n, j) (top to bottom). (Example figure: the grid dag for x = acbc, y = abca.)

Extended grid dag. An infinite extension of the LCS grid dag: outside the core area, everything matches. The extended highest-score matrix is now defined on indices [−∞ : +∞] × [−∞ : +∞].

Additional definitions. Definition (Integer ranges): we denote the set of integers {i, i + 1, ..., j} as [i : j]. Definition (Odd half-integers): we denote half-integer variables using a hat (î), and denote the set of half-integers {i + 1/2, i + 3/2, ..., j − 1/2} as ⟨i : j⟩.

Critical point theorem. Definition (Critical point): an odd half-integer point (i − 1/2, j + 1/2) is critical iff A(i, j) + 1 = A(i − 1, j) = A(i, j + 1) = A(i − 1, j + 1). Theorem (Schmidt 95, Alves+ 06, Tiskin 05): 1. We can represent the whole extended highest-score matrix by a finite set of such critical points. 2. Assuming w.l.o.g. input strings of equal length n, there are N = 2n such critical points that implicitly represent the whole score matrix. 3. There is an algorithm to obtain these points in time O(n²).
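The criterion translates directly into code. A sketch, assuming the score matrix is given explicitly as a Python dict over a finite window of integer points (a hypothetical representation chosen for illustration):

```python
# List odd half-integer critical points (i - 1/2, j + 1/2) of an
# explicitly given score matrix A: {(i, j): score}. Points whose
# neighbouring entries fall outside the window are skipped.
from fractions import Fraction

def critical_points(A):
    half = Fraction(1, 2)
    points = []
    for (i, j), a in A.items():
        neighbours = [(i - 1, j), (i, j + 1), (i - 1, j + 1)]
        if all(q in A for q in neighbours):
            if A[(i - 1, j)] == A[(i, j + 1)] == A[(i - 1, j + 1)] == a + 1:
                points.append((i - half, j + half))
    return points
```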

Highest-score matrices. Example (explicit highest-score matrix):

         j = 0   1   2   3   4   5   6   7
  i = -4     4   4   4   4   4   4   4   4
      -3     3   3   3   4   4   4   4   4
      -2     2   2   3   4   4   4   4   4
      -1     1   1   2   3   3   3   4   4
       0     0   1   2   3   3   3   4   4
       1    -1   0   1   2   2   2   3   4
       2    -2  -1   0   1   1   2   3   4
       3    -3  -2  -1   0   1   2   3   4

Querying highest-score matrix entries. Theorem (Tiskin 05): if d(i, j) is the number of critical points (î, ĵ) in the extended score matrix with i < î and ĵ < j, then A(i, j) = j − i − d(i, j). Definition (Density and distribution matrices): the elements d(i, j) form a distribution matrix over the entries of the density (permutation) matrix D, which has nonzeros at all critical points (î, ĵ) in the extended highest-score matrix: d(i, j) = Σ_{(î, ĵ) ∈ ⟨i : n⟩ × ⟨0 : j⟩} D(î, ĵ). This can be computed efficiently using range-querying data structures.
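The theorem gives a direct, if naive, way to answer queries; the linear scan below stands in for the range-querying data structure mentioned on the slide:

```python
# Recover A(i, j) from the critical points: A(i, j) = j - i - d(i, j),
# where d(i, j) counts critical points (ih, jh) with i < ih and jh < j.
# An efficient version would replace the scan by a 2-D dominance
# counting structure.

def score(critical_points, i, j):
    d = sum(1 for (ih, jh) in critical_points if ih > i and jh < j)
    return j - i - d
```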

Outline: 1. Introduction (Bulk-Synchronous Parallelism; String comparison). 2. Sequential LCS algorithms (Sequential semi-local LCS; Divide-and-conquer semi-local LCS). 3. The parallel algorithm (Parallel score-matrix multiplication; Parallel LCS computation).

Sequential highest-score matrix multiplication. Algorithm (Tiskin 05): given the distribution matrices d_A and d_B for two adjacent blocks of equal height M and width N in the grid dag, we can compute the distribution matrix d_C for the union of these blocks in time O(N^1.5 + M).

Sequential highest-score matrix multiplication. Combine critical points by removing double crossings. Trivial part, O(M). (Figure.)

Sequential highest-score matrix multiplication. Combine critical points by removing double crossings. Nontrivial part, O(N^1.5). (Figure.)

Nontrivial part. The non-trivial part can be seen as a (min, +) matrix product (d_A, d_B, d_C now denote the nontrivial parts of the corresponding distribution matrices): d_C(i, k) = min_j (d_A(i, j) + d_B(j, k)). Explicit form, naive algorithm: O(n³). Explicit form, algorithm that uses highest-score matrix properties: O(n²). Implicit form, divide-and-conquer: O(n^1.5).
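For comparison, the explicit-form naive product is a three-line sketch; the talk's contribution is performing the implicit version in O(n^1.5):

```python
# Naive O(n^3) (min, +) matrix product:
# C(i, k) = min_j (A(i, j) + B(j, k)).

def minplus(dA, dB):
    n = len(dA)
    return [[min(dA[i][j] + dB[j][k] for j in range(n)) for k in range(n)]
            for i in range(n)]

print(minplus([[0, 1], [2, 0]], [[0, 3], [1, 0]]))  # [[0, 1], [1, 0]]
```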

Divide-and-conquer multiplication: C-blocks and relevant nonzeros. (Figure: the density matrices D_A and D_B, a C-block of size h × h in D_C, the nonzeros relevant to it, and a local minimum.)

Divide-and-conquer multiplication: δ-sequences and relevant nonzeros. (Figure: D_A, D_B, D_C and a split position j.) Splitting the relevant nonzeros in D_A and D_B into two sets at a position j ∈ [0 : N], we get the numbers δ_A(j) of relevant nonzeros in D_A up to column j − 1/2, and δ_B(j) of relevant nonzeros in D_B starting at row j + 1/2.
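A literal sketch of these two counts, with the orientation conventions assumed from the slide (an efficient implementation would obtain both sequences in a single scan):

```python
# delta_A(j): number of relevant nonzeros of D_A in columns < j
# (i.e. up to column j - 1/2); delta_B(j): number of relevant
# nonzeros of D_B in rows > j (i.e. starting at row j + 1/2).
# Nonzeros are given as (row, column) half-integer pairs.

def delta_sequences(nonzeros_A, nonzeros_B, N):
    delta_A = [sum(1 for (_, jh) in nonzeros_A if jh < j) for j in range(N + 1)]
    delta_B = [sum(1 for (jh, _) in nonzeros_B if jh > j) for j in range(N + 1)]
    return delta_A, delta_B
```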

Divide-and-conquer multiplication: j-blocks. (Example figure: j-blocks of width h, with d = h, d = 0 and d = h marked on δ_A and δ_B.) Definition (Δ-sequences): the δ's do not change inside a j-block J(d), so we can set Δ_A(d) = δ_A(j) and Δ_B(d) = δ_B(j) for any j ∈ J(d).

Divide-and-conquer multiplication: local minima. Definition (local minima): the sequence M(d) = min_{j ∈ J(d)} (d_A(i_0, j) + d_B(j, k_0)) contains the minimum of d_A(i_0, j) + d_B(j, k_0) in every j-block. We can use the M's and Δ's to compute the number of nonzeros in a C-block.

Divide-and-conquer multiplication: recursive step. The sequences M for every C-subblock can be computed in O(h): each of the four subblocks takes its M(d′) as a minimum, over the admissible d, of the appropriate one of M(d), M(d) + Δ_B(d), M(d) + Δ_A(d), or M(d) + Δ_A(d) + Δ_B(d), with d′ ∈ [−h/2 : h/2] and the admissible d determined by the subblock's offsets (i, k, h/2) within Δ_A and Δ_B.

Divide-and-conquer multiplication: recursive step. The sequences Δ_A(d′) and Δ_B(d′) can also be determined in O(h) by a scan of the relevant nonzeros for each subblock. Knowing Δ_A(d′), Δ_B(d′) and M(d′) for each subblock, we can continue the recursion in every subblock. The recursion terminates when N C-blocks of size 1 are left.

Outline: 1. Introduction (Bulk-Synchronous Parallelism; String comparison). 2. Sequential LCS algorithms (Sequential semi-local LCS; Divide-and-conquer semi-local LCS). 3. The parallel algorithm (Parallel score-matrix multiplication; Parallel LCS computation).

Basic algorithm idea. Start the recursion at a point where there are p C-blocks; this is at level ½ log p. Precompute and distribute the required sequences Δ and M for each C-block in parallel. Every C-block has size h = N/√p, and hence requires sequences with O(N/√p) values. After these sequences have been precomputed and redistributed, we can use the sequential algorithm to finish the computation.

Assumptions. Assume that: √p is an integer; every processor has a unique identifier q with 0 ≤ q < p; every processor q corresponds to exactly one location (q_x, q_y) ∈ [0 : √p − 1] × [0 : √p − 1]; the initial distribution of nonzeros in D_A and D_B is even among all processors.

First step. Redistribute the nonzeros into vertical strips of width N/p: send all nonzeros (î, ĵ) in D_A and (ĵ, k̂) in D_B to processor ⌊(ĵ − 1/2) · p/N⌋. Possible in one superstep using communication O(N/p). (Figure: D_A, D_B and D_C divided into p vertical strips with N/p nonzeros each.)

Precomputing M. (Figure: D_A, D_B, D_C.) 1. Compute the elementary (min, +) products d_A(x, j) + d_B(j, y) along j ∈ [0 : N]. 2. Processor q holds all D_A(î, ĵ) and all D_B(ĵ, k̂) for ĵ ∈ ⟨q·N/p : (q + 1)·N/p⟩. 3. The values d_A(x, j) and d_B(j, y) can be computed by using parallel prefix/suffix. 4. After the prefix and suffix computations, every processor holds the N/p values d_A(x, j) + d_B(j, y) for j ∈ [q·N/p : (q + 1)·N/p].

Redistribution step. (Figure: the nonzeros of D_A and D_B are redistributed from p vertical strips with N/p nonzeros each to horizontal strips aligned with the C-blocks, with N/√p nonzeros each.)

Analysis. Computational work is bounded by the sequential recursion: W = O((N/√p)^1.5) = O(N^1.5 / p^0.75). Every processor holds O(N/p) nonzeros before redistribution. Every nonzero is relevant for √p C-blocks. O(N/√p) communication suffices for redistributing the nonzeros. H = O(N/√p + p + N/√p) = O(N/√p) (if N/√p > p, i.e. N > p^1.5). S = O(1) (parallel prefix).
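A back-of-envelope sketch of how these bounds scale, with all constant factors dropped; purely illustrative:

```python
# Asymptotic cost skeleton of the parallel score-matrix multiplication:
# W = O(N^1.5 / p^0.75), H = O(N / sqrt(p)), S = O(1), constants dropped.

def multiplication_costs(N, p):
    W = N ** 1.5 / p ** 0.75
    H = N / p ** 0.5
    return W, H

for p in (1, 4, 16, 64):
    W, H = multiplication_costs(10**6, p)
    print(f"p={p:2d}  W~{W:.3g}  H~{H:.3g}")
```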

Outline: 1. Introduction (Bulk-Synchronous Parallelism; String comparison). 2. Sequential LCS algorithms (Sequential semi-local LCS; Divide-and-conquer semi-local LCS). 3. The parallel algorithm (Parallel score-matrix multiplication; Parallel LCS computation).

Quadtree merging. First, compute scores for a regular grid of p sub-dags of size n/√p × n/√p. Then merge these in a quadtree-like scheme using parallel score-matrix multiplication. (Figure: the quadtree merging scheme.)

Analysis. The quadtree has ½ log₂ p levels. On level l, 0 ≤ l ≤ ½ log₂ p, we have: p_l = p/4^l (number of processors that work together on one merge), N_l = N/2^l (block size of merge), w_l = O(N_l^1.5 / p_l^0.75) = O((N/2^l)^1.5 / (p/4^l)^0.75) = O(N^1.5 / p^0.75), and h_l = O(N_l / √p_l) = O((N/2^l) / √(p/4^l)) = O(N / √p).
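A worked check, under the reconstruction above with N_l = N/2^l and p_l = p/4^l, that the per-level costs are independent of l:

```latex
\[
  w_l \;=\; \frac{N_l^{1.5}}{p_l^{0.75}}
      \;=\; \frac{(N/2^l)^{1.5}}{(p/4^l)^{0.75}}
      \;=\; \frac{N^{1.5}\,2^{-1.5l}}{p^{0.75}\,2^{-1.5l}}
      \;=\; \frac{N^{1.5}}{p^{0.75}},
  \qquad
  h_l \;=\; \frac{N_l}{\sqrt{p_l}}
      \;=\; \frac{N/2^l}{\sqrt{p}\,2^{-l}}
      \;=\; \frac{N}{\sqrt{p}}.
\]
% Summing over the (1/2) log_2 p levels gives the totals on the next
% slide: W = O(N^{1.5} \log p / p^{0.75}) and H = O(N \log p / \sqrt{p}).
```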

Analysis. The quadtree has ½ log₂ p levels. Hence, we get work W = O(n²/p + N^1.5 · log p / p^0.75) = O(n²/p) (assuming that n ≥ p²), communication H = O(n · log p / √p), and S = O(log p) supersteps.

Comparison to other parallel algorithms:

Problem                               W                H                      S          References
Global LCS                            O(n²/p)          O(n)                   O(p)       [McColl 95] + [Wagner & Fischer 74]
String-substring LCS                  O(n²/p)          O(C·p^(1/C)·n·log p)   O(log p)   [Alves+ 03]
                                      O(n²/p)          O(n·log p)             O(log p)   [Tiskin 05], [Alves+ 06]
String-substring, prefix-suffix LCS   O(n²·log n/p)    O(n²·log p/p)          O(log p)   [Alves+ 02]
                                      O(n²/p)          O(n)                   O(p)       [McColl 95] + [Alves+ 06], [Tiskin 05]
                                      O(n²/p)          O(n·log p/√p)          O(log p)   NEW

Summary and Outlook. Summary: we have looked at a parallel algorithm for semi-local string comparison that is communication-efficient (in fact, achieving scalable communication), work-optimal, and asymptotically better in communication than even global LCS computation. Outlook: score-matrix multiplication can also be applied to create a scalable algorithm for the longest increasing subsequence problem; the algorithm can be adapted to compute edit distances.

Thank you! Any questions?