A parallel algorithm for the extraction of structured motifs

Similar documents

Analysis of Algorithms I: Optimal Binary Search Trees

The LCA Problem Revisited

On line construction of suffix trees 1

Interconnection Networks. Interconnection Networks. Interconnection networks are used everywhere!

Analysis of Algorithms I: Binary Search Trees

Efficient Recovery of Secrets

Lecture 4: Exact string searching algorithms. Exact string search algorithms. Definitions. Exact string searching or matching

Arithmetic Coding: Introduction

Distance Degree Sequences for Network Analysis

Network (Tree) Topology Inference Based on Prüfer Sequence

The following themes form the major topics of this chapter: The terms and concepts related to trees (Section 5.2).

DNA is found in all organisms from the smallest bacteria to humans. DNA has the same composition and structure in all organisms!

On Covert Data Communication Channels Employing DNA Steganography with Application in Massive Data Storage

Offline sorting buffers on Line

Solutions to Homework 6

2. (a) Explain the strassen s matrix multiplication. (b) Write deletion algorithm, of Binary search tree. [8+8]

OPTIMAL BINARY SEARCH TREES

SOLiD System accuracy with the Exact Call Chemistry module

agucacaaacgcu agugcuaguuua uaugcagucuua

Regular Expressions with Nested Levels of Back Referencing Form a Hierarchy

CS711008Z Algorithm Design and Analysis

Translation. Translation: Assembly of polypeptides on a ribosome

RNA and Protein Synthesis

Information Theory and Coding Prof. S. N. Merchant Department of Electrical Engineering Indian Institute of Technology, Bombay

THE SECURITY AND PRIVACY ISSUES OF RFID SYSTEM

Physical Data Organization

Genetic information (DNA) determines structure of proteins DNA RNA proteins cell structure enzymes control cell chemistry ( metabolism )

Single machine models: Maximum Lateness -12- Approximation ratio for EDD for problem 1 r j,d j < 0 L max. structure of a schedule Q...

Binary Trees and Huffman Encoding Binary Search Trees

Home Page. Data Structures. Title Page. Page 1 of 24. Go Back. Full Screen. Close. Quit

GRAPH THEORY LECTURE 4: TREES

Cost Model: Work, Span and Parallelism. 1 The RAM model for sequential computation:

Binary Search Trees. A Generic Tree. Binary Trees. Nodes in a binary search tree ( B-S-T) are of the form. P parent. Key. Satellite data L R

Extended Application of Suffix Trees to Data Compression

Ph.D. Thesis. Judit Nagy-György. Supervisor: Péter Hajnal Associate Professor

Load balancing in a heterogeneous computer system by self-organizing Kohonen network

Partitioning and Divide and Conquer Strategies

Subgraph Patterns: Network Motifs and Graphlets. Pedro Ribeiro

On Frequency Assignment in Cellular Networks

VGRAM: Improving Performance of Approximate Queries on String Collections Using Variable-Length Grams

Name Class Date. Figure Which nucleotide in Figure 13 1 indicates the nucleic acid above is RNA? a. uracil c. cytosine b. guanine d.

Name Date Period. 2. When a molecule of double-stranded DNA undergoes replication, it results in

Topological Properties

Output: struct treenode{ int data; struct treenode *left, *right; } struct treenode *tree_ptr;

A Partition-Based Efficient Algorithm for Large Scale. Multiple-Strings Matching

GenBank, Entrez, & FASTA

HMM : Viterbi algorithm - a toy example

Process Mining by Measuring Process Block Similarity

GENERATING THE FIBONACCI CHAIN IN O(log n) SPACE AND O(n) TIME J. Patera

6.045: Automata, Computability, and Complexity Or, Great Ideas in Theoretical Computer Science Spring, Class 4 Nancy Lynch

SIMS 255 Foundations of Software Design. Complexity and NP-completeness

LZ77. Example 2.10: Let T = badadadabaab and assume d max and l max are large. phrase b a d adadab aa b

A Multiple Sliding Windows Approach to Speed Up String Matching Algorithms

On Binary Signed Digit Representations of Integers

Image Compression through DCT and Huffman Coding Technique

Binary Coded Web Access Pattern Tree in Education Domain

Lossless Data Compression Standard Applications and the MapReduce Web Computing Framework

Suffix Tree Construction and Storage with Limited Main Memory

Lecture 15 An Arithmetic Circuit Lowerbound and Flows in Graphs

Classification/Decision Trees (II)

Vector storage and access; algorithms in GIS. This is lecture 6

EE602 Algorithms GEOMETRIC INTERSECTION CHAPTER 27

The Role of Size Normalization on the Recognition Rate of Handwritten Numerals

Web Document Clustering

Sorting Hierarchical Data in External Memory for Archiving

Load Balancing. Load Balancing 1 / 24

R-trees. R-Trees: A Dynamic Index Structure For Spatial Searching. R-Tree. Invariants

Near Optimal Solutions

Machine Learning for Medical Image Analysis. A. Criminisi & the InnerEye MSRC

A Non-Linear Schema Theorem for Genetic Algorithms

Load Balancing between Computing Clusters

Less naive Bayes spam detection

Finding Frequent Patterns Based On Quantitative Binary Attributes Using FP-Growth Algorithm

Name Class Date. binomial nomenclature. MAIN IDEA: Linnaeus developed the scientific naming system still used today.

System Interconnect Architectures. Goals and Analysis. Network Properties and Routing. Terminology - 2. Terminology - 1

Scheduling Shop Scheduling. Tim Nieberg

External Sorting. Chapter 13. Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke 1

Using Lexical Similarity in Handwritten Word Recognition

Introduction to Scheduling Theory

A Business Process Driven Approach for Generating Software Modules

A new binary floating-point division algorithm and its software implementation on the ST231 processor

Nucleotides and Nucleic Acids

Approximation Algorithms

Lempel-Ziv Factorization: LZ77 without Window

Learning Outcomes. COMP202 Complexity of Algorithms. Binary Search Trees and Other Search Trees

ONLINE DEGREE-BOUNDED STEINER NETWORK DESIGN. Sina Dehghani Saeed Seddighin Ali Shafahi Fall 2015

Generating models of a matched formula with a polynomial delay

Scheduling Single Machine Scheduling. Tim Nieberg

Scan-Line Fill. Scan-Line Algorithm. Sort by scan line Fill each span vertex order generated by vertex list

Euclidean Minimum Spanning Trees Based on Well Separated Pair Decompositions Chaojun Li. Advised by: Dave Mount. May 22, 2014

Transcription:

parallel algorithm for the extraction of structured motifs lexandra arvalho MEI 2002/03 omputação em Sistemas Distribuídos 2003 p.1/27

Plan of the talk Biological model of regulation Nucleic acids: DN and RN lassification of living organisms: Prokaryotes and Eukaryotes ranscription and ranslation Promoter and Regulatory Sequences omputational model of regulation Suffix tree and generalized suffix tree Single models extraction [M.-F. Sagot, Latin, 1998] Structured models extraction [L. Marsan and M.-F. Sagot, J. omputational Biology, 2000] Parallelization [. arvalho,. Freitas,. Oliveira and M.-F. Sagot, submitted, 2003] he PRIION UP O ε problem he Simpleut algorithm he tree partition problem he PSMILE algorithm Experimental results omputação em Sistemas Distribuídos 2003 p.2/27

Nucleic acids Nucleotides: storage and retrieval of biological information building blocks for the construction of nucleic acids omputação em Sistemas Distribuídos 2003 p.3/27

Nucleic acids Nucleotides: storage and retrieval of biological information building blocks for the construction of nucleic acids wo main types of nucleic acids: DN - DeoxyriboNucleic cid: contain the bases,,, and double-stranded molecule RN - RiboNucleic cid: contain the bases,,, and U single-stranded molecule omputação em Sistemas Distribuídos 2003 p.3/27

lassification of living organisms Prokaryotes: reek words: pro "before"and karyon "nucleus" bacteria and prokaryotes are generally used interchangeably most prokaryotes live as single-celled organisms Eukaryotes: reek words: eu "well"and karyon "nucleus" yeast is an eukaryotic single-celled organism omputação em Sistemas Distribuídos 2003 p.4/27

ranscription and ranslation omputação em Sistemas Distribuídos 2003 p.5/27

Promoter and Regulatory Sequences omputação em Sistemas Distribuídos 2003 p.6/27

Structured motifs Definition. model model is an element in Σ +. Definition. structured model structured model is a pair (m, d) where: m = (m i ) 1 i p, denoting the p boxes d = (d mini, d maxi, δ i ) 1 i p 1, denoting the p 1 intervals of distance omputação em Sistemas Distribuídos 2003 p.7/27

Structured motifs Definition. model model is an element in Σ +. Definition. structured model structured model is a pair (m, d) where: m = (m i ) 1 i p, denoting the p boxes d = (d mini, d maxi, δ i ) 1 i p 1, denoting the p 1 intervals of distance m1 e1 subst. [d1min,d1max] m2 e2 subst. mp ep subst. k1 k2 kp p boxes n attempt to model the combinatorics of regulation omputação em Sistemas Distribuídos 2003 p.7/27

Structured motifs Definition. e-occurrence model m e-occurs in the input sequences if exists u in the input sequences such that HammingDistance(m, u) e (minimum number of substitutions to transform u into m). omputação em Sistemas Distribuídos 2003 p.8/27

Structured motifs Definition. e-occurrence model m e-occurs in the input sequences if exists u in the input sequences such that HammingDistance(m, u) e (minimum number of substitutions to transform u into m). Definition. valid model, quorum model is valid if e-occurs in at least q input sequences, where q is called the quorum. omputação em Sistemas Distribuídos 2003 p.8/27

Structured motifs Definition. e-occurrence model m e-occurs in the input sequences if exists u in the input sequences such that HammingDistance(m, u) e (minimum number of substitutions to transform u into m). Definition. valid model, quorum model is valid if e-occurs in at least q input sequences, where q is called the quorum. e1=2 e2=1 [16,18] k1=[6,8] k2=[6,8] q=3 omputação em Sistemas Distribuídos 2003 p.8/27

Structured motifs Definition. e-occurrence model m e-occurs in the input sequences if exists u in the input sequences such that HammingDistance(m, u) e (minimum number of substitutions to transform u into m). Definition. valid model, quorum model is valid if e-occurs in at least q input sequences, where q is called the quorum. e1=2 e2=1 [16,18] k1=[6,8] k2=[6,8] 17 16 too far 17 q=3 omputação em Sistemas Distribuídos 2003 p.8/27

Input sequences >strand + guab inositol-monophosphate dehydrogenas >strand - yaa yaa >strand + yaaj similar to hypothetical proteins >strand - yaai similar to isochorismatase >strand + mets methionyl-trn synthetase omputação em Sistemas Distribuídos 2003 p.9/27

Suffix ree Definition. Suffix tree suffix tree of a n-character string S is a rooted directed tree with exactly n leaves: leaves are numbered 1 to n each internal node has at least two children each edge is labeled with a nonempty substring of S no two edges out of a node can have edge-labels beginning with the same character he key feature of the suffix tree is that for any leaf i, the label of the path from the root to the leaf i exactly spells out the suffix of S that starts at position i. Weiner, IEEE Symposium on Switching and utomata heory, 1973 Ukkonen, lgorithmica, 1995 heorem. Suffix trees can be built in linear-time. omputação em Sistemas Distribuídos 2003 p.10/27

Suffix ree Suffix tree for the string omputação em Sistemas Distribuídos 2003 p.11/27

Suffix ree Suffix tree for the string omputação em Sistemas Distribuídos 2003 p.11/27

Suffix ree Suffix tree for the string omputação em Sistemas Distribuídos 2003 p.11/27

Suffix ree Suffix tree for the string omputação em Sistemas Distribuídos 2003 p.11/27

Suffix ree Suffix tree for the string omputação em Sistemas Distribuídos 2003 p.11/27

Suffix ree Suffix tree for the string omputação em Sistemas Distribuídos 2003 p.11/27

Suffix ree Suffix tree for the string omputação em Sistemas Distribuídos 2003 p.11/27

eneralized Suffix ree Suffix tree for the strings and # omputação em Sistemas Distribuídos 2003 p.12/27

eneralized Suffix ree Suffix tree for the strings and # omputação em Sistemas Distribuídos 2003 p.12/27

eneralized Suffix ree Suffix tree for the strings and # # omputação em Sistemas Distribuídos 2003 p.12/27

eneralized Suffix ree Suffix tree for the strings and # # # omputação em Sistemas Distribuídos 2003 p.12/27

eneralized Suffix ree Suffix tree for the strings and # # # # omputação em Sistemas Distribuídos 2003 p.12/27

eneralized Suffix ree Suffix tree for the strings and # # # # # omputação em Sistemas Distribuídos 2003 p.12/27

eneralized Suffix ree Suffix tree for the strings and # # # # # # omputação em Sistemas Distribuídos 2003 p.12/27

eneralized Suffix ree Suffix tree for the strings and #,# # # # # # omputação em Sistemas Distribuídos 2003 p.12/27

eneralized Suffix ree Suffix tree for the strings and #,# # # # # #,# omputação em Sistemas Distribuídos 2003 p.12/27

eneralized Suffix ree with olors Suffix tree for the strings and # [0,1] #,#,# # [0,1] [0,1] [1,0] [1,0] # [0,1] # [1,0] [1,0] [0,1] [1,0] # [0,1] [ ]: bit vectors called olors omputação em Sistemas Distribuídos 2003 p.13/27

Extraction of Single Models Definition. e-node-occurrence e-node-occurrence of a model m is represented by a pair (v, e v ) where v is a tree node and e v e is the Hamming distance between the label of the path from the root to v and m. Notation. ν(e, k) he number of distinct words at Hamming distance at most e from a k-long word: ν(e, k) = e k i i=0 ( Σ 1) i k e Σ e. Notation. n k he number of tree nodes at depth k of a suffix tree. omputação em Sistemas Distribuídos 2003 p.14/27

Extraction of Single Models M.-F. Sagot, Latin, 1998 k = 2 e = 1 q = 2 Input sequences: and [1,0] [0,1] [1,0] omputação em Sistemas Distribuídos 2003 p.15/27

Extraction of Single Models M.-F. Sagot, Latin, 1998 k = 2 e = 1 q = 2 Input sequences: and [1,0] [0,1] [1,0] (,0); (,1); (,1) omputação em Sistemas Distribuídos 2003 p.15/27

Extraction of Single Models M.-F. Sagot, Latin, 1998 k = 2 e = 1 q = 2 Input sequences: and [1,0] [0,1] [1,0] (,0); (,1); (,1) (,1); (,1) omputação em Sistemas Distribuídos 2003 p.15/27

Extraction of Single Models M.-F. Sagot, Latin, 1998 k = 2 e = 1 q = 2 Input sequences: and [1,0] [0,1] [1,0] (,0); (,1); (,1) omputação em Sistemas Distribuídos 2003 p.15/27

Extraction of Single Models M.-F. Sagot, Latin, 1998 k = 2 e = 1 q = 2 Input sequences: and [1,0] [0,1] [1,0] (,0); (,1); (,1) (,0); (,1) omputação em Sistemas Distribuídos 2003 p.15/27

Extraction of Single Models M.-F. Sagot, Latin, 1998 k = 2 e = 1 q = 2 Input sequences: and (,0) (,1) [1,0] [0,1] [1,0] (,0); (,1); (,1) omputação em Sistemas Distribuídos 2003 p.15/27

Extraction of Single Models M.-F. Sagot, Latin, 1998 k = 2 e = 1 q = 2 Input sequences: and (,0) (,1) [1,0] [0,1] [1,0] (,0); (,1); (,1) (,1) omputação em Sistemas Distribuídos 2003 p.15/27

Extraction of Single Models M.-F. Sagot, Latin, 1998 k = 2 e = 1 q = 2 Input sequences: and (,0) (,1) [1,0] [0,1] [1,0] (,0); (,1); (,1) omputação em Sistemas Distribuídos 2003 p.15/27

Extraction of Single Models M.-F. Sagot, Latin, 1998 k = 2 e = 1 q = 2 Input sequences: and (,0) (,1) [1,0] [0,1] [1,0] (,0); (,1); (,1) (,1) omputação em Sistemas Distribuídos 2003 p.15/27

Extraction of Single Models M.-F. Sagot, Latin, 1998 k = 2 e = 1 q = 2 Input sequences: and (,0) (,1) [1,0] [0,1] [1,0] (,0); (,1); (,1) omputação em Sistemas Distribuídos 2003 p.15/27

Extraction of Single Models M.-F. Sagot, Latin, 1998 k = 2 e = 1 q = 2 Input sequences: and (,0) (,1) [1,0] [0,1] [1,0] omputação em Sistemas Distribuídos 2003 p.15/27

Extraction of Single Models M.-F. Sagot, Latin, 1998 k = 2 e = 1 q = 2 Input sequences: and (,0) (,1) [1,0] [0,1] [1,0] (,1); (,0); (,1) omputação em Sistemas Distribuídos 2003 p.15/27

Extraction of Single Models M.-F. Sagot, Latin, 1998 k = 2 e = 1 q = 2 Input sequences: and (,0) (,1) [1,0] [0,1] [1,0] (,1); (,0); (,1) (,1); (,1) omputação em Sistemas Distribuídos 2003 p.15/27

Extraction of Single Models M.-F. Sagot, Latin, 1998 k = 2 e = 1 q = 2 Input sequences: and (,0) (,1) (,1) (,1) [1,0] [0,1] [1,0] (,1); (,0); (,1) omputação em Sistemas Distribuídos 2003 p.15/27

Extraction of Single Models M.-F. Sagot, Latin, 1998 k = 2 e = 1 q = 2 Input sequences: and (,0) (,1) (,1) (,1) [1,0] [0,1] [1,0] (,1); (,0); (,1) (,1); (,0) omputação em Sistemas Distribuídos 2003 p.15/27

Extraction of Single Models M.-F. Sagot, Latin, 1998 k = 2 e = 1 q = 2 Input sequences: and (,0) (,1) (,1) (,1) (,1) (,0) [1,0] [0,1] [1,0] (,1); (,0); (,1) omputação em Sistemas Distribuídos 2003 p.15/27

Extraction of Single Models M.-F. Sagot, Latin, 1998 k = 2 e = 1 q = 2 Input sequences: and (,0) (,1) (,1) (,1) (,1) (,0) [1,0] [0,1] [1,0] (,1); (,0); (,1) (,1) omputação em Sistemas Distribuídos 2003 p.15/27

Extraction of Single Models M.-F. Sagot, Latin, 1998 k = 2 e = 1 q = 2 Input sequences: and (,0) (,1) (,1) (,1) (,1) (,0) [1,0] [0,1] [1,0] (,1); (,0); (,1) omputação em Sistemas Distribuídos 2003 p.15/27

Extraction of Single Models M.-F. Sagot, Latin, 1998 k = 2 e = 1 q = 2 Input sequences: and (,0) (,1) (,1) (,1) (,1) (,0) [1,0] [0,1] [1,0] (,1); (,0); (,1) (,1) omputação em Sistemas Distribuídos 2003 p.15/27

Extraction of Single Models M.-F. Sagot, Latin, 1998 k = 2 e = 1 q = 2 Input sequences: and (,0) (,1) (,1) (,1) (,1) (,0) [1,0] [0,1] [1,0] (,1); (,0); (,1) omputação em Sistemas Distribuídos 2003 p.15/27

Extraction of Single Models M.-F. Sagot, Latin, 1998 k = 2 e = 1 q = 2 Input sequences: and (,0) (,1) (,1) (,1) (,1) (,0) [1,0] [0,1] [1,0] omputação em Sistemas Distribuídos 2003 p.15/27

Extraction of Single Models M.-F. Sagot, Latin, 1998 k = 2 e = 1 q = 2 Input sequences: and (,0) (,1) (,1) (,1) (,1) (,0) (,1) (,1) (,1) (,1) [1,0] [0,1] [1,0] omputação em Sistemas Distribuídos 2003 p.15/27

Extraction of Single Models M.-F. Sagot, Latin, 1998 k = 2 e = 1 q = 2 Input sequences: and (,0) (,1) (,1) (,1) (,1) (,0) (,1) (,1) (,1) (,1) [1,0] [0,1] [1,0] Proposition. he single motifs extraction takes O(Nn k ν(e, k)) time. omputação em Sistemas Distribuídos 2003 p.15/27

Extraction of Structured Models: SMILE L. Marsan and M.-F. Sagot, Journal of omputational Biology, 2000 ExtractModels(Model m, Block i) 1. for each node-occurrence v of m 2. if (i > 1) 3. put in P otencialstarts the children of v at levels (i 1)k + (i 1)d mini 1 to (i 1)k + (i 1)d maxi 1 4. else 5. put v in P otencialstarts 6. for each model m i obtained by doing a recursive depth-first traversal from the root of the virtual model tree M while simultaneously traversing from the node-occurrences in P otencialstarts 7. if (i < p) 8. ExtractModels(m = m 1... m i,i + 1) 9. else 10. KeepModel( (m 1,..., m p ), ((d min1, d max1 ),..., (d minp, d maxp )) ) omputação em Sistemas Distribuídos 2003 p.16/27

Extraction of Structured Models: SMILE L. Marsan and M.-F. Sagot, Journal of omputational Biology, 2000 p = 2 k 1 = 2, d = 1, k 2 = 2 e 1 = 1, e 2 = 1 q = 2 Input sequences: and [0,1] [1,0] [0,1] [1,0] omputação em Sistemas Distribuídos 2003 p.17/27

Extraction of Structured Models: SMILE L. Marsan and M.-F. Sagot, Journal of omputational Biology, 2000 p = 2 k 1 = 2, d = 1, k 2 = 2 e 1 = 1, e 2 = 1 q = 2 Input sequences: and [0,1] [1,0] [0,1] [1,0] omputação em Sistemas Distribuídos 2003 p.17/27

Extraction of Structured Models: SMILE L. Marsan and M.-F. Sagot, Journal of omputational Biology, 2000 p = 2 k 1 = 2, d = 1, k 2 = 2 e 1 = 1, e 2 = 1 q = 2 Input sequences: and (,1) (,1) [0,1] [1,0] [0,1] [1,0] omputação em Sistemas Distribuídos 2003 p.17/27

Extraction of Structured Models: SMILE L. Marsan and M.-F. Sagot, Journal of omputational Biology, 2000 p = 2 k 1 = 2, d = 1, k 2 = 2 e 1 = 1, e 2 = 1 q = 2 Input sequences: and (,1) (,1) [0,1] [1,0] [0,1] [1,0] omputação em Sistemas Distribuídos 2003 p.17/27

Extraction of Structured Models: SMILE L. Marsan and M.-F. Sagot, Journal of omputational Biology, 2000 p = 2 k 1 = 2, d = 1, k 2 = 2 e 1 = 1, e 2 = 1 q = 2 Input sequences: and (,1) [0,1] [1,0] [0,1] [1,0] omputação em Sistemas Distribuídos 2003 p.17/27

Extraction of Structured Models: SMILE L. Marsan and M.-F. Sagot, Journal of omputational Biology, 2000 p = 2 k 1 = 2, d = 1, k 2 = 2 e 1 = 1, e 2 = 1 q = 2 Input sequences: and (,1) [0,1] [1,0] [0,1] [1,0] omputação em Sistemas Distribuídos 2003 p.17/27

Extraction of Structured Models: SMILE L. Marsan and M.-F. Sagot, Journal of omputational Biology, 2000 p = 2 k 1 = 2, d = 1, k 2 = 2 e 1 = 1, e 2 = 1 q = 2 Input sequences: and (,1) [0,1] [1,0] [0,1] [1,0] Proposition. he structured motifs extraction takes O(Nn pk+(p 1)dmax ν p (e, k)) time. omputação em Sistemas Distribuídos 2003 p.17/27

PRIION UP O ε PRIION UP O ε problem: l gold bars w i 0 is the the weight of the ith gold bar any gold bar can be cut in c equal parts Optimization version: he problem is how to share the gold between r persons, with the minimum number of gold bars z, in such a way that each person gets the same share of gold up to some weight ε > 0. Decision version: he problem is to decide whether it is possible to share the gold between r persons, with z gold bars, in such a way that each person gets the same share of gold up to some weight ε 0. Proposition. he PRIION UP O ε problem is NP-complete in the strong sense. omputação em Sistemas Distribuídos 2003 p.18/27

PRIION UP O ε Simpleut(Partition i, oldbars l, Persons r, Weights w j, utfactor c, WorkOverload ε) 1. find the smallest t such that max w j c t ε 2. for each j {1,..., l} [ j 1 3. let V j = k=1 w k c t, ) j k=1 w k c t 4. let w = l j=1 w j 5. let γ = w c t mod r 6. let δ = w ct r 7. let I i [(i 1)(δ + 1), i(δ + 1) ) = [γ(δ + 1) + (i (γ + 1))δ, γ(δ + 1) + (i γ)δ ) for all i γ otherwise 8. transform I i = [a, b) into I i = [f(a), f(b)) with f : w c t l c t : f(x) = (j 1) c t + x inf(v j ) w j l c t for all x V j if x = w c t omputação em Sistemas Distribuídos 2003 p.19/27

PRIION UP O ε j 1 2 w j 2 1 r = 3 ε = 1 omputação em Sistemas Distribuídos 2003 p.20/27

PRIION UP O ε j 1 2 w j 2 1 r = 3 ε = 1 t = 1 1. find the smallest t such that max w j c t ε omputação em Sistemas Distribuídos 2003 p.20/27

PRIION UP O ε j 1 2 w j 2 1 r = 3 ε = 1 t = 1 2. for each j[ 1,..., l j 1 3. V j = k=1 w k c t, ) j k=1 w k c t V1=[0,4) V2=[4,6) omputação em Sistemas Distribuídos 2003 p.20/27

PRIION UP O ε j 1 2 w j 2 1 r = 3 ε = 1 t = 1 w = 3 γ = 0 δ = 2 4. w = l j=1 w j 5. γ = w c t mod r 6. δ = w ct r V1=[0,4) V2=[4,6) omputação em Sistemas Distribuídos 2003 p.20/27

PRIION UP O ε j 1 2 w j 2 1 { 7. I i [(i 1)(δ + 1), i(δ + 1) ) = for all i γ [γ(δ + 1) + (i (γ + 1))δ, γ(δ + 1) + (i γ)δ ) otherwise r = 3 ε = 1 t = 1 w = 3 γ = 0 δ = 2 V1=[0,4) V2=[4,6) I 1=[0,2) I 2=[2,4) I 3=[4,6) omputação em Sistemas Distribuídos 2003 p.20/27

PRIION UP O ε j 1 2 w j 2 1 r = 3 ε = 1 t = 1 w = 3 γ = 0 δ = 2 8. transform I i = [a, b) into I i = [f(a), f(b)) with { (j 1) c t + x inf(v j ) for all x V f(x) = w j j l c t if x = w c t V1=[0,4) V2=[4,6) I 1=[0,2) I 2=[2,4) I 3=[4,6) I1=[0,1) I2=[1,2) I3=[2,4) omputação em Sistemas Distribuídos 2003 p.20/27

PRIION UP O ε Proposition. he Simpleut algorithm requires O(l) time. Proposition. he Simpleut algorithm has a ratio bound ρ(l, r, (w i ) 1 i l, c, ε) = O( max w i ε ). omputação em Sistemas Distribuídos 2003 p.21/27

Parallelization Reducing the tree partition problem to the PRIION UP O ε problem Input of the Simpleut algorithm for the ith grid node: l = Σ r matches the number of grid nodes w j of each symbol of the alphabet is obtained by scanning the input sequences c = Σ ε is an user parameter omputação em Sistemas Distribuídos 2003 p.22/27

Parallelization Reducing the tree partition problem to the PRIION UP O ε problem Input of the Simpleut algorithm for the ith grid node: l = Σ r matches the number of grid nodes w j of each symbol of the alphabet is obtained by scanning the input sequences c = Σ ε is an user parameter Output of the Simpleut algorithm for the ith grid node: the number t of cuts gives the depth t + 1 of the tree where the partition is defined an interval I i corresponding to tree nodes at depth t + 1 assigned to the ith grid node omputação em Sistemas Distribuídos 2003 p.22/27

Parallelization Reducing the tree partition problem to the PRIION UP O ε problem Input of the Simpleut algorithm for the ith grid node: l = Σ r matches the number of grid nodes w j of each symbol of the alphabet is obtained by scanning the input sequences c = Σ ε is an user parameter Output of the Simpleut algorithm for the ith grid node: the number t of cuts gives the depth t + 1 of the tree where the partition is defined an interval I i corresponding to tree nodes at depth t + 1 assigned to the ith grid node Simpleut(Partition i, lphabetsize l, ridnodes r, Weights w j, lphabetsize c, WorkOverload ε) 1. find the smallest t such that max w j c t 2. let t = min(depth(m) - 1,t ) ε omputação em Sistemas Distribuídos 2003 p.22/27

Parallelization j 1 2 3 4 σ j w j 2 1 1 2 r = 5 ε = 1 omputação em Sistemas Distribuídos 2003 p.23/27

Parallelization j 1 2 3 4 σ j w j 2 1 1 2 r = 5 ε = 1 t = 1 t = 1 1. find the smallest t such that max w j 2. t = min(depth(m) - 1,t ) c t ε omputação em Sistemas Distribuídos 2003 p.23/27

Parallelization j 1 2 3 4 σ j w j 2 1 1 2 3. for each j[ 1,..., l j 1 4. V j = k=1 w k c t, ) j k=1 w k c t r = 5 ε = 1 t = 1 t = 1 V1=[0,8) V2=[8,12) V3=[12,16) V4=[16,24) omputação em Sistemas Distribuídos 2003 p.23/27

Parallelization j 1 2 3 4 σ j w j 2 1 1 2 r = 5 ε = 1 t = 1 t = 1 5. w = l j=1 w j 6. γ = w c t mod r 7. δ = w ct r V1=[0,8) V2=[8,12) V3=[12,16) V4=[16,24) omputação em Sistemas Distribuídos 2003 p.23/27

Parallelization j 1 2 3 4 σ j w j 2 1 1 2 { 8. I i [(i 1)(δ + 1), i(δ + 1) ) = for all i γ [γ(δ + 1) + (i (γ + 1))δ, γ(δ + 1) + (i γ)δ ) otherwise r = 5 ε = 1 t = 1 t = 1 w = 6 γ = 4 δ = 4 V1=[0,8) V2=[8,12) V3=[12,16) V4=[16,24) I 1=[0,5) I 2=[5,10) I 3=[10,15) I 4=[15,20) I 5=[20,24) omputação em Sistemas Distribuídos 2003 p.23/27

Parallelization j 1 2 3 4 σ j w j 2 1 1 2 r = 5 ε = 1 t = 1 t = 1 w = 6 γ = 4 δ = 4 9. transform I i = [a, b) into I i = [f(a), f(b)) with { (j 1) c t + x inf(v j ) for all x V f(x) = w j j l c t if x = w c t V1=[0,8) V2=[8,12) V3=[12,16) V4=[16,24) I 1=[0,5) I 2=[5,10) I 3=[10,15) I 4=[15,20) I 5=[20,24) I1=[0,2) I2=[2,6) I3=[6,11) I4=[11,14) I5=[14,16) omputação em Sistemas Distribuídos 2003 p.23/27

Parallelization PExtractModels(Model m, Block i, PartitionSet I i of M) 1. for each node-occurrence v of m 2. if (i > 1) 3. put in P otencialstarts the children of v at levels (i 1)k + (i 1)d mini 1 to (i 1)k + (i 1)d maxi 1 4. else 5. put v in P otencialstarts 6. for each model m i I i obtained by doing a recursive depth-first traversal from the root of the virtual model tree M while simultaneously traversing from the node-occurrences in P otencialstarts 7. if (i < p) 8. PExtractModels(m = m 1... m i,i + 1, I i ) 9. else 10. KeepModel( (m 1,..., m p ), ((d min1, d max1 ),..., (d minp, d maxp )) ) omputação em Sistemas Distribuídos 2003 p.24/27

Parallelization PSmile(ridNode i, WorkOverload ε) 1. compute weights (w i ) 1 i Σ ; 2. build suffix tree ; 3. create colors on ; 4. let I i = Simpleut(i, Σ, r, (w i ) 1 i Σ, Σ, ε); 5. call PExtractModels(, I i ); Proposition. ssume Σ fixed and w i = 1 for 1 i Σ. he parallel algorithm PSmile is work-efficient with respect to the sequential version when r = O(ν p 2 (e, k)) and ε w 1 r. omputação em Sistemas Distribuídos 2003 p.25/27

Parallelization Experimental results 2 boxes 3 boxes models time (sec) models time (sec) grid node 1 2 155.83 9987 444.50 grid node 2 1 168.11 6178 385.28 grid node 3 2 245.35 3108 473.70 grid node 4 16 262.51 15884 581.64 total 21 831.80 35157 1885.18 parallel time 262.51 581.64 sequential time 757.97 1790.70 speed up 2.9 3.1 omputação em Sistemas Distribuídos 2003 p.26/27

On going and future work Implementation of a more efficient sequential algorithm to extract structured models [L. Marsan and M.-F. Sagot, J. omputational Biology, 2000] Establishing an even more efficient algorithm to extract structured models [. arvalho,. Freitas,. Oliveira and M.-F. Sagot, in preparation, 2003] omparison between algorithms which attempts to model the combinatorics of regulation omputação em Sistemas Distribuídos 2003 p.27/27