A formal model for databases in DNA

Joris J.M. Gillis and Jan Van den Bussche
Hasselt University and transnational University of Limburg

Abstract

Our goal is to better understand, at a theoretical level, the database aspects of DNA computing. Thereto, we introduce a formally defined data model of so-called sticker DNA complexes, suitable for the representation and manipulation of structured data in DNA. We also define DNAQL, a restricted programming language over sticker DNA complexes. DNAQL stands to general DNA computing as the standard relational algebra for relational databases stands to general-purpose conventional computing. The number of operations performed during the execution of a DNAQL program, on any input, is only polynomial in the dimension of the data, i.e., the number of bits needed to represent a single data entry. Moreover, each operation can be implemented in DNA using a constant number of laboratory steps. We prove that the relational algebra can be simulated in DNAQL.

1 Introduction

In DNA computing [16, 3], data are represented using synthetic DNA molecules in vitro. Operations on data are performed by biotechnological manipulations of DNA that are based on DNA self-assembly (Watson-Crick base pairing) or on explicit effects upon DNA by specific enzymes. In the original approach to DNA computing, which we could call the Adleman model [2, 5, 18], one uses a more or less standard repertoire of operations on DNA, where each operation corresponds to a fixed number of steps in the laboratory. (These steps could be performed by a human or by a robot.) In more recent years, research in DNA computing is largely focusing on the goal to let an entire computation happen by self-assembly alone, without (or with minimal) outside intervention, e.g., [23, 22, 11]. Whereas the pure self-assembly model is very attractive, it is harder to achieve in practice, and indeed this is the subject of a lot of current research in the area of DNA nanotechnology. Meanwhile, the original Adleman model deserves further study, and in this paper we have a renewed look at the Adleman model, specifically from the perspective of databases.
Indeed, DNA computing is very attractive from the database perspective: the nanoscale and robustness of DNA molecules are promising from a data storage point of view, and the highly parallel operations of DNA computing correspond well with the bulk data processing nature typical of database query processing [1].

Most earlier theoretical work on the possibilities of DNA computing has focused either on the ability to mimic classical models of computation such as finite automata or Turing machines, or on the relationship with parallel computation, but this always from a general-purpose computing perspective. In contrast, in database theory, one considers restricted models of computation, limited in computational power but still with sufficient expressiveness for structured database manipulation. The classical model is the relational algebra for relational databases [1]. This algebra consists of six operations on relations (database tables): union; difference; cartesian product; selection; projection; and renaming. These operations can be composed to form expressions. These express database queries, and the relational algebra can express precisely those database queries that can be defined in first-order logic, thus providing a well-delineated restriction in computational power. The benefit of restricted computational models is that they facilitate the identification of optimisation strategies for more efficient processing; hence there exists a large body of techniques for database query processing, e.g., [17]. From the point of view of theoretical science, an added benefit of a restricted computational model is that it allows us to study and attempt to characterise the precise computational abilities of the computational systems that are being modeled (such as relational database systems).

Motivated by the above considerations, in this paper, we want to propose a solution to the following equation:

(relational databases and relational algebra) / (general-purpose conventional computing) = ? / (DNA computing (Adleman model))

We define a formal data model of sticker complexes, which represent complexes of DNA molecules. Our complexes are general enough to serve as data structures for structured data such as found in relational databases. At the same time, however, sticker complexes are restricted so that we avoid the complications connected to the difficult secondary structure prediction problem of general DNA complexes [14]. Indeed, our main contribution consists in formally defining a well-behaved family of DNA-complex data structures, with an accompanying set of operations on these data structures that preserve the well-behavedness restrictions. We fit the operations into a first-order query language, called DNAQL, with a formal operational semantics. We thus propose the sticker complex data model, together with DNAQL, as the DNA computing analogues of the relational database model and the accompanying relational algebra. Restrictive as sticker complexes and DNAQL may be, we prove that they can still simulate the relational data model and the relational algebra. At the same time, we stress that our new DNA database model should also be appreciated in its own right as a restricted model of DNA computing specialised to database manipulation.

This paper is organised as follows. Section 2 discusses related work. Section 3 defines the data model. Section 4 introduces important operations on sticker complexes. Section 5 discusses the representation of structured data using complexes. Section 6 discusses the implementation in DNA of the operations. Section 7 defines the query language DNAQL. Section 8 presents the simulation of the relational algebra in DNAQL. We conclude in Section 9.

Footnote: Ph.D. fellow of the Research Foundation Flanders (FWO).
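The six relational algebra operations named in the introduction can be made concrete with a small sketch. The representation chosen here (a tuple as a frozenset of attribute-value pairs, a relation as a Python set of such tuples) and all function names are illustrative assumptions, not notation from this paper.

```python
from itertools import product as pairs

# Assumed representation: a tuple is a frozenset of (attribute, value) pairs,
# and a relation is a set of such tuples.
def row(**fields):
    return frozenset(fields.items())

def union(r, s):
    return r | s

def difference(r, s):
    return r - s

def cartesian_product(r, s):
    # Assumes the two relation schemas share no attribute names.
    return {t | u for t, u in pairs(r, s)}

def select_eq(r, a, b):
    # Keep the tuples whose values for attributes a and b coincide.
    return {t for t in r if dict(t)[a] == dict(t)[b]}

def project(r, attrs):
    # Standard projection onto the attribute set attrs.
    return {frozenset((k, v) for k, v in t if k in attrs) for t in r}

def rename(r, old, new):
    return {frozenset((new if k == old else k, v) for k, v in t) for t in r}

r = {row(A="00", B="01"), row(A="11", B="11")}
s = {row(C="10")}
joined = cartesian_product(select_eq(r, "A", "B"), s)
```

Here `joined` composes a selection with a cartesian product, which is the kind of expression nesting the algebra allows.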
2 Related work

Our work can be seen as a followup of Reif's original work [18] on relating DNA computing with conventional parallel computing. Indeed, Reif also formalized DNA complexes and considered similar operations. Our model specializes Reif's model to a database model. For example, it is well known [1, 12] that the data complexity of the relational algebra (first-order logic) belongs to the parallel circuit complexity class AC⁰, denoting constant-depth, polynomial-size circuits with unbounded fan-in. Likewise, DNAQL programs execute a number of operations on complexes that is largely independent of the data size, except for a polynomial dependence on the number of bits needed to represent a single data entry, a number we call the dimension of the data. Moreover, as usual for DNA computing, each operation works in parallel on the different DNA strands present in a complex, and each operation can be implemented in real DNA in a constant number of laboratory steps.

Our work also clearly fits in a recent trend in DNA computing to identify specialised computational models within the general framework of DNA computing. This trend is nicely exemplified by the work by Cardelli [6] and Majumder and Reif [15], where the specialised computational model is that of process algebras; in our work, it is that of databases.

While our work is not the first to relate the relational algebra with DNA computing, we are the first to do it formally and in detail. An abbreviated account of achieving relational algebra operations through DNA manipulation was given recently by Yamamoto et al. [26], but unfortunately that paper is too sketchy to allow any comparison with our approach. In contrast, our own methods are fully formalised, and importantly, our work identifies restrictions on DNA computing within which relational algebra simulation remains possible.

More influential to our work is the older work by Arita et al. [4] demonstrating how one can accomplish concatenation and rotation of DNA strands. Such manipulations, which involve circular DNA, are crucial in our model, and indeed were already crucial to Reif [18].

Finally, we mention the earlier works on DNA memories [19, 7], which, while having a database flavor, are primarily about supporting searching in sets of DNA strands and largely ignore the more complex operations of the relational algebra such as difference, projection, cartesian product, and renaming.

3 The sticker-complex data model

In this section we formally define a family of data structures which we call sticker complexes. They are an abstraction of complexes of DNA strands. Reif [18] already defined a similar data structure, but our definition introduces several limitations so as to avoid unrealistic or otherwise complicated and unmanageable secondary structures. The adjective "sticker" points to our restriction of hybridization to short primers (which we call negative strands) for the recognition and splicing of the strands carrying the actual data (called the positive strands).

Basically, we assume the following disjoint, finite alphabets: Λ of atomic value symbols; Ω of attribute names; and Θ = {#1, #2, #3, #4, #5, #6, #7, #8, #9} of tags. The union of these three alphabets is denoted by Σ and called the positive alphabet. Furthermore, we use a negative alphabet, denoted Σ̄, disjoint from Σ, defined as Σ̄ = {ā | a ∈ Σ}. Thus there is a bijection between Σ and Σ̄, which is called complementarity and is denoted by overlining a symbol; we set the complement of ā to be a.

We will first define pre-complexes that contain the overall structure of sticker complexes. Sticker complexes will then be defined as pre-complexes satisfying various restrictions. A pre-complex is a finite, edge-labeled directed graph where the edges represent bases in strands, and the nodes represent the endpoints between the bases in a strand. Moreover, a pre-complex is equipped with a matching, representing base pairing, and two predicates.
One predicate indicates which bases are immobilized, i.e., do not float freely and can be separated from solution in a controlled manner; the other predicate indicates which bases are blocked, i.e., cannot participate in base pairing.

Formally, a pre-complex is a 6-tuple (V, E, λ, µ, immob, blocked) such that:
1. V is a finite set of nodes,
2. E ⊆ V × V is a finite set of directed edges without self-loops,
3. λ : E → Σ ∪ Σ̄ is a total function labeling the edges,
4. µ ⊆ [E]² = {{e, e′} | e, e′ ∈ E and e ≠ e′} is a partial matching on the edges, i.e., each edge occurs in at most one pair in µ,
5. immob ⊆ E,
6. blocked ⊆ E.

Let C be a pre-complex as above. We introduce the notion of strand and component of C as follows. A strand of C is simply a connected component of the directed graph (V, E). Furthermore, we say two strands s and s′ are bonded if there exists some edge e in s and some edge e′ in s′ with {e, e′} ∈ µ. When two strands are connected (possibly indirectly) by this bonding relation, we say they belong to the same component. Thus, a component of a pre-complex is a substructure formed by a maximal set of strands connected by the bonding relation.

A sticker complex now is a pre-complex satisfying the following restrictions:

1. There are no isolated nodes, i.e., each node occurs in at least one edge.
2. Each node has at most one incoming and at most one outgoing edge. Thus, each strand has the form of a chain or a cycle.
3. The labels on a chain are homogeneous, in the sense that either all edges are labeled with positive symbols or all edges are labeled with negative symbols. Naturally, a strand with positive (negative) symbols is called a positive (negative) strand.
4. Negative strands are severely restricted: specifically, every negative strand must be a chain of one or two edges.
5. Matchings by µ can only occur between complementarily labeled edges.
6. An edge can be immobilized only if it is the sole edge of a negative strand.
7. Edges in blocked do not occur in µ.
8. Each component can contain at most one immobilized edge.

Henceforth, for simplicity, we will refer to sticker complexes simply as complexes.

We remark that the predicate blocked and the matching µ serve to abstract two different features of double-strandedness. The matching µ is used to make explicit where the stickers (short negative strands) pair with the positive strands. The predicate blocked represents longer stretches of double strands. As in the work by Rozenberg and Spaink [20], blocking is used to restrict the places where hybridization can still occur.

We also remark that it is not necessary to require that edges matched by µ run in opposite directions (in accordance with the opposite 5′-3′ and 3′-5′ directions of double-stranded DNA). This is because stickers of length one can trivially be placed in the desired direction, and stickers of length two can always fold so as to be again in the desired direction. The latter is illustrated in Figure 1.

Figure 1: On the left, a complex with two strands spelling the word ab and a complementary two-letter negative word, with the expected complementary base pairing. On the right, a complex with two such strands and folded base pairing. Dotted lines denote edges matched by µ.

Redundancy in complexes. In practice, a test tube will contain many duplicate strands, and indeed this multiplicity is typically crucial for DNA computing to work.
Accordingly, in our model, each component of a complex stands for possibly multiple occurrences. (This important issue is not addressed in Reif's formalisation of complexes [18].) In order to formalize this, we define the notions of subsumption, equivalence, redundant extension, and minimality.

A complex C′ is said to subsume a complex C if for each component D of C, there exists an isomorphic component D′ in C′. Two complexes C and C′ are said to be equivalent if they subsume each other. When C′ is equivalent to C and an extension of C, we call C′ a redundant extension of C.
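The definitions of pre-complex, strand, component and the eight sticker-complex restrictions of Section 3 can be sketched in code as follows. This is a minimal illustration under stated assumptions: edges are encoded as (source, target) pairs, a label prefixed with "~" stands for the complement of a symbol, and the class and function names are hypothetical. The check applies restriction 3 to every strand, and it does not verify the pre-complex conditions themselves (no self-loops, µ a partial matching).

```python
from dataclasses import dataclass, field

@dataclass
class PreComplex:
    labels: dict                                 # edge (u, v) -> label; "~a" is the complement of "a"
    matching: set = field(default_factory=set)   # set of frozenset({edge, edge})
    immob: set = field(default_factory=set)
    blocked: set = field(default_factory=set)

def is_negative(label):
    return label.startswith("~")

def complement(label):
    return label[1:] if is_negative(label) else "~" + label

def strands(c):
    """Strands are the connected components of the graph (V, E)."""
    parent = {}
    def find(v):
        parent.setdefault(v, v)
        while parent[v] != v:
            v = parent[v]
        return v
    for u, v in c.labels:
        parent[find(u)] = find(v)
    groups = {}
    for e in c.labels:
        groups.setdefault(find(e[0]), set()).add(e)
    return list(groups.values())

def components(c):
    """Merge strands that are bonded, directly or indirectly, through the matching."""
    comps = strands(c)
    for pair in c.matching:
        hit = [k for k in comps if k & pair]
        if len(hit) == 2:
            comps = [k for k in comps if k is not hit[0] and k is not hit[1]]
            comps.append(hit[0] | hit[1])
    return comps

def is_sticker_complex(c):
    # Restriction 1 holds by construction: nodes exist only as edge endpoints.
    sources = [u for u, _ in c.labels]
    targets = [v for _, v in c.labels]
    if len(set(sources)) < len(sources) or len(set(targets)) < len(targets):
        return False                                          # restriction 2
    for s in strands(c):
        signs = {is_negative(c.labels[e]) for e in s}
        if len(signs) != 1:
            return False                                      # restriction 3
        nodes = {n for e in s for n in e}
        is_chain = len(nodes) == len(s) + 1
        if signs == {True} and not (is_chain and len(s) <= 2):
            return False                                      # restriction 4
        if s & c.immob and not (signs == {True} and len(s) == 1):
            return False                                      # restriction 6
    for pair in c.matching:
        e1, e2 = tuple(pair)
        if c.labels[e1] != complement(c.labels[e2]):
            return False                                      # restriction 5
        if pair & c.blocked:
            return False                                      # restriction 7
    return all(len(k & c.immob) <= 1 for k in components(c))  # restriction 8

# A positive strand "a b" held by a two-edge sticker "~a ~b".
example = PreComplex(
    labels={(0, 1): "a", (1, 2): "b", (3, 4): "~a", (4, 5): "~b"},
    matching={frozenset({(0, 1), (3, 4)}), frozenset({(1, 2), (4, 5)})},
)
```

In this example the two strands are bonded through the matching and therefore form a single component.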

A component D of a complex C is called redundant if some other component of C is isomorphic to D. Note that removing a redundant component from C yields a complex that is still equivalent to C. A complex that has no redundant components is called minimal. Naturally, each complex C has a unique (up to isomorphism) minimal complex that is equivalent to C; we call it the minimization of C.

4 Operations on complexes

In this section, we formally define a set of operations on complexes that are rather standard in the DNA computing literature, except perhaps the difference. What is interesting, however, is that we have defined sticker complexes in such a way that each operation always results in a sticker complex when applied to sticker complexes. Moreover, the difference operation imposes additional restrictions on its input so as to guarantee effective implementability in real DNA (discussed in Section 6).

As a general proviso, in the following definitions, a final minimization step should always be applied to the result so as to obtain a mathematically deterministic operation. In the following definitions we keep this implicit so as not to clutter up the presentation. Also, it is understood that the result of each operation is defined up to isomorphism.

Union. Let C₁ = (V₁, E₁, λ₁, µ₁, immob₁, blocked₁) and C₂ = (V₂, E₂, λ₂, µ₂, immob₂, blocked₂) be two complexes. W.l.o.g. we assume that V₁ and V₂ are disjoint. Then the union C₁ ∪ C₂ equals (V₁ ∪ V₂, E₁ ∪ E₂, λ₁ ∪ λ₂, µ₁ ∪ µ₂, immob₁ ∪ immob₂, blocked₁ ∪ blocked₂).

Difference. Let C₁ and C₂ be two complexes that satisfy the following conditions:
1. µ₁ = immob₁ = blocked₁ = ∅ = µ₂ = immob₂ = blocked₂, i.e., all components in C₁ and C₂ are single strands.
2. All strands of C₁ and C₂ are positive, noncircular, and all have the same length.
3. Each strand of C₂ ends with #4 and does not contain #5.
Then the difference C₁ − C₂ equals the union of all strands in C₁ that do not have an isomorphic copy in C₂. If C₁ and C₂ do not satisfy the above conditions then C₁ − C₂ is undefined.

Hybridize.
Let C = (V, E, λ, µ, immob, blocked) and C′ = (V′, E′, λ′, µ′, immob′, blocked′) be two complexes. We say that C′ is a hybridization extension of C if V = V′, E = E′, λ = λ′, immob = immob′, blocked = blocked′ and µ′ is an extension of µ. Beware that a hybridization extension must satisfy all conditions from the definition of sticker complex. A complex C is said to have maximal matching if the only hybridization extension of C is C itself.

The notion of hybridization extension is not sufficient, however, since we want to allow duplicate copies of components in C to participate in hybridization. (This important issue is glossed over in Reif's formalisation [18].) To formalize this behavior, let us call C″ (with matching µ″) a multiplying hybridization extension (MHE) of C if C″ is a hybridization extension, with maximal matching, of some redundant extension C′ of C. Moreover, we call a component D of an MHE unfinished if there exists another MHE in which D occurs bonded within a larger component. We then call an MHE saturated if it has no unfinished components. This is illustrated in Figure 2.

Finally we say that C has recursion-free hybridization if there exists only a finite number of saturated hybridization extensions of C.

Figure 2: Left: a complex C; top right: a hybridization extension of C with maximal matching, but not saturated in view of the MHE of C shown bottom right; that MHE is saturated. Dotted lines denote edges matched by µ.

Figure 3: A complex (top) and one of its MHEs (bottom). Dotted lines denote edges matched by µ. Note that the MHE forms a ring structure.

On the other hand, we do not want hybridization to go off into an uncontrolled chain reaction. Indeed, our very goal in this paper is to explore a first-order or recursion-free version of DNA computing, in line with the first-order nature of the relational algebra [1]. Thus we want to stay away from recursive self-assembly DNA computations. Formally, we want to rule out the situations where there are infinitely many possible non-equivalent MHEs. Such situations are very well possible. Consider, for a simple example, the complex C consisting of two non-circular strands spelling out the words ab and āb̄. Taking n copies of ab and n copies of āb̄, we can form arbitrarily long non-equivalent MHEs of C. An illustration for n = 3 is given in Figure 3.

Formally, we say that C has recursion-free hybridization if there are only finitely many saturated MHEs of C. If this is the case, we define hybridize(C) to equal the disjoint union of all saturated MHEs of C. If C does not have recursion-free hybridization, we consider hybridize(C) to be undefined. For example, it can be verified that the complex from Figure 2 has recursion-free hybridization.

Ligate. The ligate operator concatenates strands that are held together by a sticker. Formally, define a gap as a set of four edges {e₁, e₂, e₃, e₄} such that {e₁, e₄} ∈ µ; {e₂, e₃} ∈ µ; e₁ and e₂ (in that order) are consecutive edges on a negative strand; e₃ is the last edge on its (positive) strand; and e₄ is the first edge on its (positive) strand. By filling a gap we mean modifying the complex so that the endnode of e₃ and the startnode of e₄ are identified. We now define ligate(C) as the complex obtained from C by filling all gaps.

Flush. Quite simply, flush(C) equals the complex obtained from C by removing all components that do not contain an immobilized edge.

Split. Consider a node u in some complex C. By splitting C at u, we mean the following. If u has an incoming (outgoing) edge, denote it by e₁ (e₂). If both e₁ and e₂ exist, then replace u by two nodes u₁ and u₂, letting e₁ arrive in u₁, and letting e₂ start in u₂. Furthermore, if there exists a node u′ with incoming edge e₄ and outgoing edge e₃, such that {e₁, e₃} ∈ µ or {e₂, e₄} ∈ µ, then u′ is also split in an analogous manner.

Also, an edge is called interacting if it neither occurs in blocked nor in µ. Now consider the set of triples shown in Table 1. Each triple is called a splitpoint and has the form (label, interacting, place). By splitting C at such a splitpoint, we mean splitting C at all startnodes (if place is "before") or endnodes (otherwise) of edges labeled label, on condition that the edge is interacting (or noninteracting, depending on the boolean value interacting). The result is denoted by split(C, label).

Table 1: The allowed split points.

Label   Interacting   Place
#2      false         before
#3      false         before
#4      false         after
#6      true          after
#8      true          before

Blocking. There are two blocking operations.
Here we assume that C is saturated in the sense that C is equivalent to hybridize(C); if this condition is not satisfied then the blocking operations on C are considered to be undefined. The simplest operation is block(C, σ), for any σ ∈ Σ, which equals the complex obtained from C by adding all edges labeled σ to blocked.

For the other operation, again let σ ∈ Σ, and consider any contiguous substrand s in C. We call s a σ-blocking range if it satisfies three conditions. Firstly, all edges of the substrand are interacting (in the sense of the previous paragraph). Secondly, either the substrand contains the first edge of its strand, or the edge preceding the first edge of the substrand is blocked. Thirdly, the last edge of the substrand is labeled with σ. Now we define blockfrom(C, σ) to be the complex obtained from C by adding to blocked all edges appearing in some σ-blocking range.

Cleanup. The cleanup operator undoes matchings and blockings and removes all strands except for the longest positive strands. Here we assume the condition that every positive strand in C has length at least three, and has at least one interacting edge; if C does not satisfy this condition, cleanup(C) is not defined. Otherwise, cleanup(C) equals the union of all positive strands of C of maximal length; there are no matched and no blocked edges in cleanup(C).

5 Data representation

When we want to represent structured data as sticker complexes, the symbols from the alphabet Σ = Λ ∪ Ω ∪ Θ will be used in different ways. Attributes (Ω) will be used to indicate the structure of the data; tags (Θ) will be used as separators and auxiliary markers in data manipulation. Atomic value symbols (Λ) will be used to represent the actual data entries. However, since Λ is just a finite alphabet typically of small size, we will need to use strings (or vectors) of atomic value symbols to represent data entries, just like words of bits are used in conventional computing to represent data entries like characters or integers.

In analogy to the word length of a conventional computer processor, in our approach we assume some dimension l, a natural number, is known. Then every data entry is encoded by an l-vector of atomic data symbols. Formally, we say that a sticker complex C has dimension l if every edge e labeled by some (positive) atomic value symbol is part of a sequence (e₀, e₁, ..., eₗ, eₗ₊₁) of l + 2 consecutive edges, where e₀ is labeled #3; each eᵢ for i = 1, ..., l is labeled with a positive atomic value symbol; and eₗ₊₁ is labeled #4. So, e is one of the eᵢ's with i ∈ {1, ..., l}. We call (e₀, e₁, ..., eₗ, eₗ₊₁) an l-vector in C. A complex of dimension l is also called an l-complex.
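The two blocking operations of Section 4 can be illustrated on the simplest possible input, a single linear strand. The sketch below is a restricted illustration rather than a full implementation: it assumes the strand is given as a list of edge labels, with blocked and matched edges given as sets of positions, and the function names simply mirror the operation names.

```python
def block(labels, blocked, sigma):
    """block(C, sigma) on one linear strand: block every edge labeled sigma."""
    return set(blocked) | {i for i, lab in enumerate(labels) if lab == sigma}

def blockfrom(labels, blocked, matched, sigma):
    """blockfrom(C, sigma) on one linear strand: block every sigma-blocking range."""
    def interacting(i):
        return i not in blocked and i not in matched

    result = set(blocked)
    for end, lab in enumerate(labels):
        if lab != sigma or not interacting(end):
            continue
        # Extend the candidate range to the left over interacting edges.
        start = end
        while start > 0 and interacting(start - 1):
            start -= 1
        # The range must begin at the first edge or directly after a blocked edge.
        if start == 0 or (start - 1) in blocked:
            result.update(range(start, end + 1))
    return result

# A strand with two attribute-value pairs of dimension 2.
strand = ["#2", "A", "#3", "0", "1", "#4", "#2", "B", "#3", "1", "1", "#4"]
after_block = block(strand, set(), "A")
after_blockfrom = blockfrom(strand, after_block, set(), "B")
```

Blocking A first and then applying blockfrom with B blocks everything from A up to and including B, and leaves the leading #2 and the trailing #3 1 1 #4 interacting.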
We also introduce an additional blocking operator on l-complexes. Let n be a natural number and let C be a complex satisfying the following conditions:
1. C is an l-complex with l ≥ n;
2. in every l-vector in C, either all edges are blocked or no edge is blocked;
3. C is equivalent to hybridize(C).
Then blockexcept(C, n) equals the complex obtained from C by blocking, within each l-vector (e₀, e₁, ..., eₗ, eₗ₊₁) that is not yet blocked, all edges except eₙ. If (C, n) does not satisfy the conditions above, then blockexcept(C, n) is undefined.

6 Implementation in DNA

In this section, we argue that the abstract sticker complexes and the operations on them presented above can be implemented by real DNA complexes. Our discussion remains theoretical as we have not performed laboratory experiments. On the one hand, our main purpose is to make the abstract model plausible as a theoretical framework in which the possibilities and limitations of DNA computing as a database model can be studied; on the other hand, we use only rather standard biotechnological techniques.

Each component of an abstract complex is represented by a large surplus of duplicate copies in DNA. Each positive alphabet symbol from Σ is implemented by a strand of (single-stranded) DNA, such that the resulting set of DNA strands forms a set of DNA codewords [8, 21, 24]. If the DNA strand for a symbol a ∈ Σ is w, then the DNA strand for the complementary symbol ā is, naturally, the Watson-Crick complementary strand to w. Then, matching of edges by µ in an abstract complex is implemented by base pairing in the DNA complex. We will see below how blocking is implemented. Immobilization is implemented as is standard in DNA computing by attachment to surfaces [13] or magnetic beads.

The union operation amounts to mixing two test tubes together.

The difference C₁ − C₂ of complexes can be implemented by a subtractive hybridization technique [10]. Let C₁ (C₂) be stored in test tube t₁ (t₂). Because all strands in t₂ end in #4, we can easily append #5 to them. Next we add to t₂ an abundance of immobilized short primers complementary to #5. Using polymerase we obtain complements to all strands in t₂, still immobilized, so that it is now easy to separate them. It remains to use these complements to remove all strands from t₁ that occurred in t₂. Since all strands have the same length, partial hybridization, leading to false removals, can be avoided by using a very precise melting temperature based on the precise length of the strands.

Hybridization happens naturally and is merely controlled by temperature. Still, we must argue that the result satisfies the definition of sticker complex. The only peculiarity in this respect is the requirement that each component can contain at most one immobilized edge. Since immobilized edges are implemented by strands affixed to surfaces, implying some minimal distance between such strands, it seems reasonable to assume that the large majority of hybridization reactions will occur among freely floating strands, or between freely floating and immobilized ones.

Recursion-free hybridization is very hard to control by nature.
It will be the responsibility of the algorithm designer to design DNAQL programs (see Section 7) that, on the intended inputs, will apply hybridize only to inputs that have recursion-free hybridization. Our simulation of the relational algebra in DNAQL (see Section 8) is well-defined in this sense.

Splitting is achieved as usual by restriction enzymes. A feature of our abstract model is that we require only five recognition sites (Table 1). Of course, these recognition sites will have to be integrated in the DNA codeword design.

Blocking is implemented by making strands double-stranded, so that they cannot be involved in later hybridizations. The ordinary block operation can be implemented by adding the appropriate primer, which will anneal to the desired substrands, thus blocking the corresponding edges. As in the Sanger sequencing method, however, the base at the 3′ end of the primer is modified to its dideoxy-variant. In this way unwanted interaction with polymerase from possible later blockfrom operations is avoided. Indeed, blockfrom is implemented using polymerase.

For the blockexcept operation to work, we need to adapt the implementation of l-vector strands #3 v₁ ... vₗ #4, with vᵢ ∈ Λ for i = 1, ..., l, by introducing additional markers φᵢ, so that we get #3 φ₁ v₁ ... φₗ vₗ #4. These l additional markers must be part of the set of codewords. We can then implement blockexcept(·, n) by the composition block(·, #3); blockfrom(·, φₙ₋₁); block(·, φₙ₊₁); blockfrom(·, #4).

The cleanup operation starts by denaturing (warming up) the tube. Immobilized strands are removed from the tube. Next gel electrophoresis is carried out to separate the longest DNA molecules from the other molecules. Thanks to the conditions we have imposed on inputs to cleanup, the result of this separation is either empty or consists of positive DNA molecules.

7 DNAQL

In this section we define a limited functional programming language, DNAQL, for expressing functions from l-complexes to l-complexes. A crucial feature of DNAQL is that the same program can be applied uniformly to complexes of any particular dimension l. DNAQL is not computationally complete, as it is meant as a query language and not a general-purpose programming language. The language is based on the operations on complexes introduced earlier, and adds to this the following features: some distinguished constants; an emptiness test (if-then-else); let-variable binding; counters that can count up to the dimension of the complex; and a limited for-loop for iterating over a counter.

The syntax of DNAQL is given in Figure 4. Note that expressions can contain two kinds of variables: variables standing for complexes, and counters, ranging from 1 to the dimension. Complex variables can be bound by let-constructs, and counters can be bound by for-constructs. The free (unbound) complex variables of a DNAQL expression stand for its inputs. A DNAQL program is a DNAQL expression without free counters. So, in a program, all counters are introduced by for-loops.

⟨expression⟩ ::= ⟨complexvar⟩ | ⟨foreach⟩ | ⟨if⟩ | ⟨let⟩ | ⟨operator⟩ | ⟨constant⟩
⟨foreach⟩ ::= for ⟨complexvar⟩ := ⟨expression⟩ iter ⟨counter⟩ do ⟨expression⟩
⟨if⟩ ::= if empty(⟨complexvar⟩) then ⟨expression⟩ else ⟨expression⟩
⟨let⟩ ::= let x := ⟨expression⟩ in ⟨expression⟩
⟨operator⟩ ::= ((⟨expression⟩) ∪ (⟨expression⟩)) | ((⟨expression⟩) − (⟨expression⟩))
    | hybridize(⟨expression⟩) | ligate(⟨expression⟩) | flush(⟨expression⟩)
    | split(⟨expression⟩, ⟨splitpoint⟩)
    | block(⟨expression⟩, Σ) | blockfrom(⟨expression⟩, Σ)
    | blockexcept(⟨expression⟩, ⟨counter⟩) | cleanup(⟨expression⟩)
⟨constant⟩ ::= Σ⁺ | (Σ̄ − Λ̄)(Σ̄ − Λ̄) | immob(Σ̄) | leftboot | rightboot | empty
⟨splitpoint⟩ ::= #2 | #3 | #4 | #6 | #8

Figure 4: Syntax of DNAQL.

Figure 5: Left- and right-boot-shaped complexes.

The constants have the following meaning as particular complexes:
• A word w ∈ Σ⁺ stands for a single, linear, positive strand that spells the word w.
• A two-letter word āb̄, for a, b ∈ Σ − Λ, stands for a single, linear, negative strand of length two of the form 1 -ā→ 2 -b̄→ 3.
• immob(ā), for a ∈ Σ, stands for a single, negative, immobilized edge labeled ā.
• leftboot and rightboot are illustrated in Figure 5.
• empty stands for the empty complex, i.e., the complex with the empty set of nodes.

The semantics of a DNAQL expression e is defined relative to a context consisting of a dimension l, an l-complex assignment β, and an l-counter assignment γ. An l-complex assignment is a mapping from complex variables to l-complexes; an l-counter assignment is a mapping from counters to {1, ..., l}. Naturally, β must be defined on all free variables of e, and γ must be defined on all free counters of e. Within such a context, the expression can evaluate to an l-complex, denoted by [e]^l(β, γ). The semantic rules that define this evaluation are shown in Figure 6. The superscript l has been omitted to reduce clutter. The rules for let and for use the oft-used notation f[x := u] to denote the mapping f updated so that x is mapped to u. Because the operations on complexes are not always defined, the evaluation may fail, so [e]^l(β, γ) may be undefined. When e is a program, we denote [e](β, ∅) simply by [e](β).

Each rule below is written in the form "premises ⟹ conclusion".

• x is a complex variable ⟹ [x](β, γ) = β(x)
• [e₁](β, γ) = C₁ and [e₂](β, γ) = C₂ ⟹ [e₁ ∪ e₂](β, γ) = C₁ ∪ C₂
• [e₁](β, γ) = C₁ and [e₂](β, γ) = C₂ and C₁ − C₂ is well-defined ⟹ [e₁ − e₂](β, γ) = C₁ − C₂
• [e′](β, γ) = C′ ⟹ [hybridize(e′)](β, γ) = hybridize(C′)
• [e′](β, γ) = C′ ⟹ [flush(e′)](β, γ) = flush(C′)
• [e′](β, γ) = C′ ⟹ [ligate(e′)](β, γ) = ligate(C′)
• [e′](β, γ) = C′ and σ ∈ {#2, #3, #4, #6, #8} ⟹ [split(e′, σ)](β, γ) = split(C′, σ)
• [e′](β, γ) = C′ and block(C′, σ) is well-defined ⟹ [block(e′, σ)](β, γ) = block(C′, σ)
• [e′](β, γ) = C′ and blockfrom(C′, σ) is well-defined ⟹ [blockfrom(e′, σ)](β, γ) = blockfrom(C′, σ)
• [e′](β, γ) = C′ and i is a counter and blockexcept(C′, γ(i)) is well-defined ⟹ [blockexcept(e′, i)](β, γ) = blockexcept(C′, γ(i))
• [e′](β, γ) = C′ and cleanup(C′) is well-defined ⟹ [cleanup(e′)](β, γ) = cleanup(C′)
• [e₁](β, γ) = C₁ and [e₂](β[x := C₁], γ) = C₂ ⟹ [let x := e₁ in e₂](β, γ) = C₂
• [e₁](β, γ) = C₁ and β(x) is the empty complex ⟹ [if empty(x) then e₁ else e₂](β, γ) = C₁
• [e₂](β, γ) = C₂ and β(x) is not the empty complex ⟹ [if empty(x) then e₁ else e₂](β, γ) = C₂
• [e₁](β, γ) = C₀ and [e₂](β[x := Cₙ₋₁], γ[i := n]) = Cₙ for n = 1, ..., l ⟹ [for x := e₁ iter i do e₂](β, γ) = Cₗ

Figure 6: Semantics of DNAQL.
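The control constructs of the semantics (variable lookup, let, the emptiness test, and the for-loop over a counter) can be sketched as a small recursive evaluator. This is only an illustration under stated assumptions: expressions are encoded as nested Python tuples, the operations on complexes are passed in as opaque functions through a hypothetical `ops` dictionary, and complexes are stood in for by frozensets of words. The `mark` operation in the usage example is a made-up stand-in for a counter-dependent operation such as blockexcept; it is not part of DNAQL.

```python
def evaluate(expr, beta, gamma, dim, ops):
    """Evaluate expr under complex assignment beta, counter assignment gamma, dimension dim."""
    kind = expr[0]
    if kind == "var":                        # ("var", x)
        return beta[expr[1]]
    if kind == "op":                         # ("op", name, [subexpressions], [counters])
        _, name, subexprs, counters = expr
        args = [evaluate(e, beta, gamma, dim, ops) for e in subexprs]
        # An operation may raise an exception to signal that it is undefined.
        return ops[name](*args, *[gamma[i] for i in counters])
    if kind == "let":                        # ("let", x, e1, e2)
        _, x, e1, e2 = expr
        c1 = evaluate(e1, beta, gamma, dim, ops)
        return evaluate(e2, {**beta, x: c1}, gamma, dim, ops)
    if kind == "if_empty":                   # ("if_empty", x, e1, e2)
        _, x, e1, e2 = expr
        branch = e1 if not beta[x] else e2
        return evaluate(branch, beta, gamma, dim, ops)
    if kind == "for":                        # ("for", x, e1, i, e2)
        _, x, e1, i, e2 = expr
        c = evaluate(e1, beta, gamma, dim, ops)          # C_0
        for n in range(1, dim + 1):                      # C_n from C_(n-1), with i := n
            c = evaluate(e2, {**beta, x: c}, {**gamma, i: n}, dim, ops)
        return c
    raise ValueError(f"unknown expression kind: {kind}")

# Stand-in operations on "complexes" modelled as frozensets of words.
ops = {
    "union": lambda a, b: a | b,
    "mark": lambda c, n: frozenset(w + str(n) for w in c),   # hypothetical, for illustration
}
loop = ("for", "x", ("var", "r"), "i", ("op", "mark", [("var", "x")], ["i"]))
looped = evaluate(loop, {"r": frozenset({"w"})}, {}, 3, ops)
```

With dimension 3, the loop body runs for i = 1, 2, 3, each time rebinding x to the previous result, as in the rule for `for`.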

8 Simulation of the relational algebra

Let us first recall some basic definitions concerning the relational data model. Basically we assume a universe U of data elements. A relation schema R is a finite set of attributes. A tuple over R is a mapping from R to U. A relation over R is a finite set of tuples over R. A database schema is a mapping D on some finite set of relation variables that assigns a relation schema to each relation variable. An instance of D is a mapping I on the same set of relation variables that assigns to each relation variable x a relation over D(x).

The syntax of the relational algebra [1] is generated by the following grammar:

e ::= x | (e ∪ e) | (e − e) | (e × e) | σ_{A=B}(e) | π_A(e) | ρ_{A/B}(e)

Here, x stands for a relation variable, and A and B stand for attributes. Our version of the relational algebra is slightly nonstandard in that our version of projection (π) projects away some given attribute, as opposed to the standard projection which projects on some given subset of the attributes.

The semantics of the relational algebra is well known and we omit a formal definition. A relational algebra expression e can be evaluated in the context of some database instance I that is defined on at least the relation variables occurring in e. When the evaluation succeeds, e evaluates to a relation denoted by [e](I). (The evaluation of a relational algebra operator may fail due to mismatches between the attributes present in the argument relations and the attributes expected by the operator [25].)

We want now to represent relations by complexes. We will store data elements as vectors of atomic value symbols. So formally, we use Λ* as our universe U. Then a tuple t (relation r, instance I) is said to be of dimension l if all data elements appearing in t (r, I) are strings of length l.

Let t be a tuple of dimension l over the relation schema R. We may assume a fixed order on the attributes of R, say, A, ..., B. We then represent t by the following l-complex (using the constant notation of DNAQL):

complex(t) = #2 A #3 t(A) #4 ... #2 B #3 t(B) #4.

A relation r of dimension l is then represented by the l-complex ⋃{complex(t) | t ∈ r}, which we denote by complex(r).
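The encoding complex(t) and complex(r) can be written out directly. The sketch below assumes a tuple is a Python dict from attribute names to strings of atomic value symbols, represents a strand by the tuple of symbols it spells, and uses sorted attribute order as the fixed order; the function names are illustrative.

```python
def complex_of_tuple(t, dim):
    """Spell complex(t): #2 A #3 t(A) #4 for each attribute A, in sorted attribute order."""
    word = []
    for attr in sorted(t):
        value = t[attr]
        if len(value) != dim:
            raise ValueError(f"value {value!r} of {attr} does not have length {dim}")
        word += ["#2", attr, "#3", *value, "#4"]
    return tuple(word)

def complex_of_relation(r, dim):
    """complex(r): the set of strands complex(t) for t in r (duplicates collapse)."""
    return {complex_of_tuple(t, dim) for t in r}

relation = [{"A": "01", "B": "11"}, {"B": "11", "A": "01"}, {"A": "10", "B": "00"}]
encoded = complex_of_relation(relation, 2)
```

Because a relation is a set, the two equal tuples in `relation` yield a single strand, which matches the convention that complexes are considered up to minimization.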
Moreover, a database instance I of dimension l can be represented by the l-complex assignment complex(I) that maps each relation variable x (used as a complex variable) to complex(I(x)). We are now in a position to state our main theorem.

Theorem 1. Let some database schema D be fixed. Every relational algebra expression e can be translated into a DNAQL program e^DNA, such that for each natural number l and for each l-dimensional database instance I over D, if [e](I) is defined, then so is [e^DNA]^l(complex(I)), and

    complex([e](I)) = [e^DNA]^l(complex(I))

(up to isomorphism).

For the proof we introduce a few useful abbreviations. For a, b ∈ Σ, we use blockfromto(x, a, b) to abbreviate blockfrom(block(x, a), b). For attributes A and B, we use circularize(x, A, B) to abbreviate

    cleanup(ligate(hybridize(hybridize(blockfromto(x, B, A) ∪ immob(#3)) ∪ #4 #2)))

If x holds a complex of the form complex(r) for some relation r over a schema with first attribute A and last attribute B, then circularize(x, A, B) will equal the complex obtained from x by circularizing every strand [18, 4]. The proof now goes by induction on the structure of e.

Union, difference. If e is e1 ∪ e2, then e^DNA = e1^DNA ∪ e2^DNA. If e is e1 − e2, then e^DNA = e1^DNA − e2^DNA.

Cartesian product. Let e be of the form e1 × e2 with e1 over relation schema R and e2 over disjoint relation schema S. Let A be the first and B be the last attribute of R and let C be the first and D be the last attribute of S. Consider the following DNAQL program e′:

    let x := e1^DNA in
    let y := e2^DNA in
    if empty(x) then empty
    else if empty(y) then empty
    else e4

where e4 is given by the following:

    e4  := cleanup(split(split(blockfromto(e5, B, C), #2), #4))
    e5  := circularize(e6, A, D)
    e6  := cleanup(ligate(hybridize[x6 ∪ x6′ ∪ #5 #1]))
    x6  := cleanup(ligate(hybridize(x ∪ rightboot)))
    x6′ := cleanup(ligate(hybridize(y ∪ leftboot)))

Parts x6 and x6′ attach a unique ending (beginning) to the tuples in r (s). The new tuples are added together, in e6, along with a sticky bridge (#5 #1), resulting in all possible joins of tuples of e1^DNA and e2^DNA. The rest of the expression is concerned with cutting out the #5 #1 piece in the middle of the new chains and getting the old e1^DNA-tuples back in front of the new tuples.

The program e′ is not yet quite correct, however, since we assume that the attributes in complex representations of tuples are ordered in lexicographical order. This order may be disrupted by joining tuples from e1^DNA and e2^DNA. Therefore it is necessary to reorder the attribute-value pairs within each tuple resulting from e′.

Shuffling attribute-value pairs around in a tuple is done using a new technique we call double bridging. Instead of using a single sticky bridge, two sticky bridges are hybridized onto one chain. A careful placement of the bridges allows us to cut twice in the chain without separating parts from the chain. Moreover, the two bridges guide the chain into its new conformation. Next we describe (in outline) a DNAQL program for shuffling some attribute C to the end of a chain.
Assume that A is the first attribute, attribute B occurs just in front of C, C is the attribute that we want to move, D occurs exactly after C and E is the last attribute of the chain. The general outline of the program is:

1. Insert the first marker (#6 #7) between attributes B and C.
2. Insert the second marker (#8 #9) between attributes C and D.
3. Insert the third marker (#9 #1) at the end of the chain.
4. Add the two bridges to the mix: #6 #8 and #1 #7.
5. Cut at #6 and #8 and ligate the resulting complex.
6. Remove the markers from the chains.

An illustration is in Figure 7. A detailed DNAQL program to do these steps will have a similar structure to program e′.
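The net effect of the six steps above can be mimicked on a purely symbolic level. The Python sketch below is not a DNAQL program and makes several assumptions that the outline does not state: a chain is a flat list whose entries are either (attribute, value) blocks or marker names, the cut in step 5 falls just after #6 and just before #8, and a bridge written "#6 #8" joins a fragment ending in #6 to the fragment starting with #8. The function name `shuffle_to_end` is hypothetical.

```python
MARKERS = {"#1", "#6", "#7", "#8", "#9"}


def shuffle_to_end(chain, attribute):
    """Move the block of `attribute` to the end of `chain` (a list of (attribute, value) pairs)."""
    position = [name for name, _ in chain].index(attribute)
    before, moved, after = chain[:position], chain[position], chain[position + 1:]

    # Steps 1-3: insert the three markers.
    strand = before + ["#6", "#7"] + [moved] + ["#8", "#9"] + after + ["#9", "#1"]

    # Step 5 (cutting): assumed to cut just after #6 and just before #8.
    cut_a, cut_b = strand.index("#6") + 1, strand.index("#8")
    first, middle, last = strand[:cut_a], strand[cut_a:cut_b], strand[cut_b:]

    # Step 4 and step 5 (ligating): each bridge tells which fragment follows which.
    bridges = {"#6": "#8", "#1": "#7"}
    by_start = {fragment[0]: fragment for fragment in (first, middle, last)}
    result, current = list(first), first
    while current[-1] in bridges:
        current = by_start[bridges[current[-1]]]
        result.extend(current)

    # Step 6: remove the markers.
    return [token for token in result if token not in MARKERS]


chain = [("A", "00"), ("B", "01"), ("C", "10"), ("D", "11"), ("E", "00")]
shuffled = shuffle_to_end(chain, "C")
```

In this model the two bridges fix the order in which the three fragments are rejoined, so no fragment is lost between the cut and the ligation, which is the point of using two bridges instead of one.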

Figure 7: Illustration of steps 1-3 (top left); step 4 (top right); and step 5 (bottom left, which simplifies to bottom right) described in the proof of simulation of Cartesian product.

Projection. Let e be of the form π_C(e1), where the relation schema of e1 is R. Assume that B is the attribute just in front of C and D is the attribute just after attribute C. In the case that attribute C is the first attribute of the relation schema R, B is the last attribute of R. Likewise, in the case that attribute C is the last attribute of R, D is the first attribute of R. We thus perceive R to be circular. Assume that A and E are the first resp. last attribute of R. We define e^DNA as the following program:

    let x := e1^DNA in
    if empty(x) then empty
    else f1

where

    f1 := cleanup(split(blockfromto(cleanup(ligate(f2)), E, A), #4))
    f2 := circularize(f3, D, B)
    f3 := cleanup(split(blockfromto(cleanup(ligate(f4)), B, D), #4))
    f4 := circularize(x, A, E)

Renaming. Let e be of the form ρ_{C/F}(e1), where R is the relation schema of e1. Simulating renaming involves the following steps:

1. Rotate the chains to get attribute C at the start of each chain.
2. Cut the attribute from the chain, leaving the values of C on the chain.
3. Add the F attribute using stickers.
4. Rotate the chains again to get the first attribute at the start of each chain.

Assume that attribute B occurs just in front of C, D just after C, A is the first attribute of R and E is the last attribute. Then e^DNA is the following program:

    let x := e1^DNA in
    if empty(x) then empty
    else f1

where

    f1 := cleanup(split(blockfromto(f2, E, A), #4))
    f2 := cleanup(ligate(hybridize[f3 ∪ #2 F #4 #2 F #3]))
    f3 := hybridize(split(blockfromto(f4, B, D), #3) ∪ immob(#3))
    f4 := cleanup(split(blockfromto(cleanup(ligate(f5)), B, D), #2))
    f5 := circularize(x, A, E)

This program is not yet fully correct, as attribute F may need to be shuffled into the right place. This can be done by repeatedly applying the shuffle procedure described in the case of Cartesian product.

Selection. Let e be of the form σ_{B=D}(e1), where R is the relation schema of e1. Translating the selection operator requires the most complicated expressions thus far. Assume that relation schema R has A as its first attribute, C following directly behind B, E following directly after D and F the last attribute of the schema. Λ is fixed. The number of atomic value symbols is thus constant; we denote them by v1 to vn. Note that A = B, or C = D, or D = E = F is possible; the program will still function correctly. We define e^DNA as follows:

    let x := e1^DNA in
    if empty(x) then empty
    else for x_s := x iter i do e′

where

    e′  := cleanup(split(blockfromto(let x_c := circularize(x_s, A, F) in e″, F, A), #4))
    e″  := select_D^{v1}(select_B^{v1}(x_c)) ∪ ... ∪ select_D^{vn}(select_B^{vn}(x_c))
    select_B^a(x′) := cleanup(flush(hybridize(e1^a(x′))))
    e1^a(x′) := blockexcept(blockfromto(x′, B, C), i) ∪ immob(a)
    select_D^a(x′) := cleanup(flush(hybridize(e2^a(x′))))
    e2^a(x′) := blockexcept(blockfromto(x′, D, E), i) ∪ immob(a)

9 Concluding Remarks

Many interesting questions remain open. A first issue is that an arbitrary DNAQL program may not evaluate on all possible inputs. We would like to have a type system by which programs can be statically typechecked to be safe on inputs of given types. We would also like to better understand the expressive power of DNAQL. The relational algebra provides a lower bound on this expressive power. What is an upper bound? Can the semantics of DNAQL be defined in first-order logic? What is the computational complexity of DNAQL?
Also, are all operations and constructs of DNAQL really primitive in the language, or can some of them be simulated using the others? Another interesting issue is the relationship between DNAQL and graph grammars. Furthermore, we could consider extensions, or restrictions, of DNAQL, just as has been done for the relational algebra. Extensions can lead to greater expressive power, while restrictions may lead to decidable static verification problems, such as testing the equivalence of DNAQL programs. Finally, while we have gone to great efforts to design an abstraction that is as plausible as possible, it would of course be great if it could be experimentally verified whether DNAQL is workable for practical DNA computing.

References

[1] S. Abiteboul, R. Hull, and V. Vianu. Foundations of Databases. Addison-Wesley, 1995.
[2] L.M. Adleman. Molecular computation of solutions to combinatorial problems. Science, 226:1021-1024, November 1994.
[3] M. Amos. Theoretical and Experimental DNA Computation. Springer, 2005.
[4] M. Arita, M. Hagiya, and A. Suyama. Joining and rotating data with molecules. In Proceedings 1997 IEEE International Conference on Evolutionary Computation, pages 243-248.
[5] D. Boneh, C. Dunworth, R.J. Lipton, and J. Sgall. On the computational power of DNA. Discrete Applied Mathematics, 71:79-94, 1996.
[6] L. Cardelli. Strand algebras for DNA computing. In Deaton and Suyama [9], pages 12-24.
[7] J. Chen, R.J. Deaton, and Y.-Z. Wang. A DNA-based memory with in vitro learning and associative recall. Natural Computing, 4(2):83-101, 2005.
[8] A.E. Condon, R.M. Corn, and A. Marathe. On combinatorial DNA word design. Journal of Computational Biology, 8(3):201-220, 2001.
[9] R.J. Deaton and A. Suyama, editors. Proceedings 15th International Meeting on DNA Computing, volume 5877 of Lecture Notes in Computer Science. Springer, 2009.
[10] L. Diatchenko, Y.F. Lau, et al. Suppression subtractive hybridization: a method for generating differentially regulated or tissue-specific cDNA probes and libraries. Proceedings of the National Academy of Sciences, 93(12):6025-6030, 1996.
[11] R.M. Dirks and N.A. Pierce. Triggered amplification by hybridization chain reaction. Proceedings of the National Academy of Sciences, 101(43):15275-15278, 2004.
[12] N. Immerman. Descriptive Complexity. Springer, 1999.
[13] Q. Liu, L. Wang, et al. DNA computing on surfaces. Nature, 403:175-179, 1999.
[14] R.B. Lyngsø. Complexity of pseudoknot prediction in simple models. In J. Díaz et al., editors, Proceedings 31st International Colloquium on Automata, Languages and Programming, volume 3142 of Lecture Notes in Computer Science, pages 919-931. Springer, 2004.
[15] U. Majumder and J.H. Reif. Design of a biomolecular device that executes process algebra. In Deaton and Suyama [9], pages 97-105.
[16] G. Paun, G. Rozenberg, and A. Salomaa. DNA Computing. Springer, 1998.
[17] R. Ramakrishnan and J. Gehrke. Database Management Systems. McGraw-Hill, 2002.
[18] J.H. Reif. Parallel biomolecular computation: models and simulations. Algorithmica, 25(2-3):142-175, 1999.
[19] J.H. Reif et al. Experimental construction of very large scale DNA databases with associative search capability. In N. Jonoska and N.C. Seeman, editors, Proceedings 7th International Meeting on DNA Computing, volume 2340 of Lecture Notes in Computer Science, pages 231-247. Springer, 2002.
[20] G. Rozenberg and H. Spaink. DNA computing by blocking. Theoretical Computer Science, 292:653-665, 2003.
[21] J. Sager and D. Stefanovic. Designing nucleotide sequences for computation: A survey of constraints. Lecture Notes in Computer Science, 3892:275-289, 2006.
[22] K. Sakamoto et al. State transitions by molecules. Biosystems, 52:81-91, 1999.
[23] G. Seelig, D. Soloveichik, D.Y. Zhang, and E. Winfree. Enzyme-free nucleic acid logic circuits. Science, 315(5805):1585-1588, 2006.
[24] M.R. Shortreed et al. A thermodynamic approach to designing structure-free combinatorial DNA word sets. Nucleic Acids Research, 33(15):4965-4977, 2005.
[25] J. Van den Bussche, D. Van Gucht, and S. Vansummeren. A crash course in database queries. In Proceedings 26th ACM Symposium on Principles of Database Systems, pages 143-154. ACM Press, 2007.
[26] M. Yamamoto et al. Development of DNA relational databases and data manipulation experiments. In C. Mao and T. Yokomori, editors, Proceedings 12th International Meeting on DNA Computing, volume 4287 of Lecture Notes in Computer Science, pages 418-427. Springer, 2006.